Statistics Manual


IFMSA-Egypt Statistics Manual

INDEX

Introduction
Variables
Descriptive Studies
Inferential Statistics: Introduction
Inferential Statistics: Parametric Tests
Inferential Statistics: Non-Parametric Tests
Some Study Design Specific Statistics
Reference
SWG

What is statistics and why do we use it?

Try googling the word statistics, and you will find something along the lines of: statistics is the science of collecting, presenting, and analyzing data collected from a sample to get information about the population. Even though this definition includes the key themes of statistics, it does not do the field justice. Statistics is the discipline that aims to deal with uncertainty and predict what is to come. It is no secret that our world contains an infinity of unknowns, making it, unsurprisingly, unpredictable. Statistics is our attempt to make sense of, and even comprehend, our uncertain reality.

A multitude of misconceptions besets the field of statistics. One such misconception portrays the statistician with a calculator in one hand and a table of numbers in the other, crunching copious amounts of data by hand. This personification of dullness and boredom was accurate only a few decades ago.

The dawn of technology and computers has completely transformed the statistician into one who uses software to analyze data. Humans are continuously discovering and inventing, while simultaneously looking for new ways to employ these discoveries and inventions in statistics.

The field of statistics has an undeserved reputation for lacking creativity, as if it were merely a matter of applying fixed sets of tests to specific study designs. In truth, doing statistics is a complex art. Yes, there are plenty of tests, but these are just your tools. Finally, although statistics is incorporated into every field, we are going to focus on biostatistics in the following sections.

Research Process

In previous sections, you have read extensively about research methodology. Here, we are going to revisit the research process, this time in statistical terms.

Research Problem & Hypothesis

Research is plainly our attempt to understand the world around us. Though we might wish to conduct research on everything at once, that is simply not feasible. Instead, we focus on a specific idea and research it. Ideas usually begin with an observation that leads to the identification of a problem on which we conduct research. We then need to break the problem down into questions: what is the cause, how to diagnose it, etc.; then we choose a specific question to focus on.

INTRODUCTION

Afterwards, we translate this question into a statistically testable statement termed a hypothesis. Depending on the formulated hypothesis, we select the statistical methods to handle it.

Example of Hypothesis: “Earlier deaths are caused by increased exposure to radiation.”

Population and Sample

The aim of research is to reach explanations that apply to entire populations (also known as generalizable facts). Ideally, to achieve this, it would be best to test a prediction on the population in question. Practically, however, it is usually too expensive, unethical, or inefficient to carry out a study on the entire population. Think of trying to find the average blood pressure of an entire country! Although we cannot test the entire population, we can still test a much smaller group of entities from this population by taking a sample. Intuitively, the sample’s values will not equal the population’s: thus, we only “estimate” the population’s value. Statistics then enables us to generalize the result of our sample to our population.

Variables and Analyses

Remember our hypothesis? To test it, we break it down into certain variables that we want to measure. Then we summarize those variables in numbers and graphs, like bar charts, as we will learn in upcoming chapters.

Interpretation of Results

After doing all your statistics, use the resulting data in ways that serve your initial goal of the study. This is the function of inference and inferential statistics, in which we intend to draw conclusions from the results. This occurs through testing our prediction, and either the data supports our prediction, or opposes it. However, what most people, including statisticians, miss is that data has a value beyond just drawing conclusions for our predictions. In medicine, we do research and statistics in order to discover new information that will help us treat patients or reduce suffering in the world. Therefore, we should not view the data separately, but within the context of their application. Accordingly, we can also optimize our statistical analyses and research methodology to serve this application.

In Chapter 2, we will be discussing inferential statistics and how modern science integrates statistics into clinical application.

If there is one thing to understand about statistics, it is that its job is to translate science into numbers, and then numbers back into science


VARIABLES

While generating theories and hypotheses, we must know the importance of our collected data in testing hypotheses or deciding between competing theories. Therefore, we need to decide on two things:

1. What to measure?

2. How to measure it?

Variables are data items whose values can change/vary between individuals or testing conditions (e.g. weight, RBC count, severity of disease, and tumor markers’ concentrations).

On the other hand, the values of the other data type, Constants, are always the same.

Types of Variables

Variables are usually classified according to two standards:

1. The causal relationship.

2. The level of measurement.

The Causal Relationship

With regard to causality, variables are classified into two types: Dependent and Independent variables. You can think of them in terms of cause and effect: an Independent Variable is the variable the researcher considers to be the cause, or to have a significant influence on it, while a Dependent Variable is the effect, or a consequence thereof.

For example, if you plan to study the effects of smoking on lung cancer:

 The Independent Variable is a better fit to represent smoking.

 The Dependent Variable will be representing lung cancer, as it is dependent on smoking.

If there is a third variable in a study examining a potential cause-and-effect relationship, we call that variable a Confounding Factor. For example, a study investigating the association between obesity and heart disease might be “confounded” by age, diet, or smoking status.

It's difficult to separate the true effect of the independent variable from the effect of the confounding variable. Therefore, in your research design, it is important to identify potential confounding variables and plan how you will reduce their impact in your statistical analysis.

Levels of Measurement

Depending on whether you are measuring your variables qualitatively or quantitatively, you should choose a class of variables fitting for the type of data you have.

Categorical Variables are better suited to qualitative data, while Numerical Variables are better suited to quantitative data.

Categorical Variables take Category/Label values and measure the frequencies of each category. They do NOT establish a numeric difference between categories.

Categorical variables are often further classified as either:

Nominal Variables

• Can NOT be logically ordered or ranked.

• Dichotomous (Binary) if there are only two categories (e.g. yes/no or alive/dead).

• Polytomous if there are more than two categories.

• Example: Blood Group.

Ordinal Variables

• Can be logically ordered or ranked.

• Example: Coma Grades (grades 1-5).

Numerical Variables, on the other hand, have a numeric difference between their values and are often further classified as either:

Continuous Variables

• Have infinite intermediate values along a specified interval.

• Measurable; therefore, they have “Units of Measurement.”

• Examples: Blood Pressure, Temperature, Glucose Level.

Discrete Variables

• Countable; they do NOT take infinite intermediate values.

• Examples: Number of Children, Previous MI Attacks.

DESCRIPTIVE STUDIES

What is the Importance of Descriptive Statistics?

Imagine that we have a dataset containing the ages of a thousand people and you need to read them all. It is impractical to read a thousand recorded data points; and even if you did, it would be hard to summarize the important trends in the data and reach a conclusion. Any interesting features in the data would remain hidden from us.

In this chapter, we are going to describe some methods for organizing and presenting the data; so that we can more easily answer our questions of interest. Collectively, these methods are called Descriptive Statistics.

Central Tendency

Central Tendency is a summary statistic that measures the center point of a dataset. These demonstrate where most values in a distribution fall and are often referred to as a distribution's central position. In statistics, the mean, median, and mode are the three most common measures of central tendency. Each of these measures has its special uses, but the mean is the most important average in both descriptive and inferential statistics.

Measure: Mean
Definition: The total of the numbers divided by how many numbers there are.
Example: In the dataset 11, 4, 11, 19, 11, 7, 8, 21, 7 — the total is 11+4+11+19+11+7+8+21+7 = 99; there are 9 numbers; the mean = 99/9 = 11.

Measure: Median
Definition: The number in the middle of the ordered data (the 50th percentile).
Example: In the dataset 18, 12, 4, 26, 14, 18, 5 — arranged from minimum to maximum: 4, 5, 12, 14, 18, 18, 26; the median = 14.

Measure: Mode
Definition: The number that appears the most.
Example: In the dataset 11, 4, 11, 19, 11, 7, 8, 21, 7 — the mode = 11 (it appears three times).
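If you want to reproduce these summaries on a computer, the sketch below (not part of the original manual) uses Python's built-in statistics module on the example dataset above; any statistical package such as SPSS or R would give the same results.

```python
import statistics

data = [11, 4, 11, 19, 11, 7, 8, 21, 7]  # example dataset from the table above

print(statistics.mean(data))    # 11.0 -> total (99) divided by the count (9)
print(statistics.median(data))  # 11   -> middle value after sorting the data
print(statistics.mode(data))    # 11   -> most frequent value (appears 3 times)
```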

Averages for Qualitative and Ranked Data

 The mode is always appropriate for Nominal Data (e.g. Gender and Nationality).

 Percentage and frequency are used to describe Categorical (Qualitative) Data (e.g. 80% of the sample have type 1 diabetes; or 93 out of 112 participants have type 1 diabetes).

The Normal/Gaussian Distribution

Most biological phenomena are normally distributed, meaning that the probability of randomly drawing a value from a dataset measuring one of those phenomena is highest around a certain value on the x axis: the mean of the dataset (its central tendency).

This probability decreases in either direction away from that value along the x axis (the horizontal spread). This is best represented by the Gaussian Curve (see Figure 1).

Using this principle to describe datasets hints at their characteristics. For example, most of the population is located at the center where the blood pressure is normal, while a small portion of them is located at the right segment of the curve (hypertensive) and at the left segment (hypotensive).

Gaussians have defining properties that allow us to use them to describe the data efficiently.

These properties are all incorporated in the mathematical equation of their curve, which always has a symmetrical bell-shaped form describing a continuous variable, peaking at the point midway along the horizontal spread and then tapering off gradually in either direction from the peak (without actually touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely). The last stunning feature of Gaussians is that the mean, median, and mode all have the same value, located midway along the horizontal spread.

Figure 1. The Gaussian Curve. Chart is Public Domain.

Dispersion

Variability refers to how a set of data is spread out and provides a way to explain how sets of data differ. In a data set, the four main ways to explain variability are: range, interquartile range, variance, and, most importantly, standard deviation.

Range

The range is the interval between the largest and smallest scores.

Example: In the two datasets below, Dataset One has a range from 20 to 38 while Dataset Two has a range from 11 to 52. Dataset Two has a broader range and, hence, more variability than Dataset One (even though both datasets have the same mean).

Dataset One: 20, 21, 22, 25, 26, 29, 33, 34, 38
Dataset Two: 11, 16, 19, 20, 23, 32, 34, 41, 52

Interquartile Range

The Interquartile Range (IQR) is almost the same as the range, only instead of stating the range for the whole data set, you’re giving the amount for the “middle fifty”. It’s sometimes more useful than the range because it tells you where most of your values lie. The formula is IQR = Q3 – Q1, where Q3 is the third quartile (75th percentile) and Q1 is the first quartile (25th percentile).

Example: Figure 2 shows the IQR, represented by the box. The whiskers (the lines coming out from either side of the box) extend down to the minimum and up to the maximum, covering the lowest and highest quarters of the data.

Figure 2. Box and Whisker Plot of the IQR. Image is Public Domain.
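As a small illustration (not part of the original manual), here is a minimal Python sketch computing the range and the IQR with numpy; the numbers are purely illustrative.

```python
import numpy as np

data = np.array([11, 4, 11, 19, 11, 7, 8, 21, 7])  # illustrative values only

q1, q3 = np.percentile(data, [25, 75])  # first (25th) and third (75th) quartiles
iqr = q3 - q1                           # IQR = Q3 - Q1, the "middle fifty"
data_range = data.max() - data.min()    # range = largest score - smallest score
print(q1, q3, iqr, data_range)
```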

Standard Deviation

The Standard Deviation (SD) tells you how closely the data is located around the mean. A small SD means that your data is closely grouped, leading to a taller bell curve; while a large SD informs you that your data is more spread apart. See Figure 3 for graphic interpretation of the distribution of data as a factor of the SD.

Practice Practice Practice!

After an IQ test, suppose I ask you where a score of 90 falls within the test's distribution.

Undoubtedly, you will have no answer; since 90 is just a raw score and you weren’t provided with any other information about the distribution.

On the contrary, if we told you that the mean of this test is 75, you would intuitively determine that your score is higher than the average of the test takers. In numbers, when you subtracted the mean from your score and obtained a positive difference, you concluded that your score is greater than the mean.

But before you get too proud of yourself, you need to figure out the position of your score from the mean. Effectively, knowing your position from the mean helps you know your position within the distribution. Now you see where you are located among your peers!

Let’s try the same subtraction with other scores, say a score of 50. Subtract the mean (75) from it and the result is a negative number (-25). Thus, your score is less than the mean. How about a score of 110 on this test? Again, you subtract the mean; the difference is positive (35), so the score is greater than the mean.

In these three exercises, we have established that the sign of the difference indicates the direction of your score from the mean. Moreover, the magnitude of the difference denotes the distance from the mean when the data is normally distributed. To put things into context, the differences from the mean we just calculated are called Deviation Scores, and their equation is (X - µ).

Figure 3. Majority Scores as a factor of Standard Deviations. Image is Public Domain.

The X represents your raw score (the datapoint you are calculating for), while the µ represents the population mean.
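A tiny sketch of the same arithmetic (added here for illustration, using the example scores above):

```python
mu = 75                      # population mean of the IQ test (from the example)
for score in (90, 50, 110):  # raw scores discussed above
    deviation = score - mu   # deviation score: X - mu
    print(score, deviation)  # 90 -> +15, 50 -> -25, 110 -> +35
```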

Describing Data with Charts

To see “what’s going on with the data,” an appropriate chart is almost always a good idea.

A chart will often reveal previously unsuspected features of the data. Which chart is appropriate depends primarily on the type of data you are dealing with, as well as on the particular features of it you want to explore. In addition, a chart can often illustrate or explain a complex situation for which simple text, or a table, might be unorganized or too long.

What makes a good graph?

Edward Tufte set out the foundations of elegant data presentation with some principles for good data visualization:

1. Show the Data.

2. Direct the Reader to Think About the Data Being Presented.

3. Avoid Distorting the Data.

4. Present Many Numbers, With Minimum Ink.

5. Make Large Datasets Coherent.

6. Encourage the Eye to Compare Different Pieces of Data.

7. Reveal the Underlying Message of the Data.

We also need to avoid chart junk: do not use patterns, 3-D effects, or shadows. Also, do not hide effects or create false impressions of what the data show.

We can use different charts for different data or hypotheses. A Bar Chart is a graph with rectangular bars (see Figure 4). The graph usually compares different categories. Usually, the horizontal (x) axis represents the categories and the vertical (y) axis represents a value for those categories.


Another type is the Pie Chart (see Figure 5): a circular pie split into sectors, one for each category, so that the area of each sector is proportional to the frequency of that category in the dataset.


Figure 4. Bar Chart Example (categories: Red, Blue, Green, Yellow, Purple). Chart is Public Domain.

Figure 5. Pie Chart Example (Pet Ownership: Dogs, Cats, Fish, Rabbits, Rodents). Chart is Public Domain.

The last type of common charts we are going to talk about is the Histogram (see Figure 6).


It is similar to a bar chart, but there should be no gaps between the bars as the datapoints are continuous. The width of each bar of the histogram relates to a range of values for the variable. The histogram should be labeled carefully to make it clear where the boundaries lie.

Figure 6. Histogram Example (arrivals per minute vs. frequency). Chart is Public Domain.

INFERENTIAL STATISTICS INTRODUCTION

Generalizability

Back again to the question posed at the introduction of this part, but now with some definitions. If we would like to study how many pupils have Type 1 Diabetes, we call the pupils our Target Population.

Again, do you think we have enough money and time to test every pupil’s blood glucose? Of course not.

Therefore, we take a Sample. Now, it is much easier to test them and get a result that characterizes this sample, called a Sample Statistic. We then assume that what is true of the sample will also be true of the target population. Thereafter, based on the sample statistic, we make informed guesses, Inferences, to get conclusions about the whole population, expressed in an Interval Estimate.

This branch of statistics is called Inferential Statistics.

In simpler words, Sample Statistics are numbers that summarize data from a sample, i.e. some subset of the entire population, while an Interval Estimate is a range of values, calculated from those statistics, that summarizes what we can say about the entire population.

Central Limit Theorem Means Distribution

Typically, we collect data from a sample, and then we calculate the mean of that one sample. Now, imagine that we repeat the study many times on different samples and collect the same sample size for each one. Then, you calculate the mean for each of these samples and graph them on a histogram. That histogram will display the distribution of all samples’ means, which statisticians refer to as the sampling distribution of the means. (See Figure 1)


In the histogram, note that the larger the size of each sample, the more the means’ distribution looks like a normal distribution (compare n=5 to n=20). This principle is called the Central Limit Theorem, which states that:

“Given a sufficiently large sample size, the sampling distribution of the mean for a variable will approximate a normal distribution regardless of that variable’s distribution in the population.”

Fortunately, we do not have to repeat studies many times to estimate the sampling distribution of the mean. Statistical procedures can help us estimate that from a single random sample. The takeaway is that with a larger sample size, your sample mean is more likely to be close to the real population mean, making your estimate more precise.

Therefore, understanding the central limit theorem is crucial when it comes to trusting the validity of your results and assessing the precision of your estimates.
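To make the theorem concrete, here is a small simulation sketch (added for illustration, not from the original manual). It draws many samples from a deliberately non-normal population and shows that the distribution of the sample means is nevertheless approximately normal.

```python
import numpy as np

rng = np.random.default_rng(0)
# A clearly non-normal (right-skewed) population: exponential values
population = rng.exponential(scale=10.0, size=100_000)

# Draw many samples of size n and record each sample's mean
n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(2_000)]

# The histogram of these means is approximately normal, centred on the
# population mean, even though the population itself is skewed.
print(np.mean(population), np.mean(sample_means), np.std(sample_means))
```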

Normality Assumption

Part of the definition of the central limit theorem is “regardless of the variable’s distribution in the population.” This part is vital. In a population, the values of a variable can follow different probability distributions. These include normal, left-skewed, right-skewed, and uniform distributions, among others.

Figure 1. Sampling Distribution of the Means. Image is Public Domain.

The central limit theorem applies to almost all types of probability distributions. The fact that sampling distributions can approximate a normal distribution has critical implications.

In statistics, the normality assumption is vital for parametric hypothesis tests of the mean, such as the t-test.

Consequently, you might think that these tests are not valid when the data are not normally distributed. However, if your sample size is large enough, the central limit theorem kicks in and produces sampling distributions that approximate a normal distribution, even if your data are not normally distributed.

Standard Errors

Striving for feasibility while conducting research forces you to pick a small sample from the whole population to get your measurements. Unfortunately for everyone but statisticians, almost certainly, the sample mean will vary from the actual population mean. This is when the Standard Error (SE) comes into play.

The magnitude of the standard error gives an index of the precision of the estimate. It decreases as the sample size increases (it is inversely proportional to the square root of the sample size), meaning that smaller samples tend to produce greater standard errors.
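A minimal sketch of the calculation SE = SD / √n (added for illustration; the measurements are made up):

```python
import numpy as np

sample = np.array([120, 135, 118, 142, 128, 131, 125, 138])  # e.g. blood pressures

sd = sample.std(ddof=1)            # sample standard deviation
se = sd / np.sqrt(len(sample))     # standard error of the mean = SD / sqrt(n)
print(round(se, 2))
```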

Confidence Intervals

The Confidence Interval (CI) is a range of values we are fairly sure our true value lies in. A 95% CI indicates that if we, for example, repeat our experiment a hundred times, we will find that the true mean will lie between the upper and the lower limits in ninety-five of them.

So how do we know if the sample we took is one of the "lucky" 95%? Unless we get to measure the whole population, we simply do not know. That is one of the risks of sampling.

Point Estimate vs Interval Estimate

Suppose a group of researchers is studying the levels of hemoglobin in anemic male patients.

The researchers take a random sample from the target population and establish a mean level of 11 gram/dl. The mean of 11 gram/dl is a Point Estimate of the population mean.

A point estimate by itself is of limited usefulness because it does not reveal the uncertainty associated with the estimate; you do not have a good sense of how far away this 11 gram/dl sample mean might be from the population mean.

Therefore, Confidence Intervals provide more information than Point Estimates. By establishing a 95% confidence interval (approximately the Mean ± 2SE), using the properties of the normal distribution (the bell curve), the researchers arrive at upper and lower bounds that contain the true population mean with 95% confidence.

Confidence Intervals do not indicate how many of your datapoints fall between the upper and lower bounds; they indicate how confident you are that the interval captures the true population value.
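As an illustration of the hemoglobin example above, here is a minimal sketch of an approximate 95% confidence interval (mean ± 1.96 SE); the data are invented for demonstration only.

```python
import numpy as np

hb = np.array([10.2, 11.5, 10.8, 11.9, 10.4, 11.2, 11.6, 10.9, 11.3, 11.2])
mean = hb.mean()                        # point estimate of the population mean
se = hb.std(ddof=1) / np.sqrt(len(hb))  # standard error of the mean

lower, upper = mean - 1.96 * se, mean + 1.96 * se   # approximate 95% CI
print(f"{mean:.2f} g/dl (95% CI {lower:.2f} to {upper:.2f})")
```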

Effect Size and Meta-Analyses

Expressing Differences

A usual scenario conducted in clinical trials on drugs is to give a group of patients the active drug (i.e. the real drug) and give the other group a placebo. The difference between the effect of the real drug and the effect of placebo is an example of the concept of Effect Size.

Effect Size is a statistical concept to quantify the difference between groups in different study conditions, therefore, quantifying the effectiveness of a particular intervention relative to some comparison. It allows us to move beyond the simple statistical significance of: “Does it work?” to “How well does it work in a range of contexts?”.

Expanding on the concept, the p-value can tell us that it is statistically significant that both ACE Inhibitors and β-Blockers can treat hypertension. However, it does not inform us which one is better; therefore, we use Effect Size.

Beyond the Results

One of the main advantages of using effect size is that when a particular experiment has been replicated, the different effect size estimates from each study can easily be combined to give an overall best estimate of the size of the effect.

This process of combining experimental results into a single effect size estimate is known as meta-analysis.

Meta-analyses, however, can do much more than simply producing an overall “average” effect size. If, for a particular intervention, some studies produced large effect sizes and others produced small effect sizes, would a meta-analysis simply combine them and say that the average effect was “medium”? Definitely not.

A much more useful approach would be to examine the primary studies for any differences between those with the large and the small effects, and then to try to understand what factors might account for the difference. The most effective meta-analysis, therefore, involves seeking relationships between effect sizes and characteristics of the intervention (e.g. context and study design).


Hypothesis Testing

Alternative Hypothesis & Null Hypothesis

Let’s say we have a group of 20 patients who have high blood pressure at the beginning of a study (time 1). They are divided into two groups of 10 patients each.

One group is given daily doses of an experimental drug meant to lower blood pressure (experimental group); the other group is given daily doses of a placebo (placebo group).

Then, blood pressure in all 20 patients is measured 2 weeks later (time 2).

A term that now enters the game is the Null Hypothesis (H0): the statement that no relationship or difference exists between the two variables or groups being compared (in our example, the experimental and placebo groups).

Following statistical analysis, the null hypothesis can be rejected or not rejected. If, at time 2, patients in the experimental group show blood pressures similar to those in the placebo group, we can not reject the null hypothesis (i.e., no significant relationship between the two groups in the experimental vs the placebo conditions).

If, however, at time 2, patients in the experimental group have significantly lower or higher blood pressures than those in the placebo group, then we can reject the null hypothesis and move towards proving the alternative hypothesis.

Therefore, an Alternative Hypothesis (H1) is a statement based on inference, existing literature, or preliminary studies that point to the existence of a relationship between groups (e.g. difference in blood pressures). Note that we never accept the null hypothesis: We either reject it or fail to reject it. Saying we do not have sufficient evidence to reject the null hypothesis is not the same as being able to confirm its truth (which we can not). After establishing H0 & H1, hypothesis testing begins. Typically, hypothesis testing takes all of the sample data and converts it to a single value: The Test Statistic.

Type I & Type II Errors

The interpretation of a statistical hypothesis test is probabilistic, meaning that the evidence of the test may suggest an outcome, and as there is no way to find out if this outcome is the absolute truth, we may be mistaken about it. There are two types of those mistakes (i.e. errors), and they both arise from what we do with the Null Hypothesis.

A Type I error occurs when the null hypothesis is falsely rejected in favor of the alternative hypothesis (i.e. False Positive: the drug does not lower blood pressure and you say it does).

On the other hand, a Type II error occurs when the null hypothesis is not rejected, although it is false (i.e. False Negative: the drug lowers blood pressure and you say it does not).


These errors are not deliberately made. A common example is that there may not have been enough power to detect a difference when it really exists (False Negative).

Statistical Power & Sample Size

When previous studies hint that there might be a difference between two groups, we do research to detect that difference and describe it. Our ability to detect this difference is called Power. Just as increasing the power of a microscope makes it easier to detect what is going on in the slide, increasing statistical power allows us to detect what is happening in the data.

There are several ways to increase statistical power, and the most common is to increase the sample size; the larger the sample size, the more data available, and therefore the higher the certainty of the statistical models. Studies should be designed to include a sufficient number of participants in order to adequately address the research question. The issue of sample size, however, is not only a statistical one.

Yes, an inadequate number of participants will lower the power to make conclusions.

However, having an excessively large sample size is also unethical, because of:

• Unnecessary risk imposed on the participants.

• Unnecessary waste of time and resources.

P-Value & Significance

After getting results and assessing where we go from the null hypothesis, we consider how much these results can help us support the alternative hypothesis. This is subject to a degree of probability that can be quantified using a P-Value.

Simply put, the P-value is the probability of getting results at least as extreme as the ones your study found purely by chance (i.e. if the null hypothesis were true).

Therefore, the smaller your p-value, the lesser the probability that your results are there by chance, which means the stronger the evidence to reject the null hypothesis in favor of the alternate hypothesis.

A p-value ≤ 0.05 is usually the threshold for Statistical Significance. A p-value > 0.05 means there is NO Statistical Significance.

However, the p-value relates to the truth of the Null Hypothesis. Do not make the mistake of thinking that this statistical significance necessarily means that there is a 95% probability that your alternative hypothesis is true. It just means that the null hypothesis is most probably false, which means that either your alternative hypothesis is true, or that there is a better explanation than your alternative hypothesis.

Statistical Significance & Clinical Significance


Medical research done in labs around the world should always find its way to patients, an important concept known as bench-to-bedside. The clinical importance of our treatment effect and its effect on the patients’ daily life is named Clinical Significance (in contrast to statistical significance).

Three scenarios may arise from the interplay between clinical and statistical significance. Below we identify the meaning of each scenario for your intervention:

• Clinical Significance Only: Unfortunately, you just failed to identify an intervention that might have saved a lot of lives. This happens most often when your study is underpowered due to a small sample size. Try it again with more people.

• Statistical Significance Only: Also unfortunately, you saw a difference when there was actually none. This happens most often when your study is overpowered due to a very large sample size (the smallest, trivial differences between groups can become statistically significant). A finding may also be statistically significant without any clinical importance to patients.

• Both Types of Significance Exist: Congratulations! There is an important, meaningful difference between your intervention and the comparison, and this difference is supported by statistics.


INFERENTIAL STATISTICS PARAMETRIC TESTS

Inferential statistics is not a straightforward task. Because we can never know for sure the true underlying state of the population (i.e. reality), we always have to make some assumptions.

These assumptions are not just lucky guesses, but very specific and calculated steps that help us move forward with our data analysis in the most efficient way (inferential statistics is all about looking beyond the data).

One of the main uses of these assumptions is to help us differentiate between two very different procedures in inferential statistics: Parametric and Nonparametric tests.

Parametric Tests

Parametric Tests are the stricter version of inferential statistics, in which researchers make numerous strong assumptions about the variables.

Some of the common assumptions for the parametric tests include: Normality, Randomness, Absence of Outliers, Homogeneity of Variances and Independence of Observations.

We will only talk about the most important three assumptions due to the scope of this introductory manual.

Parametric Tests Assumptions

Normality

The most important distinction that helps you with choosing the type of test you will use is whether the results for the variable you are measuring are normally distributed.

This can be done either Graphically, by plotting the data and looking at the graph; or Analytically, using one of the common tests (e.g. Shapiro–Wilk Test or Kolmogorov–Smirnov Test).
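As an illustration of the analytical route (added here as a sketch, not the manual's own procedure), the Shapiro-Wilk test is available in Python's scipy; the sample below is simulated.

```python
from scipy import stats
import numpy as np

rng = np.random.default_rng(1)
sample = rng.normal(loc=120, scale=15, size=40)   # simulated measurements

stat, p = stats.shapiro(sample)   # Shapiro-Wilk test of normality
# H0: the data come from a normal distribution.
# p > 0.05 -> no evidence against normality; p <= 0.05 -> normality is doubtful.
print(stat, p)
```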

Absence of Outliers

After checking for normality, researchers review their dataset for the presence of Outliers. An Outlier is a datapoint that just does not fit with the rest of the dataset (significantly away from the rest of the observations).

These datapoints primarily exist for one of two reasons:

• High Variability, in which case the datapoint cannot simply be excluded from the dataset during the analysis.

• Experimental Error, which can sometimes be a reason to exclude datapoints from the dataset.

Homogeneity of Variances

This fancy statistical assumption simply means that if you are comparing two groups, they must have equal variances (recall that variance is SD²). Otherwise, the analysis can become biased: since the results should apply to the whole population, both samples should represent that population, and comparable variances help ensure this is the case.

Independent/Two-sample T-test

Used to compare the difference in a variable between two Separate populations or groups (e.g. one group that took caffeine and another that took a placebo).

Independent T-test Hypotheses

1. H0: The two samples come from the same population (same mean). This would usually mean that the two populations sampled had no true difference in reality.

2. H1: The two samples come from two different populations (different means).

This may mean that the measured effect size reflects a true difference in reality (the caffeine does differ from the placebo.)

Independent T-test Example

A study of the effect of caffeine on muscle metabolism used eighteen male volunteers each of whom underwent arm exercise tests. Half of the participants were randomly selected to take a capsule containing pure caffeine one hour before the test.

The other half received a placebo capsule at the same time. During each exercise, the subjects’ ratio of CO2 produced to O2 consumed was measured (RER), an indicator of whether energy is being obtained from carbohydrates or fats.

The question of interest to the experimenters was whether, on average, caffeine changes RER. The two populations being compared are “men who have not taken caffeine” and “men who have taken caffeine”.

If caffeine has no effect on RER the two sets of data can be regarded as having come from the same population.

CI and Significance of the Results


1. 95% CI for the Effect Size: (-0.4, 13.1)

2. Sig. = 0.063

Interpretation:

1. No Statistical Significance for the Effect Size (p-value/sig. > 0.05).

2. The 95% Confidence Interval passes through zero (the value of no difference, i.e. the Null Value); therefore, the null hypothesis cannot be rejected, and it may be concluded that caffeine has no demonstrable effect on muscle metabolism.
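For readers who want to see how such a test is run in software, here is a minimal Python sketch using scipy's two-sample t-test; the RER values are invented for demonstration and are not the study's data.

```python
from scipy import stats

# Hypothetical RER values (not the original study's data)
caffeine = [98, 95, 101, 104, 97, 100, 99, 103, 96]
placebo  = [105, 108, 102, 110, 107, 104, 109, 106, 111]

t, p = stats.ttest_ind(caffeine, placebo)  # independent two-sample t-test
print(t, p)   # compare p with 0.05 to judge statistical significance
```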

Dependent/Paired T-test

Used to compare two related means (mostly coming from a Repeated Measures design). In other words, it is used to compare differences between two variables coming from One Population, but under Differing Conditions (e.g. two observations for the same participants, one before and one after an intervention).

Dependent T-test Hypotheses

1. H0: The Effect Size for the “before” vs “after” measurements is equal to zero. This would usually mean that the two observations had no true difference in reality.

2. H1: The Effect Size for the “before” vs “after” measurements is not equal to zero.

This may mean that the measured effect size reflects a true difference in reality, the “before intervention” mean does differ from the “after intervention” mean.

Dependent T-test Example

An experiment tested whether a new exercise program increases the participants’ hemoglobin levels.

Fourteen participants took part in the three-week exercise program. Their hemoglobin content was measured before and after the program.

Here, we need to test the null hypothesis: The effect size for hemoglobin scores before and after the exercise program is zero; against the alternative hypothesis: The program is effective.

Significance of the Results: Sig. = 0.110

Interpretation:

No Statistical Significance for the Effect Size (p-value/sig. > 0.05). Therefore, the null hypothesis can’t be rejected and it may be concluded that the exercise program is not effective in improving hemoglobin content.
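A minimal sketch of a paired t-test in Python (added for illustration; the before/after hemoglobin values are made up):

```python
from scipy import stats

# Hypothetical haemoglobin values (g/dl) for the same participants
before = [11.1, 10.8, 11.5, 10.9, 11.2, 11.0, 10.7, 11.4]
after  = [11.3, 10.9, 11.6, 11.0, 11.1, 11.2, 10.8, 11.5]

t, p = stats.ttest_rel(before, after)   # paired (dependent) t-test
print(t, p)
```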


One-way Analysis Of Variance (ANOVA)

An extension of the t-test, used to determine whether there are any statistically significant differences between the means of three or more independent groups.

ANOVA Hypotheses

1. H0: All groups’ means in the population are equal.

2. H1: At least one group’s mean in the population differs from the others.

ANOVA Example

A study was conducted to assess if there are differences in three means of IQ scores of students from three groups of undergraduate students majoring in different disciplines: Physics, Maths, and Chemistry. Each group included 15 students.

Relevant ANOVA Result: P-value = 0.000

Interpretation:

Statistical Significance (p-value/sig. < 0.05). Therefore, the null hypothesis of the equality of means can be rejected: at least one of the groups is different. To figure out the group causing this difference in the IQ scores, further statistical tests (i.e. post-hoc analysis) might be performed.
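A one-way ANOVA of this kind can be run as follows (an illustrative sketch with invented IQ scores, not the study's data):

```python
from scipy import stats

# Hypothetical IQ scores for the three majors
physics   = [112, 118, 109, 121, 115, 117, 111, 119, 114, 116]
maths     = [120, 124, 119, 126, 122, 121, 125, 118, 123, 120]
chemistry = [108, 111, 107, 113, 109, 112, 110, 106, 114, 108]

f, p = stats.f_oneway(physics, maths, chemistry)   # one-way ANOVA
print(f, p)   # a small p suggests at least one group mean differs
```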

Association

An Association is any relationship between two variables. By looking at association, graphically or analytically, we can define the strength of the relationship (using correlation coefficients); the variables changing together; and the direction of the statistical relationship between those variables (e.g. direct or inverse).

You might have heard or read the phrase: association does not imply causation. It is possible to understand Causality as the interaction between two events where one of them is a consequence of another.

Consequently, the simple statistical relationship does NOT necessarily indicate causation (remember the Confounding Factors.)

The words Association and Correlation are usually used interchangeably, but we need to pinpoint the difference to give a relevant mathematical context for the rest of this manual.

Correlation is the type of association in which two variables are linearly related.

Pearson’s Correlation Coefficient (r)

It ranges from +1 to -1, with positive values suggesting a positive relationship (direct) and negative values suggesting a negative relationship (inverse).

An (r) of zero suggests that there is no relationship and that the two variables are independent from each other.


The closer the r value is to one of the extremes (+1 or -1), the stronger and more linear the relationship in that direction.

Pearson’s Correlation Coefficient Example

Researchers collected data from 20 children (n = 20) about their Age, IQ, and Short-term Memory (STM) span. After performing Pearson’s Correlation Coefficient Test, the output was as follows:

                       Age       STM Span   IQ
Age        r           1         0.723      -0.198
           Sig. (2-t)  -         0.000      0.402
STM Span   r           0.723     1          0.302
           Sig. (2-t)  0.000     -          0.195
IQ         r           -0.198    0.302      1
           Sig. (2-t)  0.402     0.195      -

Interpretation:

Any p-value less than 0.05 is considered statistically significant for the correlation between the two variables in the row and the column leading to it. Therefore, a statistically significant strong positive linear correlation is observed between Age and STM Span (r = 0.723, p = 0.000).
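A minimal sketch of how such a coefficient is computed in software (added for illustration; the age and STM values are invented):

```python
from scipy import stats

age      = [5, 6, 6, 7, 8, 9, 10, 11, 12, 12]   # years (hypothetical)
stm_span = [3, 4, 4, 4, 5, 5, 6, 6, 7, 7]       # items recalled (hypothetical)

r, p = stats.pearsonr(age, stm_span)   # Pearson's correlation coefficient
print(r, p)   # r near +1 with p < 0.05 -> strong positive linear correlation
```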

Regression

Used to determine the strength of, and quantify, the relationship between one dependent variable (y) and one or more independent variables (x). You can relate a set of x and y values using many types of regression equations.

The most used regression equation is the linear regression model, where you draw single or multiple lines that relate your x and y variables. As this is an introductory manual, we will only discuss the simple linear regression model.

We use the simple linear regression model when we investigate the relationship between a dependent variable and only one independent variable. This model is a very simple equation describing the best single straight line passing through the data observed: “y = a + b(x)”

1. y: the dependent variable

2. a: the constant/y-intercept

3. b: the slope of the line

4. x: the independent variable


Simple Linear Regression Example

A dataset obtained from a sample of girls was investigated to determine the relationship between their Age (in whole years) and their Forced Vital Capacity (FVC, in Liters). The calculations were carried out using a statistical algorithm and the final equation was defined as “FVC = 0.305 + 0.193(Age)”.

This means that as age increases by 1 year, the FVC increases by 0.193 liters. We will leave the 0.305 intercept for a more advanced setting.
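The fitted line (a and b) can be obtained with any statistical software; here is an illustrative Python sketch using invented age and FVC values, not the original dataset.

```python
from scipy import stats

age = [6, 7, 8, 9, 10, 11, 12, 13]              # years (hypothetical)
fvc = [1.5, 1.7, 1.9, 2.0, 2.3, 2.4, 2.6, 2.8]  # litres (hypothetical)

result = stats.linregress(age, fvc)       # simple linear regression: y = a + b*x
print(result.intercept, result.slope)     # a (constant) and b (change in FVC per year of age)
```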


INFERENTIAL STATISTICS NON-PARAMETRIC TESTS

Non-parametric Tests

When one of the assumptions of a Parametric Test is violated (especially Normality), it is no longer appropriate to use it; you should use Non-parametric Statistics instead. Non-parametric Tests are distribution-free and are used when the variables of interest are categorical, ordinal, or not normally distributed.

Also note, as established earlier, that the less you know about the population, the less certain the results of your analysis become; since non-parametric tests require less information than the classical parametric tests, they are generally less powerful.

Wilcoxon’s Signed-ranks Test

When the dataset is quantifiable but the variables themselves are ordinal or not normally distributed, the non-parametric Wilcoxon Signed-ranks Test may be used to obtain a confidence interval for the difference in population medians.

Wilcoxon’s Signed-ranks Test Hypotheses

1. H0: The median difference in the population is equal to zero.

2. H1: The median difference in the population is not equal to zero.

Wilcoxon’s Signed-ranks Test Example

The table below contains the results of a case–control study on the dietary intake of people with schizophrenia in Scotland.

It shows the daily energy intake of two dietary substances for the cases and the controls.

Intake/day    Cases (n=30)        Controls (n=30)   Median difference (95% CI)   Sig.
Protein (g)   84.5 (38.4-157.4)   96 (40.5-633)     15.9 (-1.1 to 32.8)          0.07
Alcohol (g)   0 (0-19.4)          4.7 (0-80)        5.4 (1.2 to 9.9)             0.009

Interpretation:

1. The Protein Intake Median Difference CI includes 0 and p-value/sig. > 0.05: NO Statistically Significant Difference between the cases and the controls is noticed.


2. The Alcohol Intake Median Difference CI does not include 0 and p-value/sig. < 0.05: A Statistically Significant Difference between the cases and the controls is noticed.
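As an illustration of how the signed-ranks test itself is run (added as a sketch; it uses invented paired before/after scores, since the test operates on paired differences):

```python
from scipy import stats

# Hypothetical paired measurements (e.g. a score before and after an intervention)
before = [12, 15, 9, 14, 10, 13, 8, 16, 11, 12]
after  = [14, 13, 11, 17, 9, 15, 10, 18, 12, 15]

w, p = stats.wilcoxon(before, after)  # Wilcoxon signed-ranks test on the paired differences
print(w, p)   # H0: the median difference is zero
```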

Mann-Whitney U Test/Wilcoxon Rank Sum Test

Used for comparing the same variable in two different groups. This test is the non-parametric alternative to the Independent T-test. The main difference is that the Mann-Whitney U Test is based on mean ranks (or medians), rather than the mean of the original values.

Mann-Whitney U Test Assumptions

Like the Independent T-test, the data must be obtained from two independent random samples for which we assume the original populations had the same variability.

Only the central tendency of the two samples is allowed to be different.

Mann-Whitney U Test Hypotheses

1. H0: The two samples come from the same population (same median).

2. H1: The two samples come from two different populations (different median).

Mann-Whitney U Test Example

This table is from a randomized controlled trial in an emergency department to compare the cost effectiveness of two treatments (ketorolac versus morphine) in relieving pain after a blunt instrument injury.

The table shows the median times (in minutes) spent by the two groups of patients between receiving analgesia and leaving the emergency department. The CI in the last column was derived by the Mann-Whitney U Test.

              Ketorolac group (n=75)   Morphine group (n=73)   Median Difference (95% CI)
Median Time   115.0 (75.0 to 149.0)    130.0 (95.0 to 170.0)   20.0 (4.0 to 39.0)

Interpretation:

The Median Difference CI does not include 0: A Statistically Significant Difference between the ketorolac and the morphine groups is noticed for the study outcome: Interval between receiving analgesia and leaving the emergency department.
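A minimal sketch of the Mann-Whitney U Test in Python (added for illustration; the times are invented, not the trial's data):

```python
from scipy import stats

# Hypothetical times (minutes) spent in the emergency department
ketorolac = [110, 95, 120, 105, 130, 98, 115, 102, 125, 118]
morphine  = [135, 128, 142, 130, 150, 122, 138, 145, 127, 133]

u, p = stats.mannwhitneyu(ketorolac, morphine, alternative="two-sided")
print(u, p)   # H0: the two samples come from the same population (same median)
```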


Kruskal-Wallis Test

Used for comparing the same variable in more than two groups. This test is the non-parametric alternative to the ANOVA. And note, like the ANOVA, it only tells us whether there is a difference between the groups, without telling us which.

Kruskal-Wallis Test Hypotheses

1. H0: All groups come from the same population (same medians).

2. H1: At least one group comes from a different population (different medians).

Kruskal-Wallis Test Example

Three groups of different ethnicities (Black, White, and Others) were tested using the Kruskal-Wallis Test to find whether their “happiness of marriage” scores differ.

Significance of the Results: Sig. = 0.000

Interpretation:

The Median Difference’s p-value/sig. < 0.05: A Statistically Significant Difference between the ethnicities is noticed for the “happiness of marriage” scores.
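An illustrative sketch of the Kruskal-Wallis test with three made-up groups of scores (not the study's data):

```python
from scipy import stats

# Hypothetical "happiness of marriage" scores for three groups
group_a = [7, 8, 6, 9, 7, 8, 7]
group_b = [5, 6, 5, 7, 6, 5, 6]
group_c = [8, 9, 9, 7, 8, 9, 8]

h, p = stats.kruskal(group_a, group_b, group_c)   # Kruskal-Wallis H test
print(h, p)   # a small p suggests at least one group has a different median
```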

Spearman’s Correlation Coefficient

Used to measure association. This test is the non-parametric alternative to Pearson’s Correlation Coefficient as a linear association measure. The difference is that any (or both) of the variables is allowed to be ordinal or not normally distributed.

Spearman’s Correlation Coefficient has the same characteristics as Pearson’s Correlation Coefficient but is usually expressed as ρs in the population and rs in the sample.
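A short illustrative sketch of Spearman's coefficient on ordinal data (the pain grades and dose levels below are hypothetical):

```python
from scipy import stats

# Hypothetical ordinal data: pain grade (1-5) vs. dose level (1-4)
pain_grade = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
dose_level = [4, 4, 3, 3, 3, 2, 2, 1, 1, 2]

rs, p = stats.spearmanr(pain_grade, dose_level)   # Spearman's rank correlation (rs)
print(rs, p)   # negative rs -> higher doses associated with lower pain grades
```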

Chi-squared Test

Used to test for an association between two Categorical Variables. It gives a p-value with no direct estimate or confidence interval for the estimates. It needs a large sample. In the case of having only two categories, the expected frequencies should not be less than 5.

In the case of having more than two categories, the expected frequencies should not be less than 1 in all categories and no more than 20% of the categories can fall below 5.

Chi-squared Test Hypotheses


1. H0: There is no association between the two variables.

2. H1: There is an association between the two variables.

Chi-squared Test Example

A large US study assessed the impact of time to defibrillation after in-hospital cardiac arrests on the incidence of cardiac events over the following hours. A chi-squared test produced a p-value of 0.013 (< 0.05). We can then say that we have Statistical Significance for a link between delayed defibrillation time and the incidence of cardiac events.
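A minimal sketch of a chi-squared test on a 2x2 table of observed frequencies (the counts below are invented for illustration, not the study's data):

```python
from scipy import stats

# Hypothetical 2x2 table of observed frequencies:
#                          event   no event
table = [[40, 60],   # delayed defibrillation
         [20, 80]]   # prompt defibrillation

chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p)   # p < 0.05 -> evidence of an association between the two variables
```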

Fisher’s Exact Test

Used to test for an association between two categorical variables. It has the same hypotheses as the Chi-squared Test. However, unlike the chi-squared test, Fisher’s Exact Test is valid for small samples (where a chi-squared test would produce an inaccurate, often too small, p-value).

Fisher’s Exact Test Hypotheses

1. H0: There is no association between the two variables.

2. H1: There is an association between the two variables.

Fisher’s Exact Test Example

The proportions of extremely preterm infants still on home oxygen at age 2 were categorized according to the mode of ventilation at birth: the high frequency oscillation ventilation (HFOV) versus the conventional ventilation (CV).

A Fisher’s exact test was done to test for an association between the mode of ventilation at birth and whether the infants are still on home oxygen. The test produced a p-value of 0.69 (> 0.05). That means that the results of this study are Not Statistically Significant and we do not have enough evidence to reject the null hypothesis.
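An illustrative sketch of Fisher's exact test on a small 2x2 table (the counts are hypothetical):

```python
from scipy import stats

# Hypothetical small 2x2 table:
#                 home O2   no home O2
table = [[3, 12],   # HFOV
         [4, 11]]   # conventional ventilation

odds_ratio, p = stats.fisher_exact(table)
print(odds_ratio, p)   # valid even with the small expected frequencies here
```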


SOME STUDY DESIGN SPECIFIC STATISTICS

This is a short bonus chapter, in which we first discuss two of the simplest and most commonly used statistical measures in clinical research:

1. Relative Risk

2. Odds Ratio.

Then, in the last section of the chapter, we present to you a little comparison between Incidence and Prevalence in statistical terms.

Cohort Studies

The scope of this study design within clinical research is to outline one or more Risk Factors for a certain outcome event (e.g. disease). It follows the exposure forward in time leading up to the outcome. The results of cohort studies are usually expressed as Observed Frequencies of exposure and outcome. The table template below outlines this concept.

                           Exposed to factor
                           Yes        No         Total
Disease of     Yes         a          b          a + b
interest       No          c          d          c + d
               Total       a + c      b + d      n = a + b + c + d

Fitting your dataset to this table allows you to estimate the Risk of developing the disease of interest within the population from which your sample was drawn:

Estimated Risk = (Number of Participants Developing the Disease During the Study Period) / (Cohort's Total Number) = (a + b) / n

However, more important to the aims of cohort studies, you may want to estimate the risk of developing the disease for the exposed and the unexposed participants separately.

Fortunately, this can be done using the same principle of Estimated Risk:

Estimated Risk (exposed) = a / (a + c)

Estimated Risk (unexposed) = b / (b + d)

The Relative Risk (RR) is the statistic that describes the association of the exposure and the disease. It can be assessed by comparing the estimated risk of disease for each of the two groups (exposed and unexposed):

RR = Estimated Risk (exposed) / Estimated Risk (unexposed) = [a / (a + c)] / [b / (b + d)]


An RR = 1 means the exposure has no effect on the disease outcome as both the exposed and the unexposed participants developed it with the same proportion.

An RR > 1 means the exposure may be a Risk Factor of the disease outcome; and an RR < 1 means the exposure may be a Protective Factor against the disease outcome.

When assessing RRs, always look for the CI of the calculation and the strength of the association (is the RR double, triple, or more?).
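A minimal sketch of the RR arithmetic on a hypothetical cohort table (the counts are invented; they simply follow the a/b/c/d layout above):

```python
# Hypothetical cohort counts following the 2x2 table above
a, b = 30, 20        # diseased: exposed (a), unexposed (b)
c, d = 70, 180       # disease-free: exposed (c), unexposed (d)

risk_exposed = a / (a + c)        # risk in the exposed group
risk_unexposed = b / (b + d)      # risk in the unexposed group
rr = risk_exposed / risk_unexposed
print(risk_exposed, risk_unexposed, rr)   # RR > 1 -> exposure may be a risk factor
```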

Case-Control Studies

Like the Cohort Studies, this study design is also concerned about Risk Factors. However, this design goes back from the outcome and outlines all possible exposures that may have led to it.

This fact is what prevents us from calculating an absolute risk in case-control studies. Instead, we calculate the Odds Ratio (OR) from the observed frequencies:

                           Exposed to factor
                           Yes        No         Total
Disease        Case        a          b          a + b
status         Control     c          d          c + d
               Total       a + c      b + d      n = a + b + c + d

The Odds Ratio is a ratio of the odds of being a case in the exposed to those of being a case in the unexposed groups.

The odds are the quotient of the probability of being a case over the probability of being free of the disease within each group, i.e.:

Odds (exposed) = [a / (a + c)] / [c / (a + c)] = a / c

Odds (unexposed) = [b / (b + d)] / [d / (b + d)] = b / d

Therefore, the Odds Ratio can be given by:

OR = (a / c) / (b / d) = (a × d) / (b × c)

OR indicates the odds associating a certain exposure to a disease outcome. An OR = 1 means the odds are the same for exposure and non-exposure. An OR > 1 means the odds of disease development in the presence of the exposure is higher; and an OR < 1 may mean that the exposure is protective against the disease.
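And a matching sketch of the OR arithmetic on hypothetical case-control counts (invented values following the a/b/c/d layout above):

```python
# Hypothetical case-control counts following the 2x2 table above
a, b = 30, 20        # cases: exposed (a), unexposed (b)
c, d = 70, 180       # controls: exposed (c), unexposed (d)

odds_exposed = a / c          # odds of exposure-group membership being a case
odds_unexposed = b / d        # odds among the unexposed
odds_ratio = odds_exposed / odds_unexposed   # equals (a*d) / (b*c)
print(odds_ratio)             # OR > 1 -> higher odds of disease with the exposure
```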


Prevalence vs Incidence

Prevalence

• Definition: The proportion of a population who have a characteristic (e.g. a disease) at a specific time period or point.

• Types: Period prevalence (old + new cases over a time period); Point prevalence (old + new cases at a certain time point).

• Factors affecting it: disease duration; fatalities; migration in/out of the population; improved care and reporting.

Incidence

• Definition: The occurrence of a new event (e.g. new cases) in a population over a specific time period.

• Types: Incidence Proportion (Risk): new cases arising from an initially disease-free, at-risk population; Incidence Rate: new cases relative to the population-time at risk during the observation period.

• Factors affecting it: new risk factors; population patterns; changes in the causative organism's virulence; improved care and reporting.


REFERENCE

Introduction

Hand DJ. Statistics : a very short introduction. Oxford University Press; 2008.

Field AP, Iles J. An Adventure in Statistics : The Reality Enigma. SAGE; 2016.

Vickers A. What Is a P-Value Anyway? : 34 Stories to Help You Actually Understand Statistics. Pearson Education; 2012.

Variables

Field AP. Discovering Statistics Using IBM SPSS Statistics.; 2018.

Descriptive Studies

Tufte ER. The Visual Display of Quantitative Information. 2nd ed. Graphics Press; 2001.

Bowers, David. Medical statistics from scratch: an introduction for health professionals. John Wiley & Sons, 2019.

Petrie, A., & Sabin, C. (2019). Medical Statistics at a Glance (4th ed.). Wiley-Blackwell.

Witte, R. S., & Witte, J. S. (2017). Statistics, 11th Edition. Wiley.

Howell, D. (2012). Statistical Methods for Psychology. David Howell (International Edition). Wadsworth Publishing Company.

Howell DC. Statistical Methods for Psychology.; 2017.

Field AP, Iles J. An Adventure in Statistics: The Reality Enigma. SAGE; 2016.

Inferential Statistics, Chapters (2-4)

Peacock, J., & Peacock, P. (2011). Oxford Handbook of Medical Statistics. Oxford University Press.

Bowers, D. (2019). Medical Statistics From Scratch: An Introduction For Health Professionals. John Wiley & Sons.

Verma, J. P., & Abdel-Salam, A. S. G. (2019). Testing Statistical Assumptions in Research. John Wiley & Sons.

Petrie, A., & Sabin, C. (2019). Medical Statistics at a Glance. John Wiley & Sons.

Madsen, B. (2011). Statistics for Non-statisticians. Heidelberg: Springer.

Some Study-Design Specific Statistics

Petrie, A., & Sabin, C. (2019). Medical Statistics at a Glance. John Wiley & Sons.

Peacock, J., & Peacock, P. (2011). Oxford Handbook of Medical Statistics. Oxford University Press.

Principles and Practice of Clinical Research (4th ed.).


IFMSA-Egypt Research Support Division Directors

Mahmoud Nagy: IFMSA-Egypt RSDD 21/22, IFMSA-Egypt NORE 22/23

Esraa Amr: IFMSA-Egypt RSDD 22/23

Supervising National Team 20/21

Mohamed Hamdy: RSDD GA 20/21

Kirollus Nagah: RSDD DA 20/21

Manual SWG

Aya Turki (SSS Kasr Al-Ainy)

Rana Elbayar (MSSA Mansoura)

Special thanks to Research National Team 22/23 for their contributions in reviewing the manual.



Design Team

IFMSA-Egypt
Mohamed Atef
PNSDD 21/23
Mohamed Hany
Publications & Content Creation Assistant 22/23
Habiba Abdelaziz
Publications & Content Creation Assistant 22/23

RESEARCH it out!
