Applied Medical Statistics

Jingmei Jiang

Department of Epidemiology and Biostatistics, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences/School of Basic Medicine, Peking Union Medical College, Beijing, China

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Jingmei Jiang to be identified as the author of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

9600 Garsington Road, Oxford, OX4 2DQ, UK

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

The contents of this work are intended to further general scientific research, understanding, and discussion only and are not intended and should not be relied upon as recommending or promoting scientific method, diagnosis, or treatment by physicians for any particular patient. In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of medicines, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each medicine, equipment, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. 
Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: Jiang, Jingmei, 1958- author.

Title: Applied medical statistics / Jingmei Jiang.

Description: Hoboken, NJ : John Wiley & Sons, Inc., 2022. | Includes bibliographical references and index.

Subjects: LCSH: Medicine--Research--Statistical methods--Textbooks. | Medical statistics--Textbooks. | Biometry--Textbooks.

Classification: LCC R853.S7 J53 2022 (print) | LCC R853.S7 (ebook) | DDC 610.72/7--dc23

LC record available at https://lccn.loc.gov/2021021097

LC ebook record available at https://lccn.loc.gov/2021021098

Cover image: © Andriy Onufriyenko/Getty Images

Cover design by Wiley

Set in 9.5/12pt STIXTwoText by Integra Services Pvt. Ltd, Pondicherry, India

Contents

Preface

Acknowledgments

About the Companion Website

1 What is Biostatistics

1.1 Overview

1.2 Some Statistical Terminology

1.2.1 Population and Sample

1.2.2 Homogeneity and Variation

1.2.3 Parameter and Statistic

1.2.4 Types of Data

1.2.5 Error

1.3 Workflow of Applied Statistics

1.4 Statistics and Its Related Disciplines

1.5 Statistical Thinking

1.6 Summary

1.7 Exercises

2 Descriptive Statistics

2.1 Frequency Tables and Graphs

2.1.1 Frequency Distribution of Numerical Data

2.1.2 Frequency Distribution of Categorical Data

2.2 Descriptive Statistics of Numerical Data

2.2.1 Measures of Central Tendency

2.2.2 Measures of Dispersion

2.3 Descriptive Statistics of Categorical Data

2.3.1 Relative Numbers

2.3.2 Standardization of Rates

2.4 Constructing Statistical Tables and Graphs

2.4.1 Statistical Tables

2.4.2 Statistical Graphs

2.5 Summary

2.6 Exercises

3 Fundamentals of Probability

3.1 Sample Space and Random Events

3.1.1 Definitions of Sample Space and Random Events

3.1.2 Operation of Events

3.2 Relative Frequency and Probability

3.2.1 Definition of Probability

3.2.2 Basic Properties of Probability

3.3 Conditional Probability and Independence of Events

3.3.1 Conditional Probability

3.3.2 Independence of Events

3.4 Multiplication Law of Probability

3.5 Addition Law of Probability

3.5.1 General Addition Law

3.5.2 Addition Law of Mutually Exclusive Events

3.6 Total Probability Formula and Bayes’ Rule

3.6.1 Total Probability Formula

3.6.2 Bayes’ Rule

3.7 Summary

3.8 Exercises

4 Discrete Random Variable

4.1 Concept of the Random Variable

4.2 Probability Distribution of the Discrete Random Variable

4.2.1 Probability Mass Function

4.2.2 Cumulative Distribution Function

4.2.3 Association Between the Probability Distribution and Relative Frequency Distribution

4.3 Numerical Characteristics

4.3.1 Expected Value

4.3.2 Variance and Standard Deviation

4.4 Commonly Used Discrete Probability Distributions

4.4.1 Binomial Distribution

4.4.2 Multinomial Distribution

4.4.3 Poisson Distribution

4.5 Summary

4.6 Exercises

5 Continuous Random Variable

5.1 Concept of Continuous Random Variable

5.2 Numerical Characteristics

5.3 Normal Distribution

5.3.1 Concept of the Normal Distribution

5.3.2 Standard Normal Distribution

5.3.3 Descriptive Methods for Assessing Normality

5.4 Application of the Normal Distribution

5.4.1 Normal Approximation to the Binomial Distribution

5.4.2 Normal Approximation to the Poisson Distribution

5.4.3 Determining the Medical Reference Interval

5.5 Summary

5.6 Exercises

6 Sampling Distribution and Parameter Estimation

6.1 Samples and Statistics

6.2 Sampling Distribution of a Statistic

6.2.1 Sampling Distribution of the Mean

6.2.2 Sampling Distribution of the Variance

6.2.3 Sampling Distribution of the Rate (Normal Approximation)

6.3 Estimation of One Population Parameter

6.3.1 Point Estimation and Its Quality Evaluation

6.3.2 Interval Estimation for the Mean

6.3.3 Interval Estimation for the Variance

6.3.4 Interval Estimation for the Rate (Normal Approximation Method)

6.4 Estimation of Two Population Parameters

6.4.1 Estimation of the Difference in Means

6.4.2 Estimation of the Ratio of Variances

6.4.3 Estimation of the Difference Between Rates (Normal Approximation Method)

6.5 Summary

6.6 Exercises

7 Hypothesis Testing for One Parameter

7.1 Overview

7.1.1 Concepts and Procedures

7.1.2 Type I and Type II Errors

7.1.3 One-sided and Two-sided Hypothesis

7.1.4 Association Between Hypothesis Testing and Interval Estimation

7.2 Hypothesis Testing for One Parameter

7.2.1 Hypothesis Tests for the Mean

7.2.1.1 Power of the Test

7.2.1.2 Sample Size Determination

7.2.2 Hypothesis Tests for the Rate (Normal Approximation Methods)

7.2.2.1 Power of the Test

7.2.2.2 Sample Size Determination

7.3 Further Considerations on Hypothesis Testing

7.3.1 About the Significance Level

7.3.2 Statistical Significance and Clinical Significance

7.4 Summary

7.5 Exercises

8 Hypothesis Testing for Two Population Parameters

8.1 Testing the Difference Between Two Population Means: Paired Samples

8.2 Testing the Difference Between Two Population Means: Independent Samples

8.2.1 t-Test for Means with Equal Variances

8.2.2 F-Test for the Equality of Two Variances

8.2.3 Approximation t-Test for Means with Unequal Variances

8.2.4 Z-Test for Means with Large-Sample Sizes

8.2.5 Power for Comparing Two Means

8.2.6 Sample Size Determination

8.3 Testing the Difference Between Two Population Rates (Normal Approximation Method)

8.3.1 Power for Comparing Two Rates

8.3.2 Sample Size Determination

8.4 Summary

8.5 Exercises

9 One-way Analysis of Variance

9.1 Overview

9.1.1 Concept of ANOVA

9.1.2 Data Layout and Modeling Assumption

9.2 Procedures of ANOVA

9.3 Multiple Comparisons of Means

9.3.1 Tukey’s Test

9.3.2 Dunnett’s Test

9.3.3 Least Significant Difference (LSD) Test

9.4 Checking ANOVA Assumptions

9.4.1 Check for Normality

9.4.2 Test for Homogeneity of Variances

9.4.2.1 Bartlett’s Test

9.4.2.2 Levene’s Test

9.5 Data Transformations

9.6 Summary

9.7 Exercises

10 Analysis of Variance in Different Experimental Designs

10.1 ANOVA for Randomized Block Design

10.1.1 Data Layout and Model Assumptions

10.1.2 Procedure of ANOVA

10.2 ANOVA for Two-factor Factorial Design

10.2.1 Concept of Factorial Design

10.2.2 Data Layout and Model Assumptions

10.2.3 Procedure of ANOVA

10.3 ANOVA for Repeated Measures Design

10.3.1 Characteristics of Repeated Measures Data

10.3.2 Data Layout and Model Assumptions

10.3.3 Procedure of ANOVA

10.3.4 Sphericity Test of Covariance Matrix

10.3.5 Multiple Comparisons of Means

10.4 ANOVA for 2 × 2 Crossover Design

10.4.1 Concept of a 2 × 2 Crossover Design

10.4.2 Data Layout and Model Assumptions

10.4.3 Procedure of ANOVA

10.5 Summary

10.6 Exercises

11 χ² Test

11.1 Contingency Table

11.1.1 General Form of Contingency Table

11.1.2 Independence of Two Categorical Variables

11.1.3 Significance Testing Using the Contingency Table

11.2 χ² Test for a 2 × 2 Contingency Table

11.2.1 Test of Independence

11.2.2 Yates’ Corrected χ² Test for a 2 × 2 Contingency Table

11.2.3 Paired Samples Design χ² Test

11.2.4 Fisher’s Exact Tests for Completely Randomized Design

11.2.5 Exact McNemar’s Test for Paired Samples Design

11.3 χ² Test for R × C Contingency Tables

11.3.1 Comparison of Multiple Independent Proportions

11.3.2 Multiple Comparisons of Proportions

11.4 χ² Goodness-of-Fit Test

11.4.1 Normal Distribution Goodness-of-Fit Test

11.4.2 Poisson Distribution Goodness-of-Fit Test

11.5 Summary

11.6 Exercises

12 Nonparametric Tests Based on Rank

12.1 Concept of Order Statistics

12.2 Wilcoxon’s Signed-Rank Test for Paired Samples

12.3 Wilcoxon’s Rank-Sum Test for Two Independent Samples

12.4 Kruskal-Wallis Test for Multiple Independent Samples

12.4.1 Kruskal-Wallis Test

12.4.2 Multiple Comparisons

12.5 Friedman’s Test for Randomized Block Design

12.6 Further Considerations About Nonparametric Tests

12.7 Summary

12.8 Exercises

13 Simple Linear Regression

13.1 Concept of Simple Linear Regression

13.2 Establishment of Regression Model

13.2.1 Least Squares Estimation of a Regression Coefficient

13.2.2 Basic Properties of the Regression Model

13.2.3 Hypothesis Testing of Regression Model

13.3 Application of Regression Model

13.3.1 Confidence Interval Estimation of a Regression Coefficient

13.3.2 Confidence Band Estimation of Regression Model

13.3.3 Prediction Band Estimation of Individual Response Values

13.4 Evaluation of Model Fitting

13.4.1 Coefficient of Determination

13.4.2 Residual Analysis

13.5 Summary

13.6 Exercises

14 Simple Linear Correlation

14.1 Concept of Simple Linear Correlation

14.1.1 Definition of Correlation Coefficient

14.1.2 Interpretation of Correlation Coefficient

14.2 Hypothesis Testing of Correlation Coefficient

14.3 Confidence Interval Estimation for Correlation Coefficient

14.4 Spearman’s Rank Correlation

14.4.1 Concept of Spearman’s Rank Correlation Coefficient

14.4.2 Hypothesis Testing of Spearman’s Rank Correlation Coefficient

14.5 Summary

14.6 Exercises

15 Multiple Linear Regression

15.1 Multiple Linear Regression Model

15.1.1 Concept of the Multiple Linear Regression

15.1.2 Least Squares Estimation of Regression Coefficient

15.1.3 Properties of the Least Squares Estimators

15.1.4 Standardized Partial-Regression Coefficient

15.2 Hypothesis Testing

15.2.1 F-Test for Overall Regression Model

15.2.2 t-Test for Partial-Regression Coefficients

15.3 Evaluation of Model Fitting

15.3.1 Coefficient of Determination and Adjusted Coefficient of Determination

15.3.2 Residual Analysis and Outliers

15.4 Other Aspects of Regression

15.4.1 Multicollinearity

15.4.2 Selection of Independent Variables

15.4.3 Sample Size

15.5 Summary

15.6 Exercises

16 Logistic Regression

16.1 Logistic Regression Model

16.1.1 Linear Probability Model

16.1.2 Probability, Odds, and Logit Transformation

16.1.3 Definition of Logistic Regression

16.1.4 Inference for Logistic Regression

16.1.4.1 Estimation of Model Coefficient

16.1.4.2 Interpretation of Model Coefficient

16.1.4.3 Hypothesis Testing of Model Coefficient

16.1.4.4 Interval Estimation of Model Coefficient

16.1.5 Evaluation of Model Fitting

16.2 Conditional Logistic Regression Model

16.2.1 Characteristics of Conditional Logistic Regression Model

16.2.2 Estimation of Regression Coefficient

16.2.3 Hypothesis Testing of Regression Coefficient

16.3 Additional Remarks

16.3.1 Sample Size

16.3.2 Types of Independent Variables

16.3.3 Selection of Independent Variables

16.3.4 Missing Data

16.4 Summary

16.5 Exercises

17 Survival Analysis

17.1 Overview

17.1.1 Concept of Survival Analysis

17.1.2 Basic Functions of Survival Time

17.2 Description of the Survival Process

17.2.1 Product Limit Method

17.2.2 Life Table Method

17.3 Comparison of Survival Processes

17.3.1 Log-Rank Test

17.3.2 Other Methods for Comparing Survival Processes

17.4 Cox’s Proportional Hazards Model

17.4.1 Concept and Model Assumptions

17.4.2 Estimation of Model Coefficient

17.4.3 Hypothesis Testing of Model Coefficient

17.4.4 Evaluation of Model Fitting

17.5 Other Aspects of Cox’s Proportional Hazard Model

17.5.1 Hazard Index

17.5.2 Sample Size

17.6 Summary

17.7 Exercises

18 Evaluation of Diagnostic Tests

18.1 Basic Characteristics of Diagnostic Tests

18.1.1 Sensitivity and Specificity

18.1.2 Composite Measures of Sensitivity and Specificity

18.1.3 Predictive Values

18.1.4 Sensitivity and Specificity Comparison of Two Diagnostic Tests

18.2 Agreement Between Diagnostic Tests

18.2.1 Agreement of Categorical Data

18.2.2 Agreement of Numerical Data

18.3 Receiver Operating Characteristic Curve Analysis

18.3.1 Concept of an ROC Curve

18.3.2 Area Under the ROC Curve

18.3.3 Comparison of Areas Under ROC Curves

18.4 Summary

18.5 Exercises

19 Observational Study Design

19.1 Cross-Sectional Studies

19.1.1 Types of Cross-Sectional Studies

19.1.2 Probability Sampling Methods

19.1.3 Sample Size for Surveys

19.1.4 Cross-Sectional Studies for Clues of Etiology

19.2 Cohort Studies

19.2.1 Measures of Association in Cohort Studies

19.2.2 Sample Size for Cohort Studies

19.3 Case-Control Studies

19.3.1 Measures of Association in Case-Control Studies

19.3.2 Sample Size for Case-Control Studies

19.4 Summary

19.5 Exercises

20 Experimental Study Design

20.1 Overview

20.1.1 Basic Components of an Experimental Study

20.1.2 Principles of Experimental Study Design

20.1.3 Blinding Procedures in Clinical Trials

20.2 Completely Randomized Design

20.2.1 Concept of Completely Randomized Design

20.2.2 Sample Size for Completely Randomized Design

20.3 Randomized Block Design

20.3.1 Concepts of Randomized Block Design

20.3.2 Sample Size for Randomized Block Design

20.4 Factorial Design

20.5 Crossover Design

20.5.1 Concepts of Crossover Design

20.5.2 Sample Size for 2 × 2 Crossover Design

20.6 Summary

20.7 Exercises

Appendix

References

Index

Preface

Over the past few decades, biomedical data have proliferated rapidly, and opportunities have arisen to use these data to improve human health. New methods, such as machine learning techniques, have emerged in response to the rapid growth in the volume of data, and to exploit data effectively and efficiently. These methods were founded on statistical learning theory, which is an expansion of traditional statistics. Therefore, cultivating basic statistical thinking plays a fundamental role in mastering these state-of-the-art methods and embracing the upcoming big data era, which makes a course in introductory biostatistics an indispensable part of the curriculum for medical students. However, as a branch of mathematics, statistics is characterized by hierarchically organized concepts, and a conceptual understanding of it is not always intuitive, which makes biostatistics an obstacle, and often a burden, for most medical students. During almost 30 years of teaching statistics at the Chinese Academy of Medical Sciences & Peking Union Medical College, China, I have seen generations of students, both undergraduate and postgraduate, struggle to grasp the essence of statistical concepts and the implications of mathematical formulas, and to master complex analytical methods. Moreover, their motivation to learn biostatistics has been dampened by abstruse formulas and derivation processes. A reader-friendly text that provides sufficient help for developing statistical thinking and building propositional knowledge, as well as for understanding and mastering analytical skills, is therefore greatly needed, and this was my motivation for writing this book.

Applied Medical Statistics is an introductory-level textbook written for postgraduate students in the human life-science field, with most topics also being suitable for undergraduate medical students. The ultimate objective of this book is to provide help in developing “habits of mind” for statistical thinking, and to establish a trade-off between mathematical derivation and know-how application among medical students. The most distinctive features of this book are summarized as follows: First, emphasis is placed on the most basic probability theory at the start of the book because, as the theoretical pillar for almost all statistical methods, strengthening these fundamental concepts is of great importance for laying a solid theoretical foundation for understanding subsequent chapters. However, for students to benefit from a practical and intuitive understanding of principles, rather than presenting abstract concepts, I have minimized the mathematical sophistication, and introduced content in a user-friendly style to nurture interest and motivate learning. Second, I have based most of the working examples on research projects that I have conducted or participated in, and such real-world settings, in my view, are more helpful for stimulating students’ interest, as well as helping them to learn how to use statistical procedures in practice. Finally, although this is an elementary applied statistics textbook, it covers some commonly used advanced statistical techniques, such as survival analysis and logistic regression. I also discuss fundamental issues in research design, and the inclusion of this content will greatly enhance the applicability and benefit to students who need to reference this book while performing day-to-day medical research.

Applied Medical Statistics, First Edition. Jingmei Jiang.

Companion website: www.wiley.com\go\jiang\appliedmedicalstatistics

I have organized the content of this book in a cohesive manner that links all the relevant foundation concepts as building blocks. Chapter 1 starts with an introduction to the basic concepts of biostatistics, and a section called “statistical thinking” strengthens the importance of statistical thinking in solving real-world problems. Chapter 2 contains an introduction to the basic concepts and application of some fundamental summary statistics. Moreover, it also covers how to organize data and display data using graphical methods. Chapters 3 to 5 are compact, and provide background supporting information to enable students to understand the basic rationale of biostatistics, in addition to laying a theoretical foundation for subsequent chapters. Chapter 3 contains the development of the basic principles of probability, with suitable examples. Chapter 4 covers the fundamental concepts of random variables and discrete probability distribution, including binomial distribution, multinomial probability distribution, and Poisson distribution. Chapter 5 briefly introduces the most commonly used continuous probability distributions: mainly normal and standard normal distributions. Chapter 6 mainly focuses on an introduction to the sampling distribution, as well as parameter estimation, and plays a unique role in linking descriptive statistics to inferential statistics. This chapter starts the formal discussion of the theoretical background, as well as the application of inferential statistics. Chapters 7 to 10 contain the basic principles of hypothesis testing and the elementary parametric hypothesis testing methods for normally distributed data in two-sample and multiple-sample scenarios, such as the t-test and analysis of variance methods. The common requirement for implementing these methods is the assumption that the underlying population should be normally distributed. 
Chapter 11 contains an introduction to the fundamental concepts of hypothesis testing methods for categorical data, the chi-square test, and Fisher’s exact test, which are widely used in statistical analysis. Chapter 12 contains an overview of some of the most well-known non-parametric tests suitable for scenarios in which assumptions of normality can be relaxed. Chapters 13 and 15 contain introductions to extensively used models and techniques for exploring the association between risk or predictor factors and continuous response variables. Chapter 13 mainly focuses on the basic concepts and application of simple linear regression, and Chapter 15 covers its extension: multiple linear regression. Additionally, Chapter 14 contains an introduction to simple correlation and rank correlation, which measure the strength of the relationship between two variables. Chapters 16 and 17 contain an introduction to some essential analysis techniques for modeling the binary and time-to-event response variables, such as unconditional and conditional logistic regression, and the Cox proportional hazards model used in processing time-to-event data. Chapter 18 then covers the most commonly used statistical evaluation indices and methods in diagnostic tests. Chapters 19 and 20, as the concluding chapters of this textbook, contain a discussion of methods for design and sample size estimation issues for observational and experimental studies.

Acknowledgments

I am grateful for the support I received from many people and institutions during the writing of this book. First and foremost, I express my deepest gratitude to an expert and consultant team, which included professors Youshang Zhou, Songlin Yu, Konglai Zhang, and Hui Li, all of whom are well-known Chinese statisticians and epidemiologists. Their unconditional support and encouragement at every stage of the writing of this book made it possible for me to complete this work.

I am also grateful for the immense help that I received from my colleagues at the School of Basic Medicine of Peking Union Medical College. Professor Tao Xu and Dr. Fang Xue deserve special acknowledgement for providing assistance through conducting a professional review of my work, and their constructive comments greatly improved the manuscript. I also want to acknowledge help from Doctors Wei Han, Zixing Wang, Yaoda Hu, and Haiyu Pang, who provided assistance in the production of this book through copyediting, reviewing, and correcting many subtle errors. Fruitful discussions with them also improved how the manuscript treated certain topics.

In particular, I appreciate the help of my post-graduate students in putting together this book. Peng Wu helped me in organizing much of the material and analyzing the data in the examples; Ning Li and Cuihong Yang produced accurate figures and diagrams; Yubing Shen and Luwen Zhang checked the accuracy of all the formulae; Yali Chen and Lei Wang constructed the index and checked the accuracy of terminology and reference sources; Jin Du and Yujie Zhao checked the answers to exercises; and Wentao Gu improved the quality of the mathematical formulas. Without their help, this work would have been far more difficult to complete.

Much of my motivation for writing comes from teaching and supervising post-graduate students at Peking Union Medical College. I am grateful for their inquisitive questions and useful feedback on a draft version of this manuscript, which allowed me to improve the final version.

I express my gratitude to the research projects from which I obtained the data and background for the examples and exercises; these projects were funded by the National Natural Science Foundation of China, Ministry of Science and Technology Fund, Chinese Academy of Medical Sciences Innovation Fund for Medical Sciences, Cancer Research UK, UK Medical Research Council, and US National Institutes of Health.

I would like to express my deep and sincere gratitude to managing editor Kimberly Monroe-Hill for the professional guidance, coordinating effort, and continued support during the entire drafting and publication process. I also wish to thank commissioning editor James Watson for assisting us in many ways with this book. I would also like to thank Arthi Kangeyan and Dilip Varma, the content refinement specialists, for the professional help in getting the manuscript ready for production.

I thank the School of Basic Medicine of Peking Union Medical College for making available all the support that I needed in the writing process.

Thanks are owed to Dr. Maxine Garcia, Dr. Jennifer Barrett, and the team at Edanz Group China for their dedicated and professional language editing support.

Finally, I thank my family for their understanding and encouragement while I was writing this book.

About the Companion Website

This book is accompanied by a companion website: www.wiley.com\go\jiang\appliedmedicalstatistics

The website includes the solutions manual and data sets.

1 What is Biostatistics?

1.1 Overview

Data are present everywhere in our lives, and almost all types of scientific research have to deal with the collection, description, or analysis of data. This makes statistics one of the most powerful methodologies across all disciplines for exploring the unknown world. Statistics is a discipline on its own and has a wide spectrum of theories, methods, and applications. A prerequisite for discussing the theory and application of statistics is the definition and statement of its objectives. According to Merriam–Webster’s Collegiate Dictionary, statistics is “a branch of mathematics dealing with the collection, analysis, interpretation, and presentation of masses of numerical data.” According to the Random House College Dictionary, it is “the science that deals with the collection, classification, analysis, and interpretation of information or data.” According to The New Oxford English–Chinese Dictionary, it is “the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.” Although there are some differences among these definitions, each definition implies that statistics is a science of data and uses the theory of mathematical statistics to make inferences.

The application of statistical theories and methods to medical research fields is termed “medical statistics,” or more broadly, biostatistics when applied to life sciences.

There are two branches of biostatistics based on its functions: (i) statistical description is concerned with the organization, summarization, and description of data; and (ii) statistical inference is concerned with the use of sample data to make inferences about the characteristics of a larger set of data. This division of descriptive and inferential statistics helps us to establish a progressive learning framework for statistics. However, this division is not always necessary in scientific activities where the two branches complement each other in deepening our knowledge of the real world.

We briefly review the development of biostatistics. In London in 1603, the Bills of Mortality began to be published weekly, which is generally considered to mark the beginning of biostatistics. Since then, related theories have continued to emerge, and the early twentieth century ushered in the peak of development of biostatistics. Several pioneers played a crucial role in the development of the theoretical framework and applications of biostatistics. G.J. Mendel (1822–1884), the father of modern genetics, used probability rules to discover the basic laws of biogenetics in the 1860s. He is considered to be one of the first to apply mathematical methods to biology. K. Pearson (1857–1936), the founding father of modern statistics, established the world’s first department of statistics at University College London in 1911, and developed several key statistical theories (e.g., measure of correlation and $\chi^2$ distribution). W.S. Gosset (1876–1937) proposed the t distribution and t-test in 1908, which laid the foundation for the sampling distribution of the sample mean, and signified the establishment of small sample theory and methodology. R.A. Fisher (1890–1962) developed statistical significance tests and various sampling distributions, and established the experimental design method and related statistical analysis techniques. These were collected in Design of Experiments, which was first published in 1935. With the efforts of these pioneers and other statisticians, after hundreds of years, a complete theoretical system of biostatistics had formed. At the present time, the development of biostatistics is being driven by the unprecedented and still growing range of life science applications using advances in computing power and computer technology, and new formats of data that continue to emerge.
Despite this, the ideas of basic statistics have not changed: to make an inference about a population based on information contained in a sample from that population and to provide an associated measure of goodness for the inference.

1.2 Some Statistical Terminology

In this text, we aim to explain basic statistical methods commonly applied in biomedical research. Before this, we provide an overview of several statistical terms, which are the premise for further learning.

1.2.1 Population and Sample

A population (statistical population or target population) is the collection of values of one or more characteristics of the study subjects that are our target of interest. A population is usually denoted by X (also called a random variable), and can be viewed as a dataset. The basic unit that constitutes the population is called the individual.

The dataset that defines a population is typically large or conceptual. The former suggests a finite population because it has a finite number of individuals regardless of how large it is. For example, the dataset of the heights of all the college freshman boys in Beijing in 2020 is a finite population (though very large). When the dataset only exists conceptually, we call it an infinite population, for example, the weights of infants and the antihypertensive treatment effects of a certain drug. The sampling theory and statistical inference principle introduced in this text are based on an infinite population.

A sample, denoted by $X_1, X_2, \ldots, X_n$ ($n$ is the sample size), is a subset of data selected from a population. The purpose of obtaining a sample is to draw inferences about the characteristics of its underlying unknown population.

The process of drawing a sample from a population is termed sampling. In practice, depending on the research objectives and feasibility, samples can be obtained using random or non-random sampling. A random sample is obtained through probability sampling. In this text, we generally assume the use of a simple random sample in which each individual in the population has an equal chance of being sampled. Non-random sampling relies on the subjective judgment of the researcher and is beyond the scope of this text.
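The idea of a simple random sample can be illustrated with a short Python sketch. The population here is simulated (hypothetical heights in cm), and the function name `simple_random_sample` is an illustrative assumption rather than anything defined in this text; the only point is that every individual has the same chance of being selected.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n individuals without replacement.

    Every individual in `population` has the same chance of selection,
    which is the defining property of a simple random sample.
    """
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical finite population: simulated heights (cm) of 1,000 individuals.
# The mean and spread used here are illustrative assumptions, not real data.
_rng = random.Random(2020)
population = [round(_rng.gauss(172.0, 6.0), 1) for _ in range(1000)]

sample = simple_random_sample(population, n=50, seed=1)
print(len(sample))  # sample size n
```

Fixing the seed makes the draw reproducible, which is convenient for teaching; in an actual study the selection mechanism, not the seed, is what guarantees randomness.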

Note the following: (i) The concept of population is different in biomedical research and statistical terminology. In biomedical research, the term “study population” (or study subject) typically refers to a group of humans or other species of organism, whereas in statistics, the population of interest consists of the characteristics of the study subjects. For example, in a study of blood glucose concentrations among 3-year-old children, all children of that age are regarded as the study population. However, from a statistical point of view, all blood glucose concentrations in children of that age constitute the population of interest. (ii) Although the dataset of a population is typically large, the essential difference between the population and the sample is not the amount of data we have, but the objective of the research. If the objective is to provide a description only, then the data we have can be regarded as a population, regardless of how small the dataset is, whereas if the objective is to draw an inference, then we need to clarify what population we are interested in, and consider how to obtain a representative sample, or how good the sample at hand is. How well the sample represents the population is a very important basis for a reasonable inference.

1.2.2 Homogeneity and Variation

In statistics, homogeneity means the similarity among individuals within a population. In fact, without homogeneity, we can rarely define a population. The individual differences in a homogenous population are termed variation.

Example 1.1 Survey of the height of college freshman boys in Beijing in 2020.

Homogeneity: College freshman boys in Beijing in 2020.

Variation: Individual differences in height.

Example 1.2 Study of the antihypertensive treatment effects of a drug.

Homogeneity: Hypertensive patients taking this drug.

Variation: Individual differences in the treatment effects.

From Examples 1.1 and 1.2, we can see that homogeneity refers to similarities in the nature, condition, or background of individuals in a population. The mission of statistics can be interpreted as describing the features of a homogenous population and identifying the heterogeneity of different populations. Variation is an inherent attribute of life sciences, and biomedical researchers should learn to use statistical methods to reveal the laws of biological phenomena in the context of variation.

1.2.3 Parameter and Statistic

A descriptive measure of the characteristics calculated on a population is called a population parameter, or simply, a parameter, generally denoted by the Greek letter θ. For example, in the survey of the height of freshman boys, the population mean (average height, typically denoted by µ) is a parameter. However, it is difficult to have data for the entire population most of the time, so a sample is used instead. Correspondingly, a descriptive measure based on a sample is called a sample statistic, or simply, a statistic. For example, if we draw a sample (typically a random sample) from the population and calculate the average height, the sample mean is a statistic and is typically denoted by $\bar{x}$. The mathematical definition and roles of statistics are elaborated on in Chapter 6. Because most populations are theoretical, the parameters are constants that are usually unknown, whereas the statistics are calculated from samples, which are indeterminate, and the values of statistics could be different for different samples.
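The contrast between a fixed parameter and a varying statistic can be seen in a small simulation. The sketch below assumes a simulated finite population of heights (the values, the population size, and the sample size of 100 are all illustrative assumptions); it computes the population mean once and then the sample mean for five different random samples.

```python
import random
import statistics

rng = random.Random(6)

# Hypothetical finite population of heights (cm); the values are simulated.
population = [rng.gauss(172.0, 6.0) for _ in range(10_000)]

# Parameter: the population mean is a single fixed constant.
mu = statistics.mean(population)

# Statistic: the sample mean changes from one random sample to the next.
sample_means = []
for _ in range(5):
    sample = rng.sample(population, 100)
    sample_means.append(statistics.mean(sample))

print(f"population mean (parameter): {mu:.2f}")
for i, xbar in enumerate(sample_means, start=1):
    print(f"sample {i} mean (statistic): {xbar:.2f}")
```

Each sample mean lands near the population mean without matching it exactly, and no two samples need agree with each other.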

1.2.4 Types of Data

Data are the recorded observations of the characteristics that make up a population. Data can be classified as numerical or categorical, depending on their properties:

(1) Numerical data, also known as quantitative data, are data expressed in numbers and are obtained by measuring each research subject’s indices, that is, the quantity or number of things. What distinguishes numerical data from other data recorded as numbers is that arithmetic operations on the values are meaningful. We can subdivide numerical data into two types:

Continuous data occur when data can be measured on a continuum or scale, i.e., there is a possible value between any other two values.

Most numerical data in biomedical research are continuous or can be viewed as continuous. For instance, if we conduct a survey on the health and nutritional status of 7-year-old boys in a less developed region in 2020, the measurement results of their heights (cm), weights (kg), and hemoglobin (g/L) can be viewed as continuous data because their values can assume, in theory, any value in a certain range.

Discrete data occur when the data can only take certain values. The possible values of discrete data are generally integers. For instance, if we also collect data on the number of cases of cold (0, 1, 2, …) in 2020 for the 7-year-old boys, then they are discrete data.

(2) Categorical data, also known as qualitative data, include two subtypes:

Unordered categorical data are obtained by dividing research subjects into two or more unordered groups. For instance, we can denote a man and woman as 1 and 2 for sex and denote A, B, O, and AB as 1, 2, 3, and 4 for blood type. Unlike numerical data, the numbers representing different categories do not have mathematical meanings.

Individual values do not have a quantitative difference if they belong to the same category and have qualitative differences if they belong to different categories.

Ordinal categorical data are obtained by dividing research subjects into orderings of an attribute. They are not measured; nonetheless, they have a potential ordering. For instance, the treatment effect of a disease can be ordered as cured, effective, improved, ineffective, and deteriorated. The laboratory test results of urine protein determination can be ordered as −, ±, +, ++, and +++. We can also use numerical values such as 1, 2, 3, … to represent the potential grades, although the numbers do not have numerical meanings.

Numerical data and categorical data are not set in stone; under certain conditions, they can be exchanged according to the research objectives and statistical methods used. For example, in a large survey on hypertension, the blood pressure values collected are numerical data. If we want to estimate the prevalence of hypertension, we could group survey participants according to whether they are hypertensive (1 for hypertensive and 0 for not hypertensive), and the data become unordered categorical data (binary data). If we want to know the degrees of hypertension, the blood pressure measurements can be reclassified into ordinal categorical data. Conversely, categorical data can also be changed to numerical data. For example, if we want to compare the epidemic of hypertension in different regions, we could use binary data to calculate the hypertension prevalence p, which ranges from 0 to 1 and belongs to the scope of numerical data. In the study design, we should collect as much raw data (original data) as possible in numerical form to minimize the loss of information and allow for flexible transformation.
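The conversions described above can be sketched in a few lines of Python. The blood pressure readings are made-up illustrative values, and the cutoffs (140, 160, and 180 mmHg for systolic pressure) are assumptions chosen only to demonstrate the regrouping; they are not taken from this text.

```python
# Hypothetical systolic blood pressure readings (mmHg): numerical data.
systolic = [118, 135, 142, 128, 165, 150, 122, 181, 139, 147]


def is_hypertensive(sbp, cutoff=140):
    """Binary (unordered categorical) coding: 1 = hypertensive, 0 = not.

    The default cutoff is an illustrative assumption.
    """
    return 1 if sbp >= cutoff else 0


def hypertension_grade(sbp):
    """Ordinal coding: 0 = not hypertensive, 1-3 = increasing degree.

    The grade boundaries are illustrative assumptions.
    """
    if sbp < 140:
        return 0
    if sbp < 160:
        return 1
    if sbp < 180:
        return 2
    return 3


binary = [is_hypertensive(v) for v in systolic]      # numerical -> binary
ordinal = [hypertension_grade(v) for v in systolic]  # numerical -> ordinal

# Categorical -> numerical: the prevalence p is a proportion between 0 and 1.
prevalence = sum(binary) / len(binary)

print(binary)
print(ordinal)
print(prevalence)
```

Note that the raw readings can be regrouped in either way, but neither the binary nor the ordinal codes can be turned back into the original measurements, which is why collecting the raw numerical values preserves the most information.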

1.2.5 Error

Error refers to the difference between the observed value and real value (parameter). The following formula defines the relation between them:

$x = \theta + \varepsilon$, (1.1)

where x denotes the observed value; θ denotes the real value, theoretically; and ε denotes the error, which can represent a random error or systematic error.

(1) A random error, as the name suggests, is completely random, that is, the magnitude and sign of ε cannot be predetermined, and its range is $\varepsilon \in (-\infty, +\infty)$. A random error is caused by the influence of many uncertain factors in the actual observation or measurement process.

As shown in Formula 1.1, a random error can be interpreted in many ways. For example, if x is the measured value in an experiment, then $\varepsilon = x - \theta$ reflects the measurement error in the results of each measurement. Additionally, the sampling error is the most typical type of random error. If x is a sample statistic, then $\varepsilon = x - \theta$ reflects the difference between statistic x and the parameter θ resulting from the sampling process, which is fundamental to the study of statistical inference introduced in Chapter 6.

(2) A systematic error, also known as bias in epidemiology, is another type of error that has a fixed magnitude and directional systematic deviation from a real number, that is, $\varepsilon = a$ ($a \neq 0$), where $a$ is a constant. A systematic error is caused by the influence of certain factors, for example, an uncorrected instrument, the sensory disturbance of the measurer, or high or low standards in evaluating a treatment effect.

Random errors are unavoidable but may follow regular patterns under certain conditions. The study and application of the law of random errors is one of the most important elements of statistics. In practice, random and systematic errors often coexist, and both need to be considered in the study design and data analysis.
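The different behavior of the two kinds of error in Formula 1.1 can be illustrated with a simulation. This is a minimal sketch under stated assumptions: the true value θ = 120, the Gaussian shape and spread of the random error, and the constant bias of 3 are all hypothetical choices, and the function name `measure` is introduced only for this illustration.

```python
import random
import statistics


def measure(theta, n, sd, bias=0.0, seed=None):
    """Simulate n observations x = theta + epsilon.

    Here epsilon is the sum of a constant systematic error (`bias`) and a
    random error drawn from a normal distribution with mean 0 and standard
    deviation `sd`. The normal shape is an assumption of this sketch.
    """
    rng = random.Random(seed)
    return [theta + bias + rng.gauss(0.0, sd) for _ in range(n)]


theta = 120.0  # hypothetical real value

random_only = measure(theta, n=10_000, sd=5.0, seed=1)
with_bias = measure(theta, n=10_000, sd=5.0, bias=3.0, seed=1)

mean_random = statistics.mean(random_only)
mean_biased = statistics.mean(with_bias)

print(f"random error only: mean = {mean_random:.2f}")
print(f"random + systematic error: mean = {mean_biased:.2f}")
```

Averaging many observations brings the first mean close to θ because the random errors tend to cancel, whereas the second mean stays shifted by about the size of the bias no matter how many observations are taken. This is one way to see why systematic errors have to be addressed in the design rather than by collecting more data.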

1.3 Workflow of Applied Statistics

The following four steps in applied statistical workflow are indispensable in practice:

Statistical design: This marks the beginning of scientific research, and is directly responsible for the accuracy and reliability of the research results. Statistical design should be conducted with specific research objectives and domain knowledge. This means that good research design is inevitably based on interactions between domain experts and statisticians. Two categories of research design exist in general, observational design and experimental design, which we discuss in Chapters 19 and 20, respectively.

Data collection: Data collection is used to obtain the raw data required by research through a reasonable and reliable approach. The collection of representative data is important for obtaining reliable conclusions. Regardless of which method is used, the accuracy and integrity of the data should be given high priority.

Statistical analysis: The next step is the management and analysis of the raw data according to the research objectives and types of data. This step typically includes the statistical description, statistical inference, and (or) statistical modeling for mining the information hidden in the data.

Statistical reporting: After all the steps are executed, the analysis results are displayed. Appropriate statistical tables and graphs can be used to enhance the presentation of results. Final conclusions and suggestions are drawn, guided by domain knowledge. A key feature of statistical reporting is that all conclusions are probabilistic.

1.4 Statistics and Its Related Disciplines

The discipline of statistics does not stand alone. Instead, it is closely related to the development of other disciplines.

Statistics and medicine: Statistics not only helps to solve practical problems, but also promotes its own development during the process. Its application to the biomedical sciences is a typical demonstration of this. With the further understanding of data in the twenty-first century, evidence-based medicine, precision medicine, and other quantitative methods will provide a broader space for applying statistics.

Statistics and mathematics: Statistics is a branch of mathematics. The mathematical basis of statistics is the theory of probability and calculus. However, this does not mean that learning statistics must be based on knowledge of advanced mathematics. In fact, the objective of learning statistics is not to master complicated mathematical proofs but the application of statistical thinking and methods to solve problems that arise in scientific research.

Statistics and computer science: Modern statistics cannot be separated from developments in computer science. The field of statistics has benefited greatly from advances in computing power. In the digital era, computer science and information technology are as important to statistics as the theory of probability. Computer software has become an important auxiliary tool for statistical analysis. The conclusions are largely the same using different statistical software, even if the numerical results have minor differences. To avoid any distraction caused by these technical issues in learning statistical ideas and methods, in this text, we present results mainly using SPSS, among other alternatives.

1.5 Statistical Thinking

Statistical thinking includes applying rational thinking and statistical science to critically evaluate data and the resultant correct and false inferences. How does statistical thinking play its role in scientific research practice? To answer this question, we must note that inferences based on sample data are almost always subject to error because a sample does not provide an exact image of the population.

The population is typically a theoretical and conceptual truth of interest. The science of statistics helps us to establish a methodological framework or workflow to draw inferences about the unknown characteristics of the population using the sample of limited data at hand, based on one or a few assumptions. The statistical inference process is an important part of the scientific method. Inference based on experimental or observational data is first used to develop a theory about some phenomenon. Then the theory is tested against additional sample data.

Errors may occur in the inference process based on a sample. What matters is how we quantify and evaluate the error. Statistics connects the quantification of errors with the measurement of the reliability of inference using probability. This connection provides a solid theoretical basis for reasonable statistical inference.

Statistics builds a bridge between abstract theoretical concepts and the solution of specific problems. It enables researchers to make inferences (estimates and decisions about the target population) with a known measurement of reliability. With this ability, a researcher can make intelligent decisions and inferences from data; that is, statistics helps researchers to think critically about their results.

We end this chapter with remarks from the famous statistician, C.R. Rao.

All knowledge is, in the final analysis, history. All sciences are, in the abstract, mathematics. All judgments are, in their rationale, statistics.

1.6 Summary

The learning objective of this chapter is to understand some basic concepts in statistics and the role of statistics in biomedical research, which are the basis for future learning.

Statistics is a science about data, and its basic characteristic is that it is a quantitative science.

Two branches, statistical description and statistical inference, constitute the main content of statistics.

The application of statistics to biomedical research generally includes the following four steps: statistical design, data collection, statistical analysis, and statistical reporting.

Statistical thinking includes the application of rational thinking and statistical science to critically evaluate data and make inferences from them.

1.7 Exercises

1. Suppose you were so interested in the waist circumference of your schoolmates that you prepared a tape measure in a statistics class and measured the waist circumference of all your classmates who were present. Answer the following questions:

(a) Are the data you obtained a sample or a population? For what research objectives should they be considered a sample or a population?

(b) If it is considered a sample, what is the population you are drawing an inference about? How representative of the population is it?

(c) How do you determine the homogeneity of your population? Is there heterogeneity? If yes, how can you improve the homogeneity? Is there variation? What may lead to this variation?

(d) Are there errors in the obtained data? What are the random errors and systematic errors? Can you tell the difference between them? Can you, and how do you, minimize the errors?

(e) What steps do you need to follow to complete a report on your survey?

2. Choose a quantitative research article in clinical medicine, basic medicine, public health, or any biomedical research topic you are interested in and answer the following questions:

(a) What is the population and how is it defined from the perspectives of the research and statistics, respectively? What are the differences between the concepts of population using different perspectives?

(b) Is the sample presented in the research a random sample? What are the advantages of a random sample and non-random sample?

(c) Illustrate the relationship between the population and sample, and between homogeneity and variation using your selected paper.

(d) Is there any factor that may lead to random or systematic errors in the research? How do you distinguish them? How have they been minimized? Can you think of ways to further minimize the errors?

(e) What data are collected? What are the types of data? How do you determine the type of data? Which type of data contains more information? Do these types of data allow for further transformation?

(f) How many steps are involved in the statistical plan? What are the specific roles of these steps and what is the relation between these steps?

(g) Are the conclusions obtained from the research correct? How does the knowledge of statistics learned from this chapter help you with critical thinking?

(h) Can you follow the conceptual path as laid out by the research and use statistical critical thinking to solve a problem that interests you in your daily life? Try to create a statistical design as you deepen your knowledge and skills through further learning.

2 Descriptive Statistics

2.1 Frequency Tables and Graphs

2.1.1 Frequency Distribution of Numerical Data

2.1.2 Frequency Distribution of Categorical Data

2.2 Descriptive Statistics of Numerical Data

2.2.1 Measures of Central Tendency

2.2.2 Measures of Dispersion

2.3 Descriptive Statistics of Categorical Data

2.3.1 Relative Numbers

2.3.2 Standardization of Rates

2.4 Constructing Statistical Tables and Graphs

2.4.1 Statistical Tables

2.4.2 Statistical Graphs

2.5 Summary

2.6 Exercises

In the previous chapter, we learned that there are two branches in statistics: description and inference. Statistical description, as the basis of statistical inference, provides a way to organize and summarize data in a meaningful and intuitive manner. In this chapter, we introduce several basic statistical tools for describing data. These tools include tables and graphs that rapidly convey a concise presentation or visual picture of the data, as well as numerical measures that describe certain characteristics of the data. The appropriate tool depends on the type of data (numerical or categorical) that we want to describe.

Example 2.1 In a survey on the physiological characteristics of school-age children in a certain region in 2010, 153 10-year-old girls were randomly selected, and several physiological indicators were measured and recorded. The raw data of girls’ height are shown in Table 2.1.

The data shown in Table 2.1 are raw data presented in an unorganized manner. Although it would be easy to find the highest and lowest values in this sample, it would be very difficult to extract more useful information from this set of data without organizing them using descriptive statistical techniques.
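As a hedged illustration of what organizing raw data can look like, the Python sketch below groups 153 values into equal-width class intervals and counts how many fall into each. The heights are simulated placeholders, not the values in Table 2.1, and the class width of 4 cm and the function name `frequency_table` are assumptions made only for this example.

```python
import random
from collections import Counter

# Simulated placeholder heights (cm) for 153 girls. These are NOT the
# values in Table 2.1; the mean and spread are illustrative assumptions.
rng = random.Random(2010)
heights = [round(rng.gauss(140.0, 6.5), 1) for _ in range(153)]


def frequency_table(values, width):
    """Group values into equal-width class intervals [lower, upper).

    Returns a list of (lower, upper, frequency) tuples covering the range
    from the smallest to the largest value.
    """
    start = int(min(values) // width) * width  # lower limit of first class
    counts = Counter(int((v - start) // width) for v in values)
    n_classes = max(counts) + 1
    return [
        (start + i * width, start + (i + 1) * width, counts.get(i, 0))
        for i in range(n_classes)
    ]


table = frequency_table(heights, width=4)
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {freq}")
```

Even this simple grouping shows where the values concentrate and how widely they spread, which is hard to see in an unordered list of 153 numbers.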

Applied Medical Statistics, First Edition. Jingmei Jiang.

Companion website: www.wiley.com\go\jiang\appliedmedicalstatistics


