

Summer Research Institute 2025
DATA ANALYSIS–ORIENTED UNDERGRADUATE RESEARCH: INDUSTRY 4.0 AND BIG DATA ANALYSIS IN HEALTHCARE

Professor
Fatma Pakdil, MBA, Ph.D.
Professor of Management
College of Business Department of Business Administration



Introduction
“Data Analysis–Oriented Undergraduate Research: Industry 4.0 and Big Data Analysis in Healthcare” explores how emerging technologies and data-driven methodologies are being used not only in labs and corporations but also in undergraduate research environments to tackle real-world healthcare challenges.

Data Sources & Methodology
Utilized HCUP and the Nationwide Readmissions Database (NRD)

Developed and tested 8 research hypotheses

Applied linear and binary logistic regression models and control charts for data analysis

Findings supported by graphs, charts, and statistical visualizations
Introduce research methodology and scientific writing
• Quantitative Research
Dependent and independent variables
Inclusion/exclusion
Background

• Research on patients who undergo Total Hip Arthroplasty (THA) and Total Knee Arthroplasty (TKA) has shown that the length of hospital stay (LOS) can influence the chances of being readmitted after surgery.
• Studies often use statistical models to explore patterns in patient recovery. These models have shown that, over time, patients tend to have shorter hospital stays as recovery practices improve.
• While LOS is an important factor, other elements like a patient’s overall health, existing medical conditions, and where they are discharged to (home, rehab, etc.) also play a major role in whether they return to the hospital.
• Understanding these different factors is essential for improving recovery plans and making better decisions for patient care after surgery.
What are THA and TKA?
• Total Hip Arthroplasty (THA) is a medical procedure in which a damaged hip joint is replaced with an artificial implant. This surgery is usually recommended for patients with severe arthritis, fractures, or other hip joint issues that cause chronic pain and limit mobility.
• Total Knee Arthroplasty (TKA) involves replacing a worn or damaged knee joint with a prosthetic implant. This procedure is typically performed on individuals suffering from advanced osteoarthritis or injury that causes ongoing knee pain and stiffness.
• Both THA and TKA are common and effective surgeries for managing joint problems in aging or severely injured populations.

What is Big Data?
Big data refers to extremely large and complex datasets that are difficult to process, manage, and analyze using traditional data processing tools. It is typically characterized by the "5 Vs":
• Volume – The amount of data generated is massive.
• Velocity – Data is created and processed at high speed.
• Variety – Data comes in many formats (text, images, videos, etc.).
• Veracity – The quality and accuracy of the data can vary.
• Value – The insights gained from analyzing big data can provide significant value.
Big data is used in many fields, including healthcare, business, social media, and science, to find patterns, make predictions, and improve decision-making.
Why do we need it for our research?
• More Accurate Results: With access to huge datasets, researchers can spot patterns and trends that would be impossible to detect with small samples. This leads to stronger, more reliable conclusions.
• Faster Discoveries: Big data tools help process information quickly, so research that used to take months or years can now be done in days or weeks.
• New Research Opportunities: It allows researchers to explore complex questions, like predicting disease outbreaks or analyzing millions of social media posts to study public behavior.
• Personalized Insights: In fields like healthcare, big data can help create more personalized treatment plans by analyzing patient records, genetics, and outcomes on a large scale.
Laney, D. (2001). 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Group Research Note.
Rehman, A., Naz, S. and Razzak, I. (2022), “Leveraging big data analytics in healthcare enhancement: trends, challenges and opportunities”, Multimedia Systems, Vol. 28 No. 4, pp. 1339-1371, doi: 10.1007/s00530-020-00736-8
What is Industry 4.0?
Industry 4.0 (also called the Fourth Industrial Revolution) refers to the current trend of automation and data exchange in manufacturing and other industries. It combines advanced technologies to make factories smarter, more efficient, and more connected.
Why do we need it for our research?
• Better Data Collection & Analysis
Smart sensors, machines, and devices collect large amounts of data in real time. This helps researchers study patterns, test theories, and make discoveries much more accurately.
• Faster Research Process
Automation and AI speed up tasks like data processing, lab experiments, and simulations—so research that used to take months can be done in days.
• Improved Decision-Making
With real-time data and advanced analytics, researchers can make smarter decisions and adjust their work based on what the data shows.
• Cost Efficiency
By automating processes and using smart systems, research becomes more affordable and resource-efficient over time.
• Remote Collaboration & Access
Cloud computing and connected systems make it easy for researchers across the world to work together and access shared data and tools.
• Customized Solutions
Industry 4.0 allows for more personalized research, especially in fields like healthcare, where treatments can be tailored to individuals based on data.
METHODS USED
• Simple linear regression
• Multivariate linear regression
• Multivariate logistic regression
• Statistical process control
• Control charts
SOFTWARE USED
• Minitab
• SPSS
• R


Data Collection and Patient Selection
• The study uses the 2010–2017 NRD data sets provided through the Healthcare Cost and Utilization Project (HCUP) by the Agency for Healthcare Research and Quality (AHRQ) in the U.S.
• HCUP is a family of databases that includes the following:
• Kids’ Inpatient Database (KID)
• National (Nationwide) Inpatient Sample (NIS)
• Nationwide Ambulatory Surgery Sample (NASS)
• Nationwide Emergency Department Sample (NEDS)
• Nationwide Readmissions Database (NRD)
• State Inpatient Databases (SID)
• State Ambulatory Surgery and Services Databases (SASD)
• State Emergency Department Databases (SEDD)
Nationwide Readmissions Database (NRD)

• The NRD primarily supports analysis of readmission rates for all payers and the uninsured at the national level. It contains data from approximately 18 million discharges each year and is well suited to inform decisions at all levels.

Assumptions of Regression
1. Linear relationship (scatter plots)
2. Multivariate normality (residuals are normally distributed)
3. No or little multicollinearity (independent variables are not highly correlated with each other)
4. No autocorrelation (each residual is unrelated to the residuals that precede it)
5. Homoscedasticity (residuals have constant variance, with no pattern in their spread)
Linear relationship (scatter plots)
• There is a linear relationship between the age and LOS variables as presented in the scatter plot.

Multivariate normality (Assumes residuals are normally distributed)

After transforming the variable multiple times, multivariate normality could not be achieved.


No or Little Multicollinearity
This assumption requires that the independent variables are not highly correlated with one another, which allows the model to estimate a unique effect for each predictor.
We ran multicollinearity tests and obtained a variance inflation factor (VIF) value for each one.


Autocorrelation of residuals

• Residuals are independent across observations, meaning there is no autocorrelation—the error at one time point should not predict the error at another.
• Durbin-Watson test
• data: model
• DW = 1.6919, p-value < 2.2e-16
• alternative hypothesis: true autocorrelation is greater than 0
• The p-value is less than .05; therefore, we conclude that the residuals are autocorrelated, that is, the error terms are correlated with each other.
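For readers unfamiliar with the Durbin-Watson statistic, the following minimal Python sketch shows how it is computed from a sequence of residuals: the sum of squared successive differences divided by the sum of squared residuals. The residuals here are hypothetical and are not taken from the NRD model.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals for illustration only.
resid = [0.5, 0.4, 0.6, -0.2, -0.3, -0.1, 0.2, 0.3]
dw = durbin_watson(resid)
print(f"DW = {dw:.3f}")
```

The statistic ranges from 0 to 4. Values near 2 suggest no first-order autocorrelation, and values below 2, like the reported 1.6919, point toward positive autocorrelation.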
Homoscedasticity of residuals
• Model output
• lm(formula = los ~ age + charlindex, data = NRD)
Residuals:
    Min      1Q  Median      3Q     Max
 -10.50   -1.30   -0.50    0.36  360.36
Coefficients:
              Estimate  Std. Error  t value  Pr(>|t|)
(Intercept) -0.0736046   0.0092641   -7.945  1.94e-15 ***
age          0.0444437   0.0001379  322.247   < 2e-16 ***
charlindex   0.5328875   0.0012639  421.631   < 2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.1 on 4160549 degrees of freedom
Multiple R-squared: 0.07607, Adjusted R-squared: 0.07607
F-statistic: 1.713e+05 on 2 and 4160549 DF, p-value: < 2.2e-16
The p-value is less than .05; therefore, we conclude that the residuals are heteroscedastic, that is, the variance of the residual errors is not constant.
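The coefficients in the model output above define the fitted equation LOS = -0.0736046 + 0.0444437 × age + 0.5328875 × charlindex. The short Python sketch below applies it to one hypothetical patient (age 65, Charlson index 2); the patient values are assumptions chosen for illustration.

```python
# Coefficients copied from the lm() output for los ~ age + charlindex.
INTERCEPT = -0.0736046
B_AGE = 0.0444437
B_CHARLINDEX = 0.5328875


def predicted_los(age, charlindex):
    """Predicted length of stay in days from the fitted linear model."""
    return INTERCEPT + B_AGE * age + B_CHARLINDEX * charlindex


# Hypothetical patient, not an NRD record.
example = predicted_los(age=65, charlindex=2)
print(f"Predicted LOS: {example:.2f} days")
```

Because the model's R-squared is only about 7.6%, such a prediction is a rough average rather than a reliable estimate for an individual patient.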
Hypothesis 1 - A THA/TKA patient’s LOS affects their chances of readmission.




What this means:
The response variable records whether or not each patient was readmitted.
• "1" means yes, they were readmitted (about 192k patients).
• "0" means no, they were not readmitted (about 3.97 million patients).
• There are over 4.16 million total patient records in this dataset.
What this means:
• This is the formula the model uses to calculate the probability of being readmitted, based on how many days (LOS) the patient stayed in the hospital.
What this means:
• The coefficient for LOS is 0.077607, which means that for every additional day in the hospital, the log-odds of readmission go up by 0.077607.
• The p-value is reported as 0.000 (that is, below 0.001), meaning this result is statistically significant.
• VIF = 1.00 → No multicollinearity, meaning LOS does not overlap with another variable.
What this means:
• An odds ratio > 1 means that a longer LOS increases the odds of readmission.
• Here, for every 1 extra day in the hospital, the odds of being readmitted go up by about 8.07%.
• The confidence interval does not contain 1, meaning that the odds ratio is statistically significant.
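The link between the LOS coefficient and the 8.07% figure is that a logistic regression coefficient is a change in log-odds, so exponentiating it gives the odds ratio. A minimal Python sketch of that conversion (the function names are illustrative, not from the study's software):

```python
import math


def odds_ratio(coef):
    """Convert a logistic regression coefficient (log-odds) to an odds ratio."""
    return math.exp(coef)


def percent_change_in_odds(coef):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (odds_ratio(coef) - 1.0) * 100.0


los_coef = 0.077607  # LOS coefficient reported for Hypothesis 1
print(f"Odds ratio: {odds_ratio(los_coef):.4f}")
print(f"Change in odds per extra day: {percent_change_in_odds(los_coef):.2f}%")
```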
Hypothesis 2: THA/TKA patient’s age affects their chances of readmission.

Interpretation:
For each additional year of age, the odds of readmission increase by 3.54%, holding all else constant.

Interpretation:
• Intercept (-5.4338): This is the constant term, which represents the baseline log-odds of readmission when age is zero. Although age cannot realistically be zero in this population, this number sets the baseline prediction of readmission before the effect of age is considered.
• Slope (0.034798): This coefficient shows the change in the log-odds of readmission for each additional year of age. For each year older the patient is, the log-odds of readmission increase by about 0.034798.
• P-value (0.000): Because the reported p-value of 0.000 is less than the alpha value (0.05), the effect of age is statistically significant.
• The model explains only 1.68% of the variation in readmission among THA/TKA patients based on age. This means that age alone is a weak predictor of readmission.
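To see what the intercept and slope imply in practice, the sketch below plugs them into the standard logistic function, p = 1 / (1 + e^-(b0 + b1 × age)), to get a predicted readmission probability. It is written in Python for illustration, and the ages chosen are arbitrary examples.

```python
import math

# Coefficients reported for the age-only logistic regression.
INTERCEPT = -5.4338
B_AGE = 0.034798


def readmission_probability(age):
    """Predicted probability of readmission at a given age."""
    log_odds = INTERCEPT + B_AGE * age
    return 1.0 / (1.0 + math.exp(-log_odds))


for age in (55, 65, 75):  # arbitrary example ages
    print(age, f"{readmission_probability(age):.4f}")
```

At age 65 the predicted probability is roughly 0.04, and each additional year multiplies the odds (not the probability) by about 1.0354.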
Hypothesis 3: THA/TKA patients’ gender affects their chances of readmission.
Model Used: Binary Logistic Regression

• Y' refers to the predicted readmission.
• Female_0 refers to male.
• Female_1 refers to female.
• Female_0 has a slope of 0 because it is the reference category against which Female_1 is compared.

• The adjusted R-squared of 0.02% indicates that only 0.02% of the variation in readmission is explained by the patient's gender.
Hypothesis 3 Continued



• The variance inflation factor (VIF) for the female variable is 1.00, which indicates no multicollinearity (the variables are not correlated with each other).

• The confidence interval for the odds ratio does not include 1, so the effect is statistically significant.
• An odds ratio of 0.9113 indicates that females have 8.87% lower odds of readmission than males.
• The test checks whether the variable is statistically significant.
• Because the p-value is less than 0.05 (the α value), we conclude that the female variable is significant.
Hypothesis 4 : THA/TKA patient's number of additional chronic conditions affects their chances of readmission.
Model Used: Binary Logistic Regression

• Rows used: 2,181,926. These are the data points that were included in the analysis.
• Rows unused: 1,978,629. These rows were excluded because of missing values or model-based filtering.

• The data were only available for 2010–2015.

• Each chronic condition increases the log-odds by 0.1415.
• The effect of nchronic is highly significant (p < 0.001).
• No multicollinearity concern (VIF = 1.00).
Hypothesis 4 Continued


• The adjusted R-squared of 5.6% indicates that 5.6% of the variation in readmission is explained by the patient's number of chronic conditions.
• The odds ratio shows 15.2% higher odds of readmission per additional chronic condition.
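Because the effect is per condition on the log-odds scale, it compounds multiplicatively on the odds scale when a patient has several chronic conditions. A brief Python sketch using the reported coefficient of 0.1415 (the condition counts are arbitrary examples):

```python
import math

NCHRONIC_COEF = 0.1415  # log-odds increase per chronic condition, as reported


def odds_multiplier(n_conditions):
    """Factor by which the odds of readmission change for n additional conditions."""
    return math.exp(NCHRONIC_COEF * n_conditions)


for k in (1, 3, 5):  # arbitrary example counts
    print(k, f"{odds_multiplier(k):.3f}")
```

Three additional conditions therefore raise the odds by a factor of 1.152³, about 1.53, rather than by 3 × 15.2%.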
Hypothesis 5
THA/TKA patient's income level affects their chances of readmission. We were unable to obtain reliable results for this hypothesis because of difficulties cleaning negative values out of zipinc_qrtl (the patient’s income variable) and difficulty accessing Minitab. This highlights the missing entries and missing values in this database and the importance of cleaning the data before use.

Source: Google Unveils New $750M Data Center As Part Of $9.5B Goal | CRN
Hypothesis 6

Source: Nonprofit hospitals under growing scrutiny over how they justify billions in tax breaks | CNN
THA/TKA patients are more likely to be readmitted if they are admitted to private non-profit hospitals.
FIRST REGRESSION- NON-PROFIT VS GOVERNMENT & FOR PROFIT

The results show that, relative to patients in government-owned hospitals, those in private non-profit hospitals have 9.7% lower odds of being readmitted, while those in private for-profit hospitals have 9.3% higher odds. H_contrl_1 (government-owned) is the reference category for the comparison; all ownership categories are coded as dummy variables.

Source: Non-Profit vs. For-Profit: Which Business Model Is Best for Local News? | Web Publisher PRO
ODDS RATIO FOR HOSPITAL OWNERSHIP

Relative to private non-profit hospitals, being admitted to a private for-profit hospital increases the odds of readmission by about 21%; equivalently, non-profit admission lowers the odds by about 17% relative to for-profit. The fact that the 95% CI does not include 1 means that all three odds ratios are statistically significant.
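Because all three ownership categories share one reference level, the odds ratio between any two non-reference categories is the ratio of their odds ratios against the reference. The Python sketch below applies this to the rounded percentages reported in the first regression (9.7% lower and 9.3% higher odds relative to government-owned hospitals), so the results are approximate.

```python
# Odds ratios relative to government-owned hospitals, derived from the
# rounded percentages reported in the first regression.
OR_NONPROFIT_VS_GOV = 1 - 0.097  # 9.7% lower odds
OR_FORPROFIT_VS_GOV = 1 + 0.093  # 9.3% higher odds

# Direct comparison between the two private ownership types.
forprofit_vs_nonprofit = OR_FORPROFIT_VS_GOV / OR_NONPROFIT_VS_GOV
nonprofit_vs_forprofit = 1 / forprofit_vs_nonprofit

print(f"For-profit vs non-profit: {forprofit_vs_nonprofit:.3f}")
print(f"Non-profit vs for-profit: {nonprofit_vs_forprofit:.3f}")
```

Note that the two directions are not symmetric in percentage terms: an odds ratio of about 1.21 one way corresponds to about 0.83 the other way.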
Hypothesis 6 Cont.
Summary of Findings/ Conclusion for Hypothesis 6
The adjusted R-squared value of 0.06% means that hospital ownership type explains only 0.06% of the variation in readmission among THA and TKA patients.

Private non-profit hospitals have lower readmission odds than government and for-profit hospitals, and private for-profit hospitals have the highest. The original hypothesis is therefore not supported.
Source: Norwalk Hospital
Hypothesis 7
LOS of THA/TKA patients varies by age, gender, medical comorbidities, insurance type, discharge disposition, discharge month, day of admission, ownership of hospital, hospital bed size, elective/non-elective cases, teaching status of hospital, and number of hospitals in the sample for the stratum between 2010 and 2017 in NRD.

Hypothesis 7 Significance
Only 14.66% of the variance in LOS is explained by these explanatory variables; the rest is attributable to other variables that are not included in this estimation model, which points to an unobserved heterogeneity issue. All explanatory variables are statistically significant (p-value < .05), and all VIF values are below 10, which satisfies the multicollinearity assumption of linear regression.


Hypothesis 8

Readmission of THA/TKA patients varies by age, gender, medical comorbidities, insurance type, discharge disposition, discharge month, day of admission, ownership of hospital, hospital bed size, elective/non-elective cases, teaching status of hospital, and number of hospitals in the sample for the stratum between 2010 and 2017 in NRD.

Logistic Regression Significance

The odds ratio tells us how much more likely something is in one group compared to another, and the confidence interval shows the range where the true answer probably lies.
The odds ratios are statistically significant for all variables except self-payers, whose CI includes 1.

Logistic Regression Result
Since the p-values are < .05 and the VIF values are < 10, the predictors are statistically significant and the multicollinearity assumption is met. Therefore, readmission of THA/TKA patients varies by the variables listed in the model.


Regression - Highlights


Hypothesis 8 Conclusions

After conducting many tests using regression analysis methods, we found that these variables have a larger effect on readmission than others:
• The number of medical conditions
• Private Insurance
• Transfers to a short-term hospital
• Self Payers
Quality Control of LOS through I-MR Charts

I Chart (Model 1 - 171 patients)


MR Chart (Model 1 - 171 patients)
Findings from I-MR Charts

• As the charts on the previous slide show, both the individual (I) chart and the moving range (MR) chart have red dots, which means that the process is statistically out of control on both charts.
• While there are some points above the UCL, points are also out of control because they break other control rules, such as eight consecutive points below the center line, two of three consecutive points above +2 sigma, and four of five consecutive points above +1 sigma.
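The control limits on an I-MR chart come from the average moving range: the I chart uses center ± 2.66 × MR-bar, and the MR chart uses an upper limit of 3.267 × MR-bar. The Python sketch below computes these limits and flags points beyond them. The LOS values are hypothetical, not the 171 patients in Model 1, and only the "beyond the control limits" rule is checked, not the run rules listed above.

```python
from statistics import mean


def imr_limits(values):
    """Return (LCL, center line, UCL) for the I chart and the MR chart."""
    moving_ranges = [abs(values[i] - values[i - 1]) for i in range(1, len(values))]
    x_bar = mean(values)
    mr_bar = mean(moving_ranges)
    return {
        "I": (x_bar - 2.66 * mr_bar, x_bar, x_bar + 2.66 * mr_bar),
        "MR": (0.0, mr_bar, 3.267 * mr_bar),
    }


# Hypothetical LOS values in days, for illustration only.
los = [3, 3, 4, 3, 3, 4, 3, 3, 12, 3]
limits = imr_limits(los)
lcl, center, ucl = limits["I"]

# Flag individual observations outside the I-chart control limits.
out_of_control = [x for x in los if x < lcl or x > ucl]
print(f"I chart: LCL={lcl:.2f}, CL={center:.2f}, UCL={ucl:.2f}")
print("Out-of-control points:", out_of_control)
```

Since LOS cannot be negative, a negative lower limit on the I chart is usually treated as zero in practice.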

Source: What is I-MR Chart? How to create in MS Excel? With Excel
Conclusion
THA and TKA are common procedures affecting around 1.3 million Americans per year. Many significant factors contribute to the readmission rate and LOS of these patients, indicating a need for further research to improve their outcomes.

REFERENCES
• American Academy of Orthopaedic Surgeons. (2021). Total hip replacement. https://orthoinfo.aaos.org/en/treatment/total-hip-replacement/
• American Academy of Orthopaedic Surgeons. (2021). Total knee replacement. https://orthoinfo.aaos.org/en/treatment/total-knee-replacement/
• CSense Management Solutions. (2021, September 4). What is an I-MR Chart? How to create in MS Excel? CSense Management Solutions Pvt Ltd. https://www.csensems.com/i-mr-chart-excel/
• Haranas, M. (2022, April 25). Google Unveils New $750M Data Center As Part of $9.5B Goal. The Channel Co. - CRN. https://www.crn.com/news/data-center/google-unveils-new-750m-data-center-as-part-of-9-5b-goal
• Laney, D. (2001). 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Group Research Note.
• Miller, A. & Hawryluk, M. (2023, July 10). Nonprofit hospitals under growing scrutiny over how they justify billions in tax breaks. CNN Health. https://www.cnn.com/2023/07/10/health/nonprofit-hospitals-community-benefits-kff-health-news
• Moeuf, A., Pellerin, R., Lamouri, S., Tamayo-Giraldo, S., & Barbaray, R. (2018). The industrial management of SMEs in the era of Industry 4.0. International Journal of Production Research, 56(3), 1118–1136.
• Nuvance Health (n.d.). Norwalk Hospital. Nuvance Health. https://www.nuvancehealth.org/locations/norwalk-hospital
• Pakdil, F., Muchiri, S. and Azadeh-Fard, N. (2025), "The digital voice of patients and Big Data in the age of Quality 4.0", International Journal of Quality & Reliability Management, Vol. ahead-of-print No. ahead-of-print. https://doi.org/10.1108/IJQRM-09-2024-0324
• Rehman, A., Naz, S. and Razzak, I. (2022), “Leveraging big data analytics in healthcare enhancement: trends, challenges and opportunities”, Multimedia Systems, Vol. 28 No. 4, pp. 1339-1371, doi: 10.1007/s00530-020-00736-8
Q & A
Thank you for listening!
