Data analysis of item quality with the help of SPSS and RStudio

WINE QUALITY ANALYSIS

BMAN70202 Business Improvement Tools Techniques and Systems

Student Numbers:

11387272

10735596

11346986

11464238

11036554

10994228

❖Executive Summary

This analysis examines one dataset comprising two data sheets, one for Red wine with 1,599 entries and another for White wine with 4,898 entries, to understand how different factors affect wine quality. Both sheets include the same kind of information, such as acidity levels, sugar, sulfur dioxide, alcohol content, and a quality score for each wine. We used statistical methods, such as correlations and different predictive models (e.g. linear regression, KNN, and Random Forest), to identify the main factors that determine wine quality and to see whether there are any notable differences between Red and White wines. Our findings show that certain factors, like acidity and alcohol content, are critical in determining the quality of a wine. We also noticed some clear differences in the chemical properties of Red and White wines. These results are helpful for wine producers who want to improve their products and for consumers who are interested in knowing more about what makes a good wine. Additionally, our study provides a foundation for future research in winemaking and could help experts in this field explore new ways to enhance wine quality. Overall, this analysis offers practical information for both the production and selection of quality wines.


Data Overview

White Wine: 4,898 entries, 12 chemical variables, all numerical columns

Red Wine: 1,599 entries, 12 chemical variables, all numerical columns

Data Cleaning

Exploratory Data Analysis (EDA)

Variable distribution - Part 1

Histograms are an excellent visualization tool for understanding how the values of each variable are distributed across the dataset.

After exploring the data, a few key variables were identified, including fixed acidity and alcohol content.

The histograms show a wide disparity in values, which indicates that the data are unevenly distributed.

These variables therefore require further data processing to improve the predictive analysis.

Variable distribution - Part 2

➢ Box plots and their relation to outliers

Box plots are another visualization tool we used. They emphasize:

1. Central Tendency: Reflecting the median of the red and white wine datasets.

2. Variability: Showing the spread of the data through quartiles.

3. Outliers: Highlighting data points that fall outside the typical range.

Key findings from the box plots:

1. Four major variables with significant outliers were identified.

2. These variables require further data processing before the predictive analysis can be accurate.

Correlation Matrix

These matrices help identify the strength and direction of the relationships between pairs of variables.
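As a rough illustration of what such a matrix contains, the sketch below computes Pearson correlation coefficients for every pair of columns. It is written in plain Python rather than the SPSS/RStudio tooling used for this analysis, the function names `pearson_r` and `correlation_matrix` are hypothetical, and the `toy` values are made up for illustration, not taken from the wine data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every (name_a, name_b) pair of columns to its Pearson r."""
    names = list(columns)
    return {(a, b): pearson_r(columns[a], columns[b])
            for a in names for b in names}

# Made-up illustrative values, NOT taken from the wine datasets.
toy = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "density": [0.9978, 0.9970, 0.9962, 0.9951, 0.9940, 0.9932],
    "quality": [5, 5, 6, 6, 7, 7],
}
matrix = correlation_matrix(toy)
for name in ("alcohol", "density"):
    print(name, "vs quality:", round(matrix[(name, "quality")], 2))
```

The sign of each coefficient gives the direction of the relationship and its magnitude gives the strength, which is how the values quoted for each wine type should be read.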

White Wine quality correlations

Through the correlation matrix analysis, we identified the four variables most strongly associated with white wine quality:

Positive

1. Alcohol: positively correlated with quality (R = 0.44), which implies that higher alcohol content is moderately associated with higher quality ratings.

2. Acidity: pH shows a slight positive correlation with quality (R = 0.1), which indicates that pH only slightly influences quality perception.

Negative

3. Chlorides: negatively correlated with quality (R = -0.21), which implies that higher salinity is often linked to lower quality.

4. Density: also negatively correlated with quality (R = -0.31), which suggests that wines with lower density, which typically have less residual sugar, may be rated higher.

Red Wine quality correlations

Through the correlation matrix analysis, we identified the four variables most strongly associated with red wine quality:

Positive

1. Alcohol: The strongest positive correlation (R = 0.48) suggests that higher alcohol content is associated with better quality ratings in red wine.

2. Sulphates: Moderate positive correlation (R = 0.25) indicates that sulphates, which aid in preservation, are associated with higher quality.

Negative

3. Acidity: Notable negative correlation (R = -0.39) with volatile acidity highlights its detrimental effect on quality perception.

4. Sulfur: Negative correlation (R = -0.18) with total sulfur dioxide suggests a potential negative impact on quality at higher concentrations.

Further Data processing

Outlier Removal

Execution: Outliers were systematically removed from each variable in the datasets, excluding the 'quality' variable, which was considered the target variable for potential predictive modeling. This resulted in a cleaner dataset with fewer extreme values, likely to provide a more accurate representation of the typical chemical profile of the wines.

Methodology: The Interquartile Range (IQR) method was employed to detect outliers. This involved calculating the IQR for each variable and defining outliers as those data points that fell below the lower bound or above the upper bound (conventionally Q1 - 1.5 × IQR and Q3 + 1.5 × IQR).

Post-outlier removal: The removal process reduced the Red Wine dataset from 1,599 to 1,135 entries (a reduction of approximately 29%) and the White Wine dataset from 4,898 to 3,973 entries (a reduction of approximately 18.9%). These cleaned datasets are expected to offer a more robust basis for further analysis, as they are less influenced by extreme outlier values.
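The procedure described above can be sketched as follows. This is a minimal illustration in plain Python, not the SPSS/RStudio workflow actually used. It assumes the conventional multiplier k = 1.5, bounds computed once on the full data, and the "inclusive" quartile method; none of these are stated in the analysis. The helper names and the toy rows are hypothetical.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) outlier bounds; k = 1.5 is an assumed convention."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def remove_outliers(rows, target="quality", k=1.5):
    """Drop every row that is an outlier in any column except the target."""
    features = [name for name in rows[0] if name != target]
    bounds = {name: iqr_bounds([row[name] for row in rows], k)
              for name in features}
    return [row for row in rows
            if all(bounds[n][0] <= row[n] <= bounds[n][1] for n in features)]

def percent_reduction(before, after):
    """Share of rows removed, in percent."""
    return 100 * (before - after) / before

# Made-up rows for illustration; the last one has an extreme alcohol value.
toy_rows = [
    {"alcohol": 9.0, "quality": 5},
    {"alcohol": 9.5, "quality": 5},
    {"alcohol": 10.0, "quality": 6},
    {"alcohol": 10.5, "quality": 6},
    {"alcohol": 11.0, "quality": 7},
    {"alcohol": 20.0, "quality": 3},
]
cleaned = remove_outliers(toy_rows)
print(len(toy_rows), "->", len(cleaned), "rows")

# Checking the reductions reported above from the stated row counts.
print(round(percent_reduction(1599, 1135), 1))  # red wine
print(round(percent_reduction(4898, 3973), 1))  # white wine
```

The 'quality' column is skipped when the bounds are computed, mirroring the decision to treat it as the target variable.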

Predictive Analysis

Comparison of Regression models

➢ Three regression models were applied to predict wine quality: multiple linear regression, K-nearest neighbours (KNN), and random forest.

➢ Each model was evaluated using metrics such as mean absolute error (MAE), mean squared error (MSE), and R-squared.
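For reference, the three evaluation metrics can be computed as in the sketch below. It uses the standard definitions in plain Python; the function names are hypothetical and the `actual`/`predicted` values are made up, so they do not reproduce any result from this analysis.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average squared difference; penalises large errors more heavily."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up quality scores and predictions, for illustration only.
actual = [5, 6, 7, 5, 6]
predicted = [5.5, 6.0, 6.5, 5.0, 6.5]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
r2 = r_squared(actual, predicted)
print(f"MAE={mae:.3f}  MSE={mse:.3f}  R2={r2:.3f}")
```

Lower MAE and MSE and a higher R-squared indicate a better fit, so the three models can be ranked on the same held-out data.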

Feature Engineering

➢ To further enhance model training and prediction accuracy, feature engineering was applied.

➢ This technique slightly improved model performance, as shown in the red wine values.

➢ The white wine values, however, showed no similar improvement and remained unchanged.

Recommendations

Managers should pay close attention to variables directly correlated with wine quality, such as the acidity variables, alcohol content, and sulphates. They should also monitor multicollinearity among the other variables, since any adjustment to those variables can have an indirect impact on quality.

The dataset at hand provides several variables that are known to affect wine quality; however, adding further variables could improve the predictive accuracy of the model, for example:

Grape Variety: Including grape types would allow for an analysis of how specific varieties correlate with wine quality.

Vintage Year: Including the year of production may reveal how age influences quality.

Climate Data: Information on weather conditions during the growing and harvesting periods can be critical in determining grape quality.

In relation to the second recommendation, we suggest that future marketing campaigns use the collected data to expand the company's outreach, educate consumers, and raise their awareness of wine quality.

Q&A
