Machine learning is an application of artificial intelligence (AI) that gives systems the ability to learn and improve from experience automatically, without being explicitly programmed.
In this case study we will:
- Understand how different features affect wine quality.
- Predict the quality of wine.
The dataset I picked from Kaggle has 4898 observations of 12 variables:
- Fixed.acidity: a measurement of the total concentration of titratable acids.
- Volatile.acidity: a measure of steam-distillable acids.
- Citric.acid: one of the many acids that are measured to obtain fixed acidity.
- Residual.sugar: a measurement of any natural grape sugars that are left over after fermentation ceases.
- Chlorides: the amount of salt in the wine.
- Free.sulfur.dioxide: the free form of SO2, which exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion.
- Total.sulfur.dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
- Density: a measure of the density of the wine.
- pH: the pH value of the wine.
- Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
- Alcohol: the percentage of alcohol present in the wine.
- Quality: a subjective measurement ranging from 1 to 10 (although the observed data ranges from 3 to 8).
We check the relationship between the variables using correlation.
Correlation: a statistical measure of the strength and direction of the linear relationship between two variables. The value of a correlation coefficient ranges between -1 and 1.
- Correlation is Positive when the values increase together.
- Correlation is Negative when one value decreases as the other increases.
- There is a very strong positive relationship between density and residual sugar (0.84).
- There is a strong negative relationship between alcohol and density (-0.78).
- There is a strong positive relationship between free sulfur dioxide and total sulfur dioxide (0.62).
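To make the coefficient concrete, here is a minimal sketch of how a Pearson correlation can be computed by hand. The function name `pearson` and the five sample values are illustrative assumptions, not rows from the wine dataset, and this is not necessarily how the original analysis computed its numbers.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only (assumption, not real data).
alcohol = [9.0, 10.0, 11.0, 12.0, 13.0]
density = [0.9990, 0.9975, 0.9960, 0.9950, 0.9930]

r = pearson(alcohol, density)
print(round(r, 2))  # negative: density falls as alcohol rises
```

A value close to -1 or 1 indicates a strong linear relationship, and a value near 0 indicates a weak one.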
Now let us reduce the levels of quality to three:
- Low for quality equal to 3, 4 or 5.
- Medium for quality equal to 5 or 6.
- High for quality equal to 7 or higher.
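The regrouping can be sketched as a small mapping function. Because a score of 5 appears under both Low and Medium in the list above, the sketch has to pick one: it assumes 5 counts as Medium by default and exposes the cut-off as a parameter. The names `quality_level`, `medium_from` and `high_from` are hypothetical.

```python
def quality_level(score, medium_from=5, high_from=7):
    """Map a numeric quality score to Low / Medium / High.

    Assumption: scores >= medium_from are Medium and scores >= high_from
    are High. Pass medium_from=6 to treat a score of 5 as Low instead.
    """
    if score >= high_from:
        return "High"
    if score >= medium_from:
        return "Medium"
    return "Low"

scores = [3, 4, 5, 6, 7, 8]
levels = [quality_level(s) for s in scores]
print(levels)
```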
Normalization is the process of re-scaling the original data without changing its behavior or nature. We define a new boundary (0, 1).
Pros: it puts every feature on a common scale, so that features with large numeric ranges do not dominate distance-based methods such as kNN.
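A minimal sketch of re-scaling one feature to the (0, 1) boundary, assuming min-max scaling is the intended method. The function name `min_max_scale` and the sample alcohol values are illustrative assumptions.

```python
def min_max_scale(values):
    """Re-scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Made-up alcohol percentages for illustration only.
alcohol = [8.8, 9.5, 11.0, 14.2]
scaled = min_max_scale(alcohol)
print(scaled)
```

The order of the values and their relative spacing are preserved; only the scale changes.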
Building The Predictive Model
Supervised Learning Algorithms
kNN-Algorithm: k nearest neighbours is a non-linear machine learning algorithm that stores all available cases and classifies each new case by a majority vote of its k nearest neighbours.
Cons: prediction takes a lot of computational power, because every new case has to be compared against all of the stored cases.
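The majority vote can be sketched in a few lines. This is a simplified illustration, not the code from the linked repository: the function name `knn_predict`, the Euclidean distance, and the tiny two-feature training set (already scaled to the 0 to 1 range) are all assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by a majority vote of its k nearest training cases."""
    # Distance from the query to every stored case: this full scan is
    # what makes prediction expensive on large datasets.
    distances = sorted(
        ((math.dist(row, query), label) for row, label in zip(train_X, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in distances[:k])
    # On a tied vote, the label seen first among the nearest cases wins.
    return votes.most_common(1)[0][0]

# Hypothetical, already-normalized features: (alcohol, volatile acidity).
train_X = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
           (0.20, 0.80), (0.10, 0.90), (0.15, 0.70)]
train_y = ["High", "High", "High", "Low", "Low", "Low"]

prediction = knn_predict(train_X, train_y, (0.75, 0.25), k=3)
print(prediction)
```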
Now we can predict the wine quality for new observations using this model.
- The feature that affects wine quality the most is the percentage of alcohol in the wine.
- A predictive model built with the kNN algorithm and evaluated using k-fold cross-validation gives 78% accuracy.
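For readers unfamiliar with k-fold cross-validation, the idea is to split the data into folds, hold each fold out in turn as a test set, train on the rest, and average the accuracies. The sketch below illustrates this on synthetic, well-separated points. The helper names, the fold count, and the data are assumptions, and it does not reproduce the 78% figure.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k):
    """`train` is a list of (features, label) pairs."""
    nearest = sorted(train, key=lambda case: math.dist(case[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def k_fold_accuracy(data, n_folds=5, k=3, seed=0):
    """Mean accuracy of kNN over n_folds held-out folds."""
    rows = list(data)
    if n_folds < 2 or n_folds > len(rows):
        raise ValueError("n_folds must be between 2 and the number of rows")
    random.Random(seed).shuffle(rows)
    # Deal the shuffled rows into n_folds roughly equal folds.
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test_fold in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [case for j, fold in enumerate(folds) if j != i for case in fold]
        correct = sum(knn_predict(train, x, k) == y for x, y in test_fold)
        scores.append(correct / len(test_fold))
    return sum(scores) / len(scores)

# Synthetic two-class data for illustration only (not the wine dataset).
rng = random.Random(1)
data = [((rng.uniform(0.0, 0.3), rng.uniform(0.0, 0.3)), "Low") for _ in range(20)]
data += [((rng.uniform(0.7, 1.0), rng.uniform(0.7, 1.0)), "High") for _ in range(20)]

accuracy = k_fold_accuracy(data, n_folds=5, k=3)
print(accuracy)
```

Because every case is used for testing exactly once, the averaged score is a less optimistic estimate than accuracy measured on the training data itself.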
Full code: https://github.com/SaraKmair/WineQuality
Sara Kmair is a passionate problem solver, challenge seeker and a highly motivated data analyst with a Bachelor’s degree in Mechatronics, Robotics and Automation Engineering from Tishreen University, Latakia.