Data import
Exploratory data analysis (looking for missing and duplicated data points)
Specifying the performance metrics (confusion matrix, precision, recall, F1-score, and accuracy in this case)
Splitting the data into the training set and testing set
Training the Random Forest and SVM models on the training set and evaluating them on the testing set
Assessing the performance metrics of the developed models
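The data-checking and splitting steps above can be sketched as follows. This is a minimal illustration on a made-up table, not the notebook's actual code: the column names, the values, the 25% test size, and the stratified split are all assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset; column names and values are made up.
df = pd.DataFrame({
    "Weight": [60.0, 85.0, None, 110.0, 85.0, 72.0, 95.0, 120.0, 64.0, 101.0],
    "Height": [1.70, 1.65, 1.80, 1.75, 1.65, 1.60, 1.82, 1.78, 1.68, 1.74],
    "Level":  ["normal", "obese", "normal", "obese", "obese",
               "normal", "obese", "obese", "normal", "normal"],
})

# Exploratory checks: missing values per column and fully duplicated rows.
missing_per_column = df.isna().sum()
n_duplicates = int(df.duplicated().sum())
print(missing_per_column)
print("duplicated rows:", n_duplicates)

# Drop incomplete and repeated rows before modelling.
clean = df.dropna().drop_duplicates().reset_index(drop=True)

# Stratify so both sets keep the same class proportions (an assumption here).
X = clean[["Weight", "Height"]]
y = clean["Level"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(len(X_train), "training rows,", len(X_test), "testing rows")
```

Stratifying matters most when some obesity levels have few samples, since an unstratified split could leave a class underrepresented in the testing set.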
The following is the exploratory data analysis I implemented:
Weight Boxplots for Each Obesity Level:
Analysis of weight distributions across different obesity levels reveals distinct patterns.
For obesity types I through III, weights range from approximately 70 to 160 pounds, indicating variability within each category.
The overweight levels span roughly 50 to 100 pounds, with notable variation.
Outliers appear in the overweight level II, obesity type II, and obesity type III categories.
Differences in minimum, maximum, and median weights are observed across the various obesity levels, underscoring the heterogeneity within each category.
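To make the notion of a boxplot outlier concrete, here is a small sketch of the usual 1.5 x IQR whisker rule applied per obesity level. The weights are hypothetical, and the rule itself is an assumption: it is the common default for boxplots, but the original plot may have used different settings.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values that fall outside the 1.5 * IQR boxplot whiskers."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical weights per obesity level, for illustration only.
weights_by_level = {
    "Overweight_Level_II": [80, 82, 85, 87, 90, 92, 95, 130],
    "Obesity_Type_I": [88, 90, 93, 95, 97, 100, 102, 104],
}

for level, weights in weights_by_level.items():
    print(level, "outliers:", iqr_outliers(weights))
```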
The correlation of all features with each obesity level: the correlation plot shows that the features most strongly correlated with the obesity levels are Gender, Height, Weight, FCVC, CAEC, and CALC.
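One way such a feature-versus-level correlation table could be computed is sketched below. It assumes that categorical features are integer-encoded and that the obesity level is one-hot encoded, so each level gets its own column of correlations. The text does not say how the plot was produced, so the encoding, the column names, and the data here are all hypothetical.

```python
import pandas as pd

# Hypothetical records; only two features are shown for brevity.
df = pd.DataFrame({
    "Weight": [55, 60, 82, 88, 105, 118],
    "Gender": ["Female", "Male", "Female", "Male", "Male", "Female"],
    "Level": ["Normal", "Normal", "Overweight", "Overweight", "Obesity", "Obesity"],
})

features = df[["Weight", "Gender"]].copy()
# Integer-encode the categorical feature (categories are sorted alphabetically).
features["Gender"] = features["Gender"].astype("category").cat.codes.astype(float)

# One indicator column per obesity level.
level_dummies = pd.get_dummies(df["Level"], dtype=float)

# Pearson correlation of every feature with every level indicator.
corr = pd.DataFrame(
    {level: features.corrwith(level_dummies[level]) for level in level_dummies.columns}
)
print(corr.round(2))
```

Pearson correlation on integer-encoded categories depends on the arbitrary code order, so for features with more than two categories the values are best read as a rough screening signal.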
Family history of overweight with obesity level: the plot indicates that people with obesity type I and above are the most likely to have a family history of overweight.
Gender with obesity: obesity type II is almost entirely male and obesity type III is almost entirely female. The other levels are split roughly 50/50 between males and females, except overweight level II, where males outnumber females.
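Category-by-level shares like the gender and family-history breakdowns above can be tabulated with a normalized cross-tabulation. The sketch below uses hypothetical records and column names, and the same pattern would apply to a family-history column.

```python
import pandas as pd

# Hypothetical records; the real notebook's column names and values may differ.
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Female", "Female", "Female", "Male", "Female"],
    "Level": ["Obesity_Type_II", "Obesity_Type_II", "Normal", "Obesity_Type_III",
              "Obesity_Type_III", "Normal", "Normal", "Normal"],
})

# normalize="index" turns counts into per-level shares, so each row sums to 1.
gender_share = pd.crosstab(df["Level"], df["Gender"], normalize="index")
print(gender_share.round(2))
```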
Modeling:
For the machine learning algorithms, I compared the performance of two models: Support Vector Machines (SVM) and Random Forest (RF). I used Grid Search for the hyperparameter optimization of both models, which selected C=10 and gamma=0.1 for the SVM model and 300 trees for the RF model. The following tables show the classification reports for both models.
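A grid search over both models might look like the sketch below, which uses scikit-learn's GridSearchCV on synthetic data. Only C=10, gamma=0.1, and 300 trees come from the text. The other grid values, the RBF kernel, the feature scaling, the 3-fold cross-validation, and the synthetic dataset are assumptions made for illustration, so the selected parameters here will not necessarily match the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic multi-class data standing in for the obesity dataset.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# SVM: scaling is included because RBF kernels are sensitive to feature scale.
svm_search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [1, 10, 100], "svc__gamma": [0.01, 0.1, 1]},
    cv=3, scoring="accuracy",
)

# Random Forest: only the number of trees is searched in this sketch.
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300]},
    cv=3, scoring="accuracy",
)

test_scores = {}
for name, search in [("SVM", svm_search), ("RF", rf_search)]:
    search.fit(X_train, y_train)  # refits the best setting on all training data
    test_scores[name] = search.score(X_test, y_test)
    print(name, search.best_params_, "test accuracy:", round(test_scores[name], 2))
```

Keeping the test set out of the search, as done here, means the reported test accuracy is not biased by the hyperparameter selection.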
RF Model
SVM Model
Performance Evaluation
Random Forest Results:
Precision: The precision values for each class range from 0.77 to 1.00. Random Forest achieves high precision across most classes, indicating low false positive rates.
Recall: The recall values range from 0.82 to 0.99. Random Forest exhibits high recall rates, indicating low false negative rates across most classes.
F1-score: The F1-scores range from 0.84 to 0.99, reflecting a balance between precision and recall for each class.
Accuracy: The overall accuracy of the Random Forest model is 0.93, indicating high predictive performance across all classes.
Support Vector Machine (SVM) Results:
Precision: The precision values for each class range from 0.78 to 1.00, indicating the proportion of correctly predicted instances among all instances predicted as belonging to that class. Higher precision values indicate fewer false positives.
Recall: The recall values range from 0.74 to 0.99, representing the proportion of correctly predicted instances of each class among all actual instances of that class. Higher recall values indicate fewer false negatives.
F1-score: The F1-score, which is the harmonic mean of precision and recall, ranges from 0.76 to 0.99. It provides a balance between precision and recall and is useful when the classes are imbalanced.
Accuracy: The overall accuracy of the SVM model is 0.91, indicating the proportion of correctly predicted instances across all classes.
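The metric definitions above can be made concrete with a short sketch that derives per-class precision, recall, and F1-score, plus overall accuracy, from a confusion matrix. The matrix and class labels are hypothetical and are not the results reported here.

```python
def per_class_metrics(cm, labels):
    """cm[i][j] is the number of samples of true class i predicted as class j."""
    n = len(labels)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(n))
    report = {}
    for k, label in enumerate(labels):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, actually another class
        fn = sum(cm[k]) - tp                       # actually k, predicted another class
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report, correct / total

# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
labels = ["A", "B", "C"]
cm = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]

report, accuracy = per_class_metrics(cm, labels)
for label, scores in report.items():
    print(label, {name: round(value, 2) for name, value in scores.items()})
print("accuracy:", round(accuracy, 2))
```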
Comparison:
Accuracy: The Random Forest model outperforms the SVM model in terms of overall accuracy, achieving an accuracy of 0.93 compared to 0.91 for SVM.
Precision and Recall: Both models exhibit high precision and recall values across most classes. However, the Random Forest model generally achieves slightly higher values. The gap is clearest in the lowest per-class recall (0.82 versus 0.74 for SVM), while the lowest per-class precision values are nearly identical (0.77 versus 0.78).
F1-score: Both models demonstrate high F1-scores (0.84 to 0.99 for Random Forest and 0.76 to 0.99 for SVM), indicating a good balance between precision and recall for each class.
Robustness: The Random Forest model performs more consistently across classes: its weakest class still reaches an F1-score of 0.84, compared with 0.76 for the SVM model.
In summary, while both models demonstrate strong predictive performance, the Random Forest model emerges as the superior choice due to its higher accuracy and robustness.
The code and data information are available in the following GitHub repository:
https://github.com/israajashaami/Codes/blob/main/Obesity.ipynb