Addressing Class Imbalance of Health Data: A Systematic Literature Review on Modified Synthetic Minority Oversampling Technique (SMOTE) Strategies
DOI: http://dx.doi.org/10.62527/joiv.8.3.2283
Abstract
The Synthetic Minority Oversampling Technique (SMOTE) is the baseline method for handling imbalanced data. SMOTE generates new synthetic samples by linear interpolation between a minority class sample and its k-nearest minority neighbors. However, SMOTE has weaknesses: it overgeneralizes when noisy samples are oversampled, and it increases the overlap between classes near the decision boundary, which can introduce further noise. Given these weaknesses, this research conducts a systematic literature review of approaches that modify SMOTE to handle imbalanced data. The review method comprises keyword identification, the article search process, determination of selection criteria, and selection of results based on those criteria. The results show that SMOTE modifications are based on filtering, clustering, and distance modification, all aimed at reducing the noise that SMOTE produces. The filtering approach removes noisy data before SMOTE is applied, which has a positive effect on handling imbalanced data. The clustering approach minimizes overlapping synthetic minority samples that could act as noise. The most frequently used datasets are Pima (60%) and Haberman (50%). The most frequently used performance metrics for imbalanced data are F1-measure (57%), accuracy (55%), recall (43%), and AUC (27%). These results open opportunities for further research on modifying SMOTE to address imbalance in health data, especially in handling noisy and overlapping data. The thoroughness of our literature review should instill confidence in the research community.
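The interpolation step described in the abstract can be illustrated with a minimal sketch. It assumes Euclidean distance and a purely numeric minority-class matrix; the function name `smote_sketch` and its parameters are hypothetical and are not taken from the reviewed papers or from any library.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper).

    X_min : array of shape (n_samples, n_features), minority class only.
    n_new : number of synthetic samples to generate.
    k     : number of nearest minority neighbors considered per sample.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)  # a sample cannot have more than n - 1 neighbors
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances between minority samples.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # exclude each sample from its own neighbors

    # Indices of the k nearest minority neighbors of every sample.
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its k neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]

    # Linear interpolation: new = base + gap * (neighbor - base), gap in [0, 1).
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])
```

Because each synthetic point lies on the segment between a base sample and one of its neighbors, a noisy or mislabeled base sample spreads its noise into the new data. This is the weakness that the filtering, clustering, and distance-based modifications covered by the review aim to reduce.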