Comparative Analysis of Imputation Methods for Enhancing Predictive Accuracy in Data Models

Nurul Aqilah Zamri - Faculty of Computing, Universiti Malaysia Pahang Al-Sultan Abdullah, Pekan, Malaysia
M. Izham Jaya - Faculty of Computing, Universiti Malaysia Pahang Al-Sultan Abdullah, Pekan, Malaysia
Indrarini Dyah Irawati - School of Applied Science, Telkom University, Bandung, Indonesia
Taha H. Rassem - School of Computer Science and Informatics, De Montfort University, Leicester, United Kingdom
Rasyidah - Department of Information Technology, Politeknik Negeri Padang, Padang, Indonesia
Shahreen Kasim - Soft Computing and Data Mining Centre (SMC), Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia



DOI: http://dx.doi.org/10.62527/joiv.8.3.1666

Abstract


Missing values in a dataset can introduce bias and impede a predictive algorithm's ability to discern patterns and make accurate predictions. This paper examines and compares five prevalent data imputation methods: list-wise deletion (IGN), mean imputation (AVG), K-Nearest Neighbors (KNN), MissForest (MF), and Predictive Mean Matching (PMM). The dataset consists of financial data on S&P 500 companies from the Compustat North America database. The training and validation dataset contains 1973 instances, drawn from the fourth quarter of 2009, the first quarter of 2010, and the third quarter of 2014. Within this set, 457 missing values were identified and imputed. The test dataset comprises 197 instances randomly selected from the fourth quarter of 2014, equivalent to ten percent of the number of instances in the training dataset. In the evaluation, the dataset produced by MF imputation performed best among all the imputed datasets. These findings are intended to help practitioners choose a suitable data imputation method, particularly for predictive modeling tasks.
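
As a rough illustration of how three of the simpler methods named above behave (IGN, AVG, and KNN), the following is a minimal sketch on a made-up toy table. It is not the authors' implementation: the function names, the toy values, the distance measure, and the choice of k = 2 are all assumptions made for illustration, and MissForest and PMM are omitted because they require a fitted model.

```python
import math

# Hypothetical toy table (not the Compustat data); None marks a missing value.
rows = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.2],
    [5.0, 6.0, 7.0],
    [5.2, 6.1, None],
    [0.9, 2.1, 2.9],
    [4.9, 5.9, 6.8],
]


def listwise_deletion(table):
    """IGN: drop every row that contains at least one missing value."""
    return [row[:] for row in table if all(v is not None for v in row)]


def mean_imputation(table):
    """AVG: replace each missing value with the mean of the observed values in its column."""
    means = []
    for j in range(len(table[0])):
        observed = [row[j] for row in table if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in table]


def knn_imputation(table, k=2):
    """KNN: replace each missing value with the mean of that column over the
    k nearest rows in which the column is observed."""

    def distance(a, b):
        # Root mean squared difference over the columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in table]
    for i, row in enumerate(table):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row with column j observed.
            donors = [
                (distance(row, other), other[j])
                for m, other in enumerate(table)
                if m != i and other[j] is not None
            ]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


ign = listwise_deletion(rows)
avg = mean_imputation(rows)
knn = knn_imputation(rows, k=2)
print(len(ign), "complete rows remain after list-wise deletion")
print("mean-imputed row:", avg[1])
print("KNN-imputed row: ", knn[1])
```

On this toy table, list-wise deletion discards the two incomplete rows, mean imputation fills each gap with a column-wide average that ignores the two visible clusters of rows, and KNN fills each gap from the most similar rows. The sketch does not scale the columns before computing distances, which a real financial dataset would likely need.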

Keywords


Missing value; imputation; predictive modeling; machine learning

