Classification of Air Pollutant Index on Data with Outliers and Imbalance Class Problem

Dwi Krisbiantoro - Universitas Amikom Purwokerto, Purwokerto 53127, Indonesia
Retno Waluyo - Universitas Amikom Purwokerto, Purwokerto 53127, Indonesia
Uswatun Hasanah - Universitas Negeri Semarang, Semarang, Indonesia
Irfan Pratama - Universitas Mercu Buana Yogyakarta, Yogyakarta, Indonesia
Sarmini Sarmini - Universitas Amikom Purwokerto, Purwokerto 53127, Indonesia


DOI: http://dx.doi.org/10.62527/joiv.8.3.1993

Abstract


Air pollution has become a global issue that has drawn attention from many countries. Jakarta, Indonesia's capital city, is not exempt from this problem. This study uses the pollutant parameters PM10, SO2, CO, O3, and nitrogen dioxide (NO2) to categorize Jakarta's air quality. The data are daily measurements taken from the Air Quality Monitoring Station in Jakarta throughout 2020. The methods used include SVM, Random Forest, Logistic Regression, KNN, CART, and a Stacking Algorithm. At the data preparation stage, we found missing values, outliers, and class imbalance problems. Before applying the machine learning methods and evaluating accuracy, we used data pre-processing techniques such as the MissForest method, median substitution, and ADASYN. The results show that the original dataset yields higher accuracy scores (0.882–0.977) than the balanced dataset (0.829–0.976). According to the evaluation results, the Random Forest method has the highest accuracy score for both the original and balanced datasets. The overall result is better than that of a comparable earlier study, which reported 96.61% accuracy using a neural network. This shows that preprocessing steps such as missing-value handling and imbalanced-class handling are essential to classification performance.
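As an illustration of the median substitution step mentioned above, the following minimal sketch replaces outlying readings in a single pollutant series with the series median. The abstract does not state how outliers were detected, so the 1.5 × IQR fence rule used here is an assumption, as are the function name `replace_outliers_with_median` and the sample PM10 values, which are hypothetical and not taken from the study's data.

```python
import statistics


def replace_outliers_with_median(values, k=1.5):
    """Replace values outside the IQR fences with the series median.

    Assumption: outliers are flagged with the common k * IQR rule;
    the paper's abstract does not specify the detection criterion.
    """
    # Quartiles of the series (default 'exclusive' method).
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # The median is robust, so it is computed on the raw series.
    median = statistics.median(values)
    return [median if (v < lower or v > upper) else v for v in values]


# Hypothetical daily PM10 readings; 310 is an implausible spike.
pm10 = [52, 48, 55, 61, 47, 310, 58, 50, 53, 49]
cleaned = replace_outliers_with_median(pm10)
print(cleaned)
```

In this toy series only the spike falls outside the fences, so it alone is replaced by the median while the remaining readings are left unchanged. In a real pipeline the same function would be applied per pollutant column, after missing values have been imputed.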


Keywords


Air pollution; imbalance class; random forest.

