Text Classification Using Genetic Programming with Implementation of Map Reduce and Scraping
DOI: http://dx.doi.org/10.30630/joiv.7.2.1813
Abstract
Classifying text documents from online media is a big data problem that requires automation. Classification accuracy can decrease when many terms are ambiguous between classes. Hadoop MapReduce is a parallel processing framework that has been widely used for text processing on big data. This study presents text classification using genetic programming, with text pre-processing on Hadoop map-reduce and data collection through web scraping. Genetic programming is used to perform association rule mining (ARM) before text classification in order to analyze patterns in the data. The data are articles from ScienceDirect retrieved with three keywords. The study aims to combine ARM-based data pattern analysis with a pipeline of data collection through web scraping, pre-processing using map-reduce, and text classification using genetic programming. Web scraping collected the data and reduced duplicates by as many as 17718. Map-reduce performed tokenization and stop-word removal, yielding 36639 terms, of which 5189 are unique terms and 31450 are common terms. Evaluating ARM with different amounts of multi-tree data shows that it can produce more rules, longer rules, and better support. The multi-tree also produces more specific rules and better ARM performance than a single tree. The text classification evaluation shows that a single tree achieves better accuracy (0.7042) than a decision tree (0.6892), with the multi-tree lowest (0.6754). The evaluation also shows that the ARM results are not in line with the classification results: there, the multi-tree gives the best result (0.3904), followed by the decision tree (0.3588), with the single tree lowest (0.356).
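The pre-processing step named in the abstract (tokenization and stop-word removal with map-reduce) can be illustrated with a minimal in-memory sketch. This is not the study's Hadoop job: the stop-word list, the regular-expression tokenizer, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are all assumptions chosen for illustration, and the three phases run in a single Python process rather than on a cluster.

```python
import re
from collections import defaultdict

# Assumed, illustrative stop-word list; the study's actual list is not given.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "for", "in", "on", "to"}


def map_phase(text):
    """Mapper: tokenize one document and emit (term, 1) for each kept term."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOP_WORDS:
            yield token, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, counts):
    """Reducer: sum the occurrences of a single term."""
    return term, sum(counts)


def run_job(documents):
    """Run map, shuffle, and reduce over a dict of {doc_id: text}."""
    pairs = [pair for text in documents.values() for pair in map_phase(text)]
    return dict(reduce_phase(term, counts) for term, counts in shuffle(pairs).items())


if __name__ == "__main__":
    docs = {
        "d1": "The classification of text",
        "d2": "Text mining and classification for big data",
    }
    print(run_job(docs))
```

In a real Hadoop deployment the mapper and reducer would run as separate tasks over file splits; the sketch only shows how term counts emerge from the map, shuffle, and reduce steps.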