A Framework of Mutual Information Kullback-Leibler Divergence based for Clustering Categorical Data

Iwan Tri Yanto - Department of Information System, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
Ririn Setiyowati - Department of Mathematics, Universitas Sebelas Maret, Surakarta, Indonesia
Nur Azizah - Department of Mathematics, Universitas Ahmad Dahlan, Yogyakarta, Indonesia
Rasyidah - Department of Information Technology, Politeknik Negeri Padang, Indonesia

DOI: http://dx.doi.org/10.30630/joiv.5.1.462


Clustering is the process of grouping a set of objects into clusters so that similar objects fall in the same cluster and dissimilar objects fall in different clusters. The fuzzy k-means algorithm is a clustering algorithm that partitions data into k clusters, using Euclidean distance as the distance function. This research discusses clustering categorical data using Fuzzy k-Means Kullback-Leibler Divergence. The distance between a data point and a cluster center is determined using mutual information, that is, the Kullback-Leibler divergence between the joint distribution and the product of the two marginal distributions. Extensive theoretical analysis was performed to show the effectiveness of the proposed method. The proposed method was also compared with the Fuzzy Centroid and Fuzzy k-Partition approaches in terms of response time and clustering accuracy on several datasets from the UCI Machine Learning Repository. The experimental results show that the proposed algorithm provides good results in both clustering quality and accuracy for categorical data compared with Fuzzy Centroid and Fuzzy k-Partition.
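The identity the abstract relies on, mutual information as the Kullback-Leibler divergence between a joint distribution and the product of its marginals, can be illustrated with a minimal sketch. This is the textbook definition only, not the authors' clustering algorithm; the function names and the two toy joint tables are assumptions chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mutual_information(joint):
    """I(X;Y) computed as KL(joint || product of the two marginals).

    `joint` is a table (list of rows) of probabilities that sum to 1.
    """
    row_marginal = [sum(row) for row in joint]        # P(X = x)
    col_marginal = [sum(col) for col in zip(*joint)]  # P(Y = y)
    flat_joint = [p for row in joint for p in row]
    flat_product = [px * py for px in row_marginal for py in col_marginal]
    return kl_divergence(flat_joint, flat_product)

# Hypothetical toy tables: independent vs. perfectly dependent variables.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]

mi_independent = mutual_information(independent)
mi_dependent = mutual_information(dependent)
```

For the independent table the joint equals the product of its marginals, so the divergence is zero; for the perfectly dependent table it equals log 2 nats, the entropy of either variable.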


Kullback-Leibler divergence; mutual information; fuzzy k-means; categorical data; clustering.

