IEcons: A New Consensus Approach Using Multi-Text Representations for Clustering Task
Abstract
Today a wide range of text representations can be generated, from the simple bag-of-words (BOW) model to recent transformers that capture the semantic and contextual meaning of text. It has been shown that no single text representation is best for the text clustering task. Consequently, some works combine text representations using a consensus clustering approach. Two types of consensus approach exist: explicit and implicit. In explicit consensus, also known as ensemble clustering, the consensus function is applied a posteriori, after cluster labels have been obtained by clustering each text representation, which makes it possible to capture global mutual information between the partitions of all text representations. Implicit consensus, on the other hand, uses tensor clustering to optimize the consensus partition directly from the similarity matrices of the text representations. In this paper, we propose a new consensus text clustering algorithm named IEcons (Implicit-Explicit consensus) that optimizes explicit and implicit consensus clustering simultaneously, using text embeddings together with a tensor representation of the texts built from similarity matrices. We compare our algorithm with others from the literature on five textual datasets using several performance criteria. The results indicate that our algorithm is the best choice in most situations.
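To make the notion of explicit consensus more concrete, the following is a minimal sketch of one common ensemble-clustering scheme: a co-association matrix followed by average-linkage clustering. It is not the IEcons algorithm and is not taken from the paper. The function names `co_association` and `explicit_consensus`, the toy label vectors, and the choice of average linkage are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def co_association(partitions):
    """Fraction of base partitions in which each pair of documents shares a cluster."""
    labels = np.asarray(partitions)  # shape: (n_partitions, n_documents)
    same_cluster = labels[:, :, None] == labels[:, None, :]
    return same_cluster.mean(axis=0)


def explicit_consensus(partitions, n_clusters):
    """Illustrative consensus function (an assumption, not the paper's method):
    average-linkage clustering on the distance 1 - co-association."""
    distance = 1.0 - co_association(partitions)
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Hypothetical cluster labels for six documents, one row per text representation.
partitions = [
    [0, 0, 0, 1, 1, 1],  # e.g. labels obtained from a BOW representation
    [1, 1, 1, 0, 0, 0],  # same grouping with permuted label ids
    [0, 0, 1, 1, 1, 1],  # disagrees about the third document
]
consensus = explicit_consensus(partitions, n_clusters=2)
print(consensus)
```

Because the co-association matrix only records whether two documents share a cluster, the sketch is insensitive to how each base clustering happens to number its clusters. An implicit approach would instead work on the stacked similarity matrices themselves rather than on the label vectors.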
Keywords
Domains
Computer Science [cs]
Origin: Files produced by the author(s)