An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm

Author(s):
Ferraria, Matheus A. ; Balbi, Pedro P. ; de Castro, Leandro N.
Total Authors: 3
Document type: Journal article
Source: DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE, 21ST INTERNATIONAL CONFERENCE; v. 1259, p. 11-pg., 2025-01-01.
Abstract

This research investigates the challenges and effectiveness of three families of text representation (standard vector, grammar-based, and distributed) when applied to clustering short texts. The study explores Bag-of-Words as the standard vector representation; Linguistic Inquiry and Word Count (LIWC), Part-of-Speech Tagging (POS-Tagging), and the Medical Research Council Psycholinguistic Database (MRC) as grammar-based representations; and Word2Vec, fastText, Doc2Vec, and SentenceBERT as distributed representations. Using the aiNet bio-inspired clustering algorithm, the study finds that grammar-based representations perform competitively despite their simplicity, while standard vector representations exhibit known problems such as high dimensionality. The study contributes insights into the properties of different text representations, providing a foundation for optimizing their application in clustering tasks with short and informal texts.
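
As an illustration of the high-dimensionality issue the abstract attributes to standard vector representations, the following is a minimal sketch of a Bag-of-Words encoding over a few short texts. It is not the authors' pipeline: the whitespace tokenizer, the example texts, and the helper names `bag_of_words` and `sparsity` are all assumptions made for this example.

```python
from collections import Counter


def bag_of_words(texts):
    """Build a raw term-count matrix (hypothetical helper, not from the paper).

    Tokenization is a plain lowercase whitespace split, which is an
    assumption; real pipelines usually add stop-word removal, stemming, etc.
    """
    tokenized = [text.lower().split() for text in texts]
    # One dimension per distinct token seen anywhere in the collection.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # Counter returns 0 for tokens absent from this text.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


def sparsity(matrix):
    """Fraction of zero entries in the term-count matrix."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total if total else 0.0


# Made-up short, informal texts used only for illustration.
texts = [
    "great phone love it",
    "battery died again",
    "love the battery life",
]
vocabulary, matrix = bag_of_words(texts)
print(len(vocabulary), "dimensions for", len(texts), "texts")
print("sparsity:", round(sparsity(matrix), 3))
```

Even in this toy collection the number of dimensions exceeds the number of texts and most entries are zero; with real short-text corpora the vocabulary grows much faster than the length of any single text.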

FAPESP's process: 21/11905-0 - Center of Science, Technology and Development for innovation in Medicine and Health: inLab.iNova
Grantee: Giovanni Guido Cerri
Support Opportunities: Research Grants - Science Centers for Development