

An Analysis of Different Text Representation Schemes for an Immune Clustering Algorithm

Author(s):
Ferraria, Matheus A. ; Balbi, Pedro P. ; de Castro, Leandro N.
Total number of authors: 3
Document type: Scientific article
Source: DISTRIBUTED COMPUTING AND ARTIFICIAL INTELLIGENCE, 21ST INTERNATIONAL CONFERENCE; v. 1259, 11 pp., 2025-01-01.
Abstract

This research investigates the challenges and effectiveness of three families of text representation (standard vector, grammar-based, and distributed) when applied to clustering short texts. The study explores Bag-of-Words for the standard vector family; Linguistic Inquiry and Word Count (LIWC), Part-of-Speech Tagging (POS-Tagging), and the Medical Research Council Psycholinguistic Database (MRC) for the grammar-based family; and Word2Vec, fastText, Doc2Vec, and SentenceBERT for the distributed family. Using the bio-inspired aiNet clustering algorithm, the study finds that grammar-based representations perform competitively despite their simplicity, while standard vectors exhibit known challenges such as high dimensionality. The study contributes insights into the properties of different text representations, providing a foundation for optimizing their application in clustering tasks with short and informal texts.
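
The high-dimensionality issue noted for standard vectors can be illustrated with a minimal Bag-of-Words sketch. This is not the authors' pipeline: the function name `bag_of_words`, the whitespace tokenization, and the three sample texts are assumptions made purely for illustration.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count_vectors) for a list of short texts.

    Hypothetical helper: tokenization is a plain lowercase whitespace
    split, which is an assumption and not the preprocessing of the paper.
    """
    tokenized = [text.lower().split() for text in texts]
    # One dimension per distinct token across the whole collection.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    index = {token: i for i, token in enumerate(vocabulary)}

    vectors = []
    for doc in tokenized:
        vector = [0] * len(vocabulary)
        for token, count in Counter(doc).items():
            vector[index[token]] = count
        vectors.append(vector)
    return vocabulary, vectors


# Illustrative short texts (invented for this sketch).
texts = ["great phone great price", "bad battery", "price too high"]
vocabulary, vectors = bag_of_words(texts)

# Fraction of zero entries: short texts use few of the available dimensions.
total_cells = len(vectors) * len(vocabulary)
zero_cells = sum(value == 0 for vector in vectors for value in vector)
sparsity = zero_cells / total_cells

print(len(vocabulary), vectors[0], round(sparsity, 2))
```

Each distinct word adds a dimension, so the vector length grows with the vocabulary while each short text fills only a handful of entries. Grammar-based and distributed representations instead map every text to a fixed, usually much smaller, number of features.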

FAPESP grant: 21/11905-0 - Center for Science, Technology and Development for Innovation in Medicine and Health: inLab.iNova
Grantee: Giovanni Guido Cerri
Support type: Research Grants - Science Centers for Development