Addressing the gap between current language models and key-term-based clustering

Author(s):
Cabral, Eric M. ; Rezaeipourfarsangi, Sima ; Oliveira, Maria Cristina F. ; Milios, Evangelos E. ; Minghim, Rosane
Total Authors: 5
Document type: Journal article
Source: Proceedings of the 2023 ACM Symposium on Document Engineering (DocEng 2023); 10 pages, 2023-01-01.
Abstract

This paper presents MOD-kt, a modular framework designed to bridge the gap between modern language models and key-term-based document clustering. One of the main challenges of using neural language models for key-term-based clustering is a mismatch in interpretability: the underlying document representation (i.e., document embeddings) is hard to interpret, whereas the semantic elements that allow the user to guide the clustering process (i.e., key terms) are intuitive. Our framework acts as a communication layer between word models and document models, enabling key-term-based clustering on top of both through a flexible and adaptable architecture. We report a comparison of the clustering performance of multiple neural language models, considering a selected range of relevance metrics. Additionally, we conducted a qualitative user study to illustrate the framework's potential for intuitive, user-guided, high-quality clustering of document collections.
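
The general idea of steering a clustering with key terms can be illustrated with a minimal sketch. This is not MOD-kt's method, which the abstract does not detail. It assumes, purely for illustration, that words and documents can be placed in one shared vector space, that a document vector is the mean of its word vectors (a stand-in for a real document model), and that each cluster is described by a few user-chosen key terms. The word vectors, the cluster labels, and the helper names (`mean_vector`, `cosine`, `cluster_by_key_terms`) are all hypothetical.

```python
import math

# Toy 2-D word vectors; hypothetical values for illustration only.
WORD_VECTORS = {
    "neural": (0.9, 0.1),
    "network": (0.8, 0.2),
    "embedding": (0.7, 0.3),
    "protein": (0.1, 0.9),
    "cell": (0.2, 0.8),
    "biosensor": (0.3, 0.7),
}


def mean_vector(words):
    """Average the vectors of the known words (stand-in for a document model)."""
    vectors = [WORD_VECTORS[w] for w in words if w in WORD_VECTORS]
    if not vectors:
        raise ValueError("none of the words has a vector")
    dims = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(dims))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster_by_key_terms(documents, key_terms_per_cluster):
    """Assign each document to the cluster whose key terms it is closest to."""
    # Each cluster is represented by the mean vector of its key terms.
    centroids = {
        label: mean_vector(terms)
        for label, terms in key_terms_per_cluster.items()
    }
    assignments = {}
    for doc_id, text in documents.items():
        doc_vec = mean_vector(text.lower().split())
        assignments[doc_id] = max(
            centroids, key=lambda label: cosine(doc_vec, centroids[label])
        )
    return assignments


documents = {
    "d1": "neural network embedding",
    "d2": "protein cell biosensor",
    "d3": "embedding network cell",
}
# The user guides the clustering by naming key terms for each cluster.
key_terms = {"ml": ["neural", "network"], "bio": ["protein", "cell"]}
assignments = cluster_by_key_terms(documents, key_terms)
print(assignments)
```

In this sketch the user only ever sees and edits key terms, while the similarity computation happens on vectors; changing the key terms of a cluster changes its centroid and therefore the assignments.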

FAPESP's process: 18/22214-6 - Towards a convergence of technologies: from sensing and biosensing to information visualization and machine learning for data analysis in clinical diagnosis
Grantee: Osvaldo Novais de Oliveira Junior
Support Opportunities: Research Projects - Thematic Grants