Rhetorical analysis based on large amount of data

Author(s):
Erick Galani Maziero
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
Thiago Alexandre Salgueiro Pardo; Katti Faceli; Valéria Delisandra Feltrim; Estevam Rafael Hruschka Júnior; Maria das Graças Volpe Nunes
Advisor: Thiago Alexandre Salgueiro Pardo
Abstract

Given the nearly uncountable amount of textual information available on the web, automating tasks related to automatic text processing is an undeniable need. Superficial approaches to NLP (Natural Language Processing) lose important properties of the text, such as the position, order, adjacency and context of textual segments. A deeper analysis, such as that carried out at the discourse level, identifies the rhetorical organization of the text and generates a hierarchical structure in which the author's intentions are identified and related to one another. To automate this task, most work has used machine learning techniques, mainly from the supervised paradigm, which requires manually labeled data to obtain classification models, especially for identifying rhetorical relations. Because manual annotation is a costly process, the results obtained in the task are unsatisfactory, as they remain below human performance. In this thesis, unlabeled data was used on a massive scale in semi-supervised never-ending learning to identify rhetorical relations. To this end, a framework was proposed that uses texts continuously obtained from the web. The framework employs a variation of traditional semi-supervised algorithms together with a concept-drift monitoring strategy. In addition, state-of-the-art techniques for English were adapted to Portuguese. Without human intervention, the F-measure has so far increased from 0.543 to 0.621, a relative gain of about 0.144 (14.4%). This result is the state of the art for discourse analysis in Portuguese. (AU)

FAPESP's process: 11/23323-4 - Automatic rhetorical parsing based on large amount of data.
Grantee: Erick Galani Maziero
Support Opportunities: Scholarships in Brazil - Doctorate