Linguistic characterization of human multi-document summaries: exploiting the lexical level

Grant number: 13/12524-4
Support type: Scholarships in Brazil - Scientific Initiation
Effective date (Start): September 01, 2013
Effective date (End): August 31, 2014
Field of knowledge: Linguistics, Literature and Arts - Linguistics
Principal Investigator: Ariani Di Felippo
Grantee: Vanessa Marcasso
Home Institution: Centro de Educação e Ciências Humanas (CECH). Universidade Federal de São Carlos (UFSCAR). São Carlos, SP, Brazil

Abstract

Computational applications able to handle the large and growing amount of available information, especially online, have become increasingly relevant. Automatic Multi-Document Summarization (MDS) is one of these applications. It aims at automatically producing a single summary from a group of texts on the same topic. In order to produce summaries free of cohesion and coherence problems, MDS methods have to deal with multi-document phenomena such as redundancy, complementarity, and contradiction among information units. Although interest in MDS is recent, many systems have already been developed, including systems for Portuguese. Given the importance of MDS systems, the linguistic characterization of human multi-document summaries becomes increasingly necessary, as it generates knowledge for the production of linguistically motivated summaries. Thus, the goal of this undergraduate research project is to characterize human multi-document summaries at the lexical level. As part of the SUSTENTO project (FAPESP 2012/13246-5 / CNPq 483231/2012-6), which aims at generating linguistic knowledge for MDS of Portuguese, this project aims at (i) specifying the density of nouns, adjectives, verbs, and adverbs in the summaries in relation to their source texts and (ii) describing the similarities and differences between these lexical units in the summaries and in their source texts. In the end, the goal is to obtain lexical features that can serve as conditions for the automatic production of linguistically motivated summaries.