A Complex Network Model for Textual Information Analysis.

Grant number: 10/04917-8
Support type: Scholarships in Brazil - Master
Effective date (Start): September 01, 2010
Effective date (End): December 31, 2011
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Principal Investigator: Paulo Sergio Silva Rodrigues
Grantee: Guilherme Alberto Wachs Lopes
Home Institution: Campus de São Bernardo do Campo. Centro Universitário da FEI (UNIFEI). Fundação Educacional Inaciana Padre Sabóia de Medeiros (FEI). São Bernardo do Campo, SP, Brazil

Abstract

Textual analysis is a human task that involves complex cognitive processes, which are usually hard to model on current computers. These processes, generally parallel, consider both lexical and syntactic information in order to place the text at the correct hierarchical and semantic level. Lexical information relates to the language rules that produce words, while syntactic information relates to the positioning of words in a text. Together, lexical and syntactic information give rise to semantic information. Application areas that demand automatic textual analysis must take this information into account to serve a growing set of goals, such as textual document retrieval, text comparison, automatic speech generation, keyword generation, and text indexing, to name a few. Although the rules of textual interpretation have long been known, many of them are not applied in current practical systems, mainly because of computational time and the high dimensionality of the models. For instance, most textual information retrieval systems rely only on word frequency, or on the number of links pointing to the same web page, to rank documents by relevance for a user query. It is well known that lexical information carried by stop-words, misspelled words and punctuation, as well as syntactic information such as the order in which words appear in the text, is generally not considered in these models. This is one of the many reasons for the well-known semantic gap between the user's request and the information actually returned by the retrieval model. On the other hand, since the early 1990s, complex networks have attracted growing attention from researchers, not only for modeling textual information but also for multimedia data.
The proposed work therefore presents a complex network model that considers not only frequency information but also the order in which words appear, co-occurrences, stop-words and misspelled words. The price of this model is a memory footprint on the order of gigabytes, which is impractical for current hardware. Models of this size have not been fully studied and exhibit behaviors that are hard to predict and analyze. The features of complex networks studied in the literature for more than a decade (such as type of network, average clustering coefficient, degree distribution, weight distribution, distance matrix, radius, diameter, spectral coefficient, and others) make it possible to study such models on large databases. In this work, we propose to study textual information modeled as a complex network of words, for both specific and generic databases. Preliminary studies show that words taken from a specific context, considering the syntactic and lexical features cited above, exhibit scale-free network behavior. We also present heuristics for physical properties that are computationally hard to obtain, such as the average clustering coefficient (ACC). Results suggest that the clustering coefficient can be computed with an error of 5% for dense or sparse networks of up to 10,000 words.
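
As an illustration only, the kind of model described above can be sketched in a few lines of Python. The sketch assumes a simple word-adjacency construction (an edge from each token to the token that follows it, weighted by frequency, with stop-words, misspellings and punctuation kept as nodes) and estimates the ACC by averaging the local clustering coefficient over a random sample of nodes. Node sampling is a generic estimation technique substituted here for illustration; the abstract does not specify the project's own heuristic, and all function names below are hypothetical.

```python
import random
import re
from collections import defaultdict


def build_word_network(text):
    """Directed, weighted adjacency network: weights[(a, b)] counts how often b follows a."""
    # Assumption: every token is kept (stop-words, misspellings), and each
    # punctuation mark is a token of its own, so word order is preserved.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    weights = defaultdict(int)
    for a, b in zip(tokens, tokens[1:]):
        weights[(a, b)] += 1
    return dict(weights)


def undirected_neighbors(weights):
    """Drop edge direction and self-loops; clustering is computed on this view."""
    nbrs = defaultdict(set)
    for a, b in weights:
        if a != b:
            nbrs[a].add(b)
            nbrs[b].add(a)
    return dict(nbrs)


def local_clustering(nbrs, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    ns = list(nbrs.get(node, ()))
    k = len(ns)
    if k < 2:
        return 0.0
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if ns[j] in nbrs[ns[i]]
    )
    return 2.0 * links / (k * (k - 1))


def average_clustering(nbrs):
    """Exact ACC: mean local clustering over all nodes."""
    if not nbrs:
        return 0.0
    return sum(local_clustering(nbrs, n) for n in nbrs) / len(nbrs)


def sampled_average_clustering(nbrs, sample_size, seed=0):
    """Approximate ACC from a uniform random sample of nodes (illustrative estimator)."""
    if not nbrs:
        return 0.0
    rng = random.Random(seed)
    nodes = sorted(nbrs)
    sample = rng.sample(nodes, min(sample_size, len(nodes)))
    return sum(local_clustering(nbrs, n) for n in sample) / len(sample)


if __name__ == "__main__":
    weights = build_word_network("the cat sat on the mat . the cat ate .")
    nbrs = undirected_neighbors(weights)
    print("exact ACC:  ", average_clustering(nbrs))
    print("sampled ACC:", sampled_average_clustering(nbrs, sample_size=4))
```

The exact computation visits every node and every pair of its neighbors, which becomes expensive on large, dense networks; the sampled version trades accuracy for time, and its error depends on the sample size and on the network itself.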