Advanced search
Start date
Betweenand


Similarity in big data

Full text
Author(s):
Lúcio Fernandes Dutra Santos
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
Caetano Traina Junior; Sergio Lifschitz; Marcela Xavier Ribeiro; Vaninha Vieira dos Santos; Marcos Rodrigues Vieira
Advisor: Caetano Traina Junior
Abstract

The data being collected and generated nowadays increase not only in volume, but also in complexity, requiring new query operators. Health care centers collecting image exams and remote sensing from satellites and from earth-based stations are examples of application domains where more powerful and flexible operators are required. Storing, retrieving and analyzing data that are huge in volume, structure, complexity and distribution are now being referred to as big data. Representing and querying big data using only the traditional scalar data types are not enough anymore. Similarity queries are the most pursued resources to retrieve complex data, but until recently, they were not available in the Database Management Systems. Now that they are starting to become available, its first uses to develop real systems make it clear that the basic similarity query operators are not enough to meet the requirements of the target applications. The main reason is that similarity is a concept formulated considering only small amounts of data elements. Nowadays, researchers are targeting handling big data mainly using parallel architectures, and only a few studies exist targeting the efficacy of the query answers. This Ph.D. work aims at developing variations for the basic similarity operators to propose better suited similarity operators to handle big data, presenting a holistic vision about the database, increasing the effectiveness of the provided answers, but without causing impact on the efficiency on the searching algorithms. To achieve this goal, four mainly contributions are presented: The first one was a result diversification model that can be applied in any comparison criteria and similarity search operator. The second one focused on defining sampling and grouping techniques with the proposed diversification model aiming at speeding up the analysis task of the result sets. The third contribution concentrated on evaluation methods for measuring the quality of diversified result sets. Finally, the last one defines an approach to integrate the concepts of visual data mining and similarity with diversity searches in content-based retrieval systems, allowing a better understanding of how the diversity property is applied in the query process. (AU)

FAPESP's process: 13/01517-7 - Simlarity in Big Data
Grantee:Lúcio Fernandes Dutra Santos
Support type: Scholarships in Brazil - Doctorate