Busca avançada
Ano de início
Entree
(Referência obtida automaticamente do Web of Science, por meio da informação sobre o financiamento pela FAPESP e o número do processo correspondente, incluída na publicação pelos autores.)

Generalized enhanced suffix array construction in external memory

Texto completo
Autor(es):
Louza, Felipe A. [1] ; Telles, Guilherme P. [2] ; Hoffmann, Steve [3, 4] ; Ciferri, Cristina D. A. [5]
Número total de Autores: 4
Afiliação do(s) autor(es):
[1] Univ Sao Paulo, Dept Comp & Math, Av Bandeirantes 3900, BR-14040901 Ribeirao Preto - Brazil
[2] Univ Estadual Campinas, Inst Comp, Av Albert Einstein 1251, BR-13083852 Campinas, SP - Brazil
[3] Fritz Lipman Inst, Leibniz Inst Aging, Computat Biol, Beutenbergstr 11, D-07745 Jena - Germany
[4] Friedrich Schiller Univ Jena, Beutenbergstr 11, D-07745 Jena - Germany
[5] Univ Sao Paulo, Inst Math & Comp Sci, Av Trabalhador Sao Carlense 400, BR-13560970 Sao Carlos, SP - Brazil
Número total de Afiliações: 5
Tipo de documento: Artigo Científico
Fonte: Algorithms for Molecular Biology; v. 12, DEC 7 2017.
Citações Web of Science: 3
Resumo

Background: Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory. Results: In this article we present and analyze eGSA {[}introduced in CPM (External memory generalized suffix and LCP arrays construction. In: Proceedings of CPM. pp 201-10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, eGSA showed a competitive performance when compared to eSAIS and SAscan, which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm. Conclusions: The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time. (AU)

Processo FAPESP: 17/09105-0 - Ordenação de sufixos e medidas de similaridade entre cadeias
Beneficiário:Felipe Alves da Louza
Linha de fomento: Bolsas no Brasil - Pós-Doutorado
Processo FAPESP: 11/15423-9 - Proposta de um índice biológico persistente baseado em vetores de sufixo generalizados
Beneficiário:Felipe Alves da Louza
Linha de fomento: Bolsas no Brasil - Mestrado
Processo FAPESP: 11/23904-7 - Processamento de consultas OLAP com predicados de similaridade entre imagens e predicados espaciais em ambientes de data warehousing não-convencionais
Beneficiário:Cristina Dutra de Aguiar Ciferri
Linha de fomento: Auxílio à Pesquisa - Regular