Busca avançada
Ano de início
Entree
(Referência obtida automaticamente do Web of Science, por meio da informação sobre o financiamento pela FAPESP e o número do processo correspondente, incluída na publicação pelos autores.)

External memory BWT and LCP computation for sequence collections with applications

Texto completo
Autor(es):
Egidi, Lavinia [1] ; Louza, Felipe A. [2] ; Manzini, Giovanni [1, 3] ; Telles, Guilherme P. [4]
Número total de Autores: 4
Afiliação do(s) autor(es):
[1] Univ Piemonte Orientale, DiSIT, Viale Michel 11, I-15121 Alessandria - Italy
[2] Univ Sao Paulo, Dept Comp & Math, Av Bandeirantes 3900, BR-14040901 Ribeirao Preto - Brazil
[3] CNR, IIT, Via Moruzzi 1, I-56124 Pisa - Italy
[4] Univ Estadual Campinas, Inst Comp, Av Albert Einstein 1251, BR-13083852 Campinas, SP - Brazil
Número total de Afiliações: 4
Tipo de documento: Artigo Científico
Fonte: Algorithms for Molecular Biology; v. 14, MAR 8 2019.
Citações Web of Science: 1
Resumo

BackgroundSequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows-Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM.ResultsWe propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix-prefix overlaps, and the construction of succinct de Bruijn graphs.ConclusionsWe prove that our algorithm performs O(nmaxlcp) sequential I/Os, where n is the total length of the collection and maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input. (AU)

Processo FAPESP: 17/09105-0 - Ordenação de sufixos e medidas de similaridade entre cadeias
Beneficiário:Felipe Alves da Louza
Linha de fomento: Bolsas no Brasil - Pós-Doutorado
Processo FAPESP: 18/21509-2 - Transformada de Burrows-Wheeler e grafos De Bruijn sucintos
Beneficiário:Felipe Alves da Louza
Linha de fomento: Bolsas no Exterior - Estágio de Pesquisa - Pós-Doutorado