(Reference retrieved automatically from Web of Science, based on the FAPESP funding information and the corresponding grant number included in the publication by the authors.)

Practical implications of using non-relational databases to store large genomic data files and novel phenotypes

Author(s):
Souza, Andre Moreira [1] ; Santos Weigert, Rodrigo de Andrade [1] ; Machado de Sousa, Elaine Parros [1] ; Andrietta, Lucas Tassoni [2] ; Ventura, Ricardo Vieira [2]
Total number of authors: 5
Author affiliation(s):
[1] Univ Sao Paulo, Inst Math & Comp Sci, Sao Carlos, SP - Brazil
[2] Univ Sao Paulo, Sch Vet Med & Anim Sci, Dept Anim Nutr & Prod, BR-13635900 Pirassununga, SP - Brazil
Total number of affiliations: 2
Document type: Journal article
Source: JOURNAL OF ANIMAL BREEDING AND GENETICS; v. 139, n. 1 AUG 2021.
Web of Science citations: 0
Abstract

The objective of our study was to provide practical directions on the storage of genomic information and novel phenotypes (treated here as unstructured data) using a non-relational database. The MongoDB technology was assessed for this purpose, enabling frequent data transactions involving numerous individuals under genetic evaluation. Our study investigated different genomic (Illumina Final Report, PLINK, 0125, FASTQ, and VCF formats) and phenotypic (including media files) information, using both real and simulated datasets. Advantages of our centralized database concept include sublinear running time for queries as the number of samples/markers increases exponentially, in addition to comprehensive management of distinct data formats while searching for specific genomic regions. A comparison of our generic non-relational solution with an existing relational approach (developed for tabular data types and using 2 bits to store each genotype) showed that the relational schema achieved a shorter importing time when handling 50M SNPs (PLINK format). Our experimental results also reinforce that data conversion is a costly step required to load genomic data into both relational and non-relational database systems and therefore must be carefully treated in large applications. (AU)
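
The abstract mentions a relational approach that uses 2 bits to store each genotype. As a rough illustration of that idea only, the following Python sketch packs four genotype codes into one byte. It is not taken from the paper: the function names, the code assignment (0, 1, 2 for allele counts and 3 for a missing call), and the bit order are all assumptions made for this example.

```python
# Hypothetical 2-bit genotype codec; an illustration, not the paper's schema.
# Assumed codes: 0, 1, 2 = allele counts, 3 = missing call.
MISSING = 3


def pack_genotypes(genotypes):
    """Pack genotype codes (0..3) into bytes, four genotypes per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, g in enumerate(genotypes):
        if not 0 <= g <= 3:
            raise ValueError(f"genotype code out of range: {g}")
        # Each genotype occupies 2 bits; the first one sits in the lowest bits.
        packed[i // 4] |= g << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, count):
    """Recover `count` genotype codes from bytes produced by pack_genotypes."""
    if count > 4 * len(packed):
        raise ValueError("count exceeds packed capacity")
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(count)]


sample = [0, 1, 2, MISSING, 2, 2]
blob = pack_genotypes(sample)
restored = unpack_genotypes(blob, len(sample))
```

Under these assumptions, six genotypes fit in two bytes, and the number of genotypes has to be stored separately because the last byte may be only partly used.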

FAPESP grant: 20/04461-6 - Use of machine learning and genomic data to improve economic traits in dairy cattle
Grantee: Lucas Tassoni Andrietta
Type of support: Scholarships in Brazil - Master's
FAPESP grant: 16/19514-2 - Development of a genomic database for the Nelore breed and creation of computational tools aimed at implementing large-scale studies
Grantee: Ricardo Vieira Ventura
Type of support: Research Grants - Young Investigators