Low coverage sequencing data analysis using imputation: Marker selection and population genetics.

Marcus Vinicius Niz Alvarez
Total Authors: 1
Document type: Master's Dissertation
Press: Botucatu. 2020-05-19.
Institution: Universidade Estadual Paulista (Unesp). Instituto de Biociências. Botucatu
Advisor: Paulo Eduardo Martins Ribolla

Introduction: Developing strategies to reduce the cost of whole-genome sequencing (WGS) is important for projects that require large numbers of samples. One low-cost strategy is low-coverage sequencing combined with imputation, which allows efficient genotyping with sufficient reliability. Malaria is one of the main arthropod-borne diseases in the world, and Brazil has a high incidence of malaria, especially in the Amazon region, where the main vector is the mosquito Anopheles darlingi.
Objective: To develop a strategy for analyzing low-coverage WGS data from Anopheles darlingi mosquitoes collected in the municipality of Mâncio Lima, Acre State, and to test for associations between genetic data and data of epidemiological importance, such as biting behavior, time of activity, and distance on a microgeographic scale.
Materials and methods: Anopheles darlingi mosquitoes were collected in the municipality of Mâncio Lima, Acre (AC), between 2016 and 2017. Libraries were prepared with Nextera™ XT and sequenced on an Illumina NextSeq 500. Genotyping by sequencing was performed and imputation was applied. Genome-wide association studies were performed for biting behavior and time of activity. Population stratification signals were investigated using genome-wide FST, with a permutation test to assess significance.
Results: Weak but significant stratification signals were identified between groups separated by 2 to 3 km. Significant associations were observed between biting behavior and single nucleotide polymorphisms (SNPs), mainly SNPs adjacent to the Cyp450 gene. Significant associations were also observed between time of activity and SNPs, including SNPs adjacent to the timeless-2 and rdgC genes.
Conclusions: Low-coverage WGS data combined with imputation is a viable strategy for reducing costs in genomic sequencing projects with large numbers of samples. The stratification analyses support the hypothesis that the Anopheles darlingi population is undergoing genetic stratification on a microgeographic scale in the municipality of Mâncio Lima. The genome-wide association results suggest that SNPs significantly associated with biting behavior may be linked to insecticide resistance genes, and that SNPs significantly associated with time of activity may be linked to genes that regulate the circadian cycle. (AU)
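
The stratification test named in the abstract (genome-wide FST with a permutation test for significance) can be illustrated with a minimal sketch. The abstract does not state which FST estimator or software the dissertation used, so everything below is an assumption made for illustration: genotypes are taken to be imputed diploid calls coded 0/1/2, FST is computed as a ratio of sums of (HT - HS) over HT across SNPs, and the function names `allele_freq`, `genome_wide_fst` and `permutation_pvalue` are hypothetical.

```python
import random


def allele_freq(genotypes):
    """Alternate-allele frequency per SNP; genotypes are rows of 0/1/2 calls (assumed coding)."""
    n = len(genotypes)
    n_snps = len(genotypes[0])
    return [sum(g[j] for g in genotypes) / (2 * n) for j in range(n_snps)]


def genome_wide_fst(group_a, group_b):
    """Ratio-of-sums FST = sum(HT - HS) / sum(HT) over SNPs (an assumed, Nei-style estimator)."""
    pa, pb = allele_freq(group_a), allele_freq(group_b)
    na, nb = len(group_a), len(group_b)
    num = den = 0.0
    for a, b in zip(pa, pb):
        p = (na * a + nb * b) / (na + nb)            # pooled allele frequency
        ht = 2 * p * (1 - p)                         # expected heterozygosity, pooled sample
        hs = (na * 2 * a * (1 - a) + nb * 2 * b * (1 - b)) / (na + nb)  # within-group mean
        num += ht - hs
        den += ht
    return num / den if den > 0 else 0.0


def permutation_pvalue(group_a, group_b, n_perm=1000, seed=0):
    """Shuffle group labels and count how often the permuted FST reaches the observed FST."""
    rng = random.Random(seed)
    observed = genome_wide_fst(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    na = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if genome_wide_fst(pooled[:na], pooled[na:]) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)
```

In this sketch, two groups of mosquitoes (for example, collection sites 2 to 3 km apart) would each be passed as a list of per-individual genotype rows. A small p-value indicates that the observed FST is larger than expected if individuals were assigned to groups at random.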