Statistical analysis of evolution by genome rearrangements

Priscila do Nascimento Biller

Author(s):	Priscila do Nascimento Biller
Total Authors:	1
Document type:	Doctoral Thesis
Press:	Campinas, SP.
Institution:	Universidade Estadual de Campinas (UNICAMP). Instituto de Computação
Defense date:	2016-08-15
Examining board members:	João Meidanis; Cid Carvalho de Souza; João Carlos Setubal; Sergio Russo Matioli; Zanoni Dias; Fábio Luiz Usberti
Advisor:	João Meidanis
Abstract
The comparative method in evolutionary biology consists of detecting similarities and differences between extant organisms and, based on more or less formalized hypotheses about evolutionary processes, inferring ancestral states that explain the similarities and an evolutionary history that explains the differences. A classical problem in comparative genomics is to compare two genomes and estimate the amount of evolutionary change that has occurred in the lineages separating them. Evolutionary changes in genomes can happen at different scales, from single nucleotide mutations to large chromosomal rearrangements. In this thesis we present new models of evolution by rearrangements, and statistical estimators based on them. We first propose an exact, closed-form, analytically invertible formula for the expected number of breakpoints after a given number of Double-Cut-and-Join (DCJ) operations. This improves on the previously proposed formula, which was heuristic, recursive and computationally slower. We then establish formal links between genome evolution by DCJ and three well-known processes (binary sequences under substitutions, permutations under transpositions, and random graphs), and thereby theoretically ground or correct the intuitions of earlier studies. To validate the ability to estimate the number of rearrangements in biological data and to produce benchmarks for rearrangement studies, we used Aevol, an in silico experimental evolution platform designed to understand processes of genome structural evolution. We tested several estimators based on traditional models of evolution by inversions, and showed that most combinatorial and statistical estimators, which behaved perfectly on ad hoc simulations, failed on this dataset. Ad hoc simulations very often encode the same simplifications and assumptions as the inference methods.
Artificial life systems and in silico models of genome evolution, however, are independent and based on more sophisticated biological principles than most ad hoc simulators. Consequently, we argue that the data they produce are probably closer to actual biological data. We then provide an in-depth examination of the flaws that we identified in the analyzed models. These flaws fall into two categories: ignoring the heterogeneity of susceptibility to breakage across genomic regions, and assuming that the number of susceptible regions is given. We then propose a model of evolution by inversions in which breakage probabilities vary across regions and over time. It includes as a particular case the uniform breakage model on the nucleotide sequence, in which breakage probabilities are proportional to fragile region lengths. In this particular case, the equilibrium distribution in the model resembles the distribution of intergene sizes observed in diverse organisms. This model is very different from the frequently used model in which all fragile regions have the same probability of breaking. Estimators based on our model performed far better on simulated data, and gave the most plausible results on pairs of amniote genomes when the number of susceptible regions was co-estimated (AU)
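
The particular case mentioned in the abstract, where breakage probabilities are proportional to fragile region lengths, can be illustrated with a minimal simulation sketch. This is not the thesis's implementation: the function names, the starting configuration (all regions of equal length), and the choice to ignore inversions whose two breaks fall in the same region are assumptions made only for illustration. Each inversion cuts two regions at uniform positions and rejoins the pieces, so region lengths are reshuffled while the total length and the number of regions stay constant.

```python
import random


def pick_region(lengths, rng):
    """Pick a region index with probability proportional to its length."""
    return rng.choices(range(len(lengths)), weights=lengths, k=1)[0]


def apply_inversion(lengths, rng):
    """Apply one inversion with both breakpoints uniform on the sequence.

    Regions i and j are cut at offsets a and b. After the inversion the
    two left pieces are joined, and so are the two right pieces.
    Assumption of this sketch: if both breaks hit the same region,
    the event is ignored.
    """
    i = pick_region(lengths, rng)
    j = pick_region(lengths, rng)
    if i == j:
        return
    a = rng.uniform(0, lengths[i])  # left piece of region i
    b = rng.uniform(0, lengths[j])  # left piece of region j
    joined_left = a + b
    joined_right = (lengths[i] - a) + (lengths[j] - b)
    lengths[i], lengths[j] = joined_left, joined_right


def simulate_region_lengths(n_regions, initial_length, n_inversions, seed=0):
    """Return the region lengths after n_inversions random inversions."""
    rng = random.Random(seed)
    lengths = [float(initial_length)] * n_regions
    for _ in range(n_inversions):
        apply_inversion(lengths, rng)
    return lengths


if __name__ == "__main__":
    final = simulate_region_lengths(n_regions=50, initial_length=100.0,
                                    n_inversions=2000, seed=1)
    print("shortest region:", min(final))
    print("longest region:", max(final))
```

A histogram of the returned lengths after many inversions gives a rough picture of the length distribution this simplified process settles into. Comparing it with real intergene sizes, as the abstract describes, would require the full model from the thesis.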

FAPESP's process:	12/14104-0 - Genome Rearrangement Problems Viewed Through Permutations, Matrices and Other Algebraic Concepts
Grantee:	Priscila Do Nascimento Biller
Support Opportunities:	Scholarships in Brazil - Doctorate
