| Grant number: | 25/21836-7 |
| Support Opportunities: | Scholarships abroad - Research Internship - Scientific Initiation |
| Start date: | February 01, 2026 |
| End date: | May 31, 2026 |
| Field of knowledge: | Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques |
| Principal Investigator: | Ricardo Cerri |
| Grantee: | João Pedro Viguini Tolentino Taufner Correa |
| Supervisor: | Pedro Beltrao |
| Host Institution: | Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil |
| Institution abroad: | Swiss Federal Institute of Technology Zurich, Switzerland |
| Associated to the scholarship: | 25/06512-0 - Temporal Prediction of Genetic Associations in ALS Through NLP and Complex Network Analysis, BP.IC |
| Abstract: | The majority of genetic variants associated with Amyotrophic Lateral Sclerosis (ALS) through Genome-Wide Association Studies (GWAS) reside in non-coding regions, and the associated loci typically span large genomic segments that often contain multiple candidate genes. A major challenge in the post-GWAS era is to identify the true causal genes within these loci. To address this, we propose a computational framework that integrates statistical genetics with Natural Language Processing (NLP) to prioritize genes. In conventional workflows, GWAS signals are first fine-mapped, meaning they are refined into smaller sets of variants and genes most likely to drive the association, and only afterwards is functional information used to interpret them. In contrast, our framework incorporates prior biological knowledge directly during this statistical refinement step. Specifically, we use word embeddings derived from biomedical literature via Word2Vec and fastText models to inform gene-level priors in a Bayesian fine-mapping model, so that genes do not all start with an equal prior probability of being causal. We hypothesize that embedding-informed priors will increase the probability of recovering not only known ALS causal genes (e.g., SOD1, C9orf72, FUS, TARDBP) but also potential new candidates with higher confidence. Final prioritization then combines statistical evidence with semantic similarity scores, bridging association signals with biological function in a unified framework. This approach provides a scalable and interpretable strategy to accelerate causal gene discovery in ALS and other complex diseases. |
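To illustrate how embedding-informed priors could enter a fine-mapping step, here is a minimal Python sketch. It is not the project's implementation. It assumes a single causal gene per locus, toy two-dimensional vectors standing in for Word2Vec or fastText embeddings, a softmax over cosine similarity to a disease reference vector, and hypothetical gene names, log Bayes factors, and a `temperature` parameter, all invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def embedding_priors(gene_vectors, disease_vector, temperature=0.5):
    """Turn gene-to-disease similarities into prior probabilities via a softmax.

    `temperature` is a hypothetical tuning knob: larger values flatten the
    priors toward uniform, smaller values sharpen them.
    """
    sims = {g: cosine(v, disease_vector) for g, v in gene_vectors.items()}
    top = max(sims.values())  # subtract the maximum for numerical stability
    weights = {g: math.exp((s - top) / temperature) for g, s in sims.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def posterior_probabilities(log_bayes_factors, priors):
    """Posterior per gene, assuming exactly one causal gene in the locus.

    posterior(g) is proportional to prior(g) * BF(g), computed in log space.
    """
    log_scores = {g: math.log(priors[g]) + log_bayes_factors[g] for g in priors}
    top = max(log_scores.values())
    unnorm = {g: math.exp(s - top) for g, s in log_scores.items()}
    total = sum(unnorm.values())
    return {g: u / total for g, u in unnorm.items()}


# Toy inputs: placeholder gene names, vectors, and log Bayes factors (not real data).
toy_vectors = {"GENE_A": [1.0, 0.0], "GENE_B": [0.0, 1.0], "GENE_C": [1.0, 1.0]}
toy_disease_vector = [1.0, 0.0]
toy_log_bf = {"GENE_A": 1.0, "GENE_B": 1.5, "GENE_C": 1.2}

toy_priors = embedding_priors(toy_vectors, toy_disease_vector)
toy_posteriors = posterior_probabilities(toy_log_bf, toy_priors)
for gene in sorted(toy_posteriors, key=toy_posteriors.get, reverse=True):
    print(f"{gene}: prior={toy_priors[gene]:.3f} posterior={toy_posteriors[gene]:.3f}")
```

With equal Bayes factors the posteriors reduce to the priors, so in this sketch the literature-derived similarity only shifts the ranking where the statistical evidence alone does not separate the genes strongly.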