Gene Prioritization in ALS through Statistical Evidence and Word Embeddings

Grant number: 25/21836-7
Support Opportunities: Scholarships abroad - Research Internship - Scientific Initiation
Start date: February 01, 2026
End date: May 31, 2026
Field of knowledge: Physical Sciences and Mathematics - Computer Science - Computing Methodologies and Techniques
Principal Investigator: Ricardo Cerri
Grantee: João Pedro Viguini Tolentino Taufner Correa
Supervisor: Pedro Beltrao
Host Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil
Institution abroad: Swiss Federal Institute of Technology Zurich, Switzerland
Associated to the scholarship: 25/06512-0 - Temporal Prediction of Genetic Associations in ALS Through NLP and Complex Network Analysis, BP.IC

Abstract

Most genetic variants associated with Amyotrophic Lateral Sclerosis (ALS) in Genome-Wide Association Studies (GWAS) lie in non-coding regions, and the associated loci typically span large genomic segments that contain multiple candidate genes. A major challenge in the post-GWAS era is identifying the true causal genes within these loci. To address this, we propose a computational framework that integrates statistical genetics with Natural Language Processing (NLP) to prioritize genes. In conventional workflows, GWAS signals are first fine-mapped, that is, refined into smaller sets of variants and genes most likely to drive the association, and only afterwards is functional information used to interpret them. In contrast, our framework incorporates prior biological knowledge directly into this statistical refinement step. Specifically, we use word embeddings derived from biomedical literature with Word2Vec and fastText models to inform gene-level priors in a Bayesian fine-mapping model, so that genes do not all start with an equal probability of being causal. We hypothesize that embedding-informed priors will increase the probability of recovering known ALS causal genes (e.g., SOD1, C9orf72, FUS, TARDBP) and will also identify potential new candidates with higher confidence. Final prioritization then combines statistical evidence with semantic similarity scores, linking association signals to biological function in a unified framework. This approach provides a scalable and interpretable strategy to accelerate causal gene discovery in ALS and other complex diseases.
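
To illustrate the idea of embedding-informed priors, the following is a minimal sketch and not the project's actual model. It assumes a single causal gene per locus, turns the cosine similarity between each gene vector and a disease vector into a prior with a softmax, and multiplies that prior by a per-gene Bayes factor. The gene names, vectors, Bayes factors, the `temperature` parameter, and both function names are hypothetical placeholders. In practice the vectors would come from a trained Word2Vec or fastText model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def embedding_priors(gene_vectors, disease_vector, temperature=1.0):
    """Map gene-to-disease similarities to prior probabilities (softmax).

    `temperature` is an assumed tuning knob: lower values concentrate the
    prior on the most similar genes, higher values flatten it.
    """
    genes = list(gene_vectors)
    sims = np.array(
        [cosine_similarity(gene_vectors[g], disease_vector) for g in genes]
    )
    logits = sims / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return dict(zip(genes, weights / weights.sum()))


def posterior_probabilities(priors, log_bayes_factors):
    """Combine priors with per-gene log Bayes factors.

    Assumes exactly one causal gene in the locus, so the posterior is the
    normalized product of prior and Bayes factor.
    """
    genes = list(priors)
    log_post = np.array(
        [np.log(priors[g]) + log_bayes_factors[g] for g in genes]
    )
    log_post = log_post - log_post.max()
    post = np.exp(log_post)
    return dict(zip(genes, post / post.sum()))


# Toy inputs (made up for illustration, not real embeddings or GWAS results).
disease_vector = np.array([1.0, 0.0, 0.0])
gene_vectors = {
    "GENE_A": np.array([0.9, 0.1, 0.0]),
    "GENE_B": np.array([0.1, 0.9, 0.0]),
    "GENE_C": np.array([0.0, 0.1, 0.9]),
}
log_bayes_factors = {"GENE_A": 1.0, "GENE_B": 2.0, "GENE_C": 0.5}

priors = embedding_priors(gene_vectors, disease_vector, temperature=0.5)
posteriors = posterior_probabilities(priors, log_bayes_factors)

for gene in gene_vectors:
    print(f"{gene}: prior={priors[gene]:.3f} posterior={posteriors[gene]:.3f}")
```

With equal Bayes factors the posterior reduces to the prior, which makes explicit that the embeddings only shift the ranking where the statistical evidence alone does not separate the candidate genes.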
