
Comparison of Search Accuracy for Bibliographic Data Among Different Large Language Models in Medicine

Grant number: 25/00568-4
Support Opportunities: Scholarships in Brazil - Scientific Initiation
Start date: March 01, 2025
End date: February 28, 2026
Field of knowledge: Health Sciences - Medicine - Pathological Anatomy and Clinical Pathology
Principal Investigator: Konradin Metze
Grantee: Maria Fernanda de Ávila Reis
Host Institution: Faculdade de Ciências Médicas (FCM). Universidade Estadual de Campinas (UNICAMP). Campinas, SP, Brazil
Company: Universidade Estadual de Campinas (UNICAMP). Faculdade de Engenharia Elétrica e de Computação (FEEC)
Associated research grant: 20/09838-0 - BI0S - Brazilian Institute of Data Science, AP.PCPE

Abstract

The significant advances in artificial intelligence (AI) have brought about changes in all areas of science and technology. The creation of Large Language Models (LLMs) has popularized AI among different audiences. These tools offer easier access than traditional systems, since communication between humans and computers takes place in ordinary language. As a result, they have gained popularity among students, for example, by assisting in academic work. However, research has shown that there is a risk of LLMs providing incorrect information to users, which can compromise the quality of AI-based work.

Additionally, pilot studies have shown that when searching for bibliographic data in the field of medicine, many results were inaccurate or fictional. In this context, it is crucial to monitor the quality of these systems, which can be done by repeatedly applying standardized prompts to five Large Language Models (ChatGPT 3.5, ChatGPT 4, Copilot, Consensus, and Gemini).

Fifty questions will be applied to the bibliographic data of randomly selected researchers in the field of Chagas disease. The answers will be thoroughly compared with the actual bibliographic output as documented in Clarivate's Web of Science and Google Scholar.

Subsequently, the quality of each response will be categorized as: correct, minor errors, major errors, or hallucination. The latter term is used when a title does not exist or, when it does exist, is incorrectly attributed to the author in question. A quantitative comparison will then be made of the differences among the various LLMs and of the responses between two distinct calls within the same chatbot, separated by a minimum interval of six months.

In this regard, the study could provide valuable insights for users in the medical field and help select the least problematic Large Language Models for bibliographic research.
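As a rough illustration of the categorization step described above, the sketch below (in Python) tallies one model's returned references against a verified bibliography. The model name, the sample entries, and the simple matching rules are hypothetical assumptions for illustration only, not the study's actual criteria or implementation.

```python
# Minimal sketch (illustrative, not the project's protocol): classify each
# LLM-returned reference against a verified bibliography and count the
# resulting quality categories for one model.
from collections import Counter


def categorize(returned_title: str, returned_author: str,
               verified_bibliography: dict[str, str]) -> str:
    """Assign one quality category to a single returned reference.

    verified_bibliography maps documented titles (e.g. from Web of Science
    or Google Scholar) to their documented author.
    """
    if returned_title not in verified_bibliography:
        return "hallucination"  # the title does not exist
    if returned_author != verified_bibliography[returned_title]:
        return "hallucination"  # real title, wrongly attributed to the author
    # Placeholder: the distinction between "correct", "minor errors", and
    # "major errors" (journal, year, volume, pages, etc.) would be made here.
    return "correct"


def tally(responses, verified_bibliography) -> Counter:
    """Count categories over one model's batch of (title, author) responses."""
    return Counter(
        categorize(title, author, verified_bibliography)
        for title, author in responses
    )


if __name__ == "__main__":
    # Hypothetical verified bibliography and hypothetical model output.
    bibliography = {"Prognostic factors in Chagas disease": "Metze K"}
    model_responses = [
        ("Prognostic factors in Chagas disease", "Metze K"),    # matches
        ("Prognostic factors in Chagas disease", "Smith J"),    # misattributed
        ("Nonexistent Chagas cardiomyopathy review", "Metze K"),  # invented
    ]
    print(tally(model_responses, bibliography))
    # e.g. Counter({'hallucination': 2, 'correct': 1})
```

The same tally, repeated per model and per call (with the minimum six-month interval between calls), would give the counts needed for the quantitative comparison.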
