Understanding the performance of machine learning models from data- to patient-level

Valeriano, Maria Gabriela; Matran-Fernandez, Ana; Kiffer, Carlos; Lorena, Ana Carolina

Author(s):	Valeriano, Maria Gabriela; Matran-Fernandez, Ana; Kiffer, Carlos; Lorena, Ana Carolina
Total Authors:	4
Document type:	Journal article
Source:	ACM Journal of Data and Information Quality; v. 16, n. 4, 19 pp., 2024-12-01.
Abstract
Machine Learning (ML) models have the potential to support decision-making in healthcare by capturing complex patterns within data. However, decisions in this domain are sensitive and require the active involvement of domain specialists with deep knowledge of the data. To contribute effectively, clinicians need to understand how predictions are generated so they can provide feedback for model refinement. There is usually a communication gap between data scientists and domain specialists that needs to be addressed. Specifically, many ML studies report only average accuracies over an entire dataset, losing valuable insights that can be obtained from a more fine-grained, patient-level analysis of classification performance. In this article, we present a case study aimed at explaining the factors that contribute to specific predictions for individual patients. Our approach takes a data-centric perspective, focusing on the structure of the data and its correlation with ML model performance. We use the concept of Instance Hardness, which measures how difficult an instance is to classify correctly. By selecting the instances that are hardest and easiest to classify, we analyze and contrast the distributions of specific input features and extract meta-features to describe each instance. Furthermore, we examine certain instances individually, offering insights into why they pose challenges for classification and enabling a better understanding of both the successes and failures of the ML models. This opens up the possibility for discussions between data scientists and domain specialists, supporting collaborative decision-making.
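
The abstract does not say how Instance Hardness is computed in the study. As a minimal sketch of the general idea, the example below assumes one commonly used hardness measure, k-Disagreeing Neighbors (kDN): the fraction of an instance's k nearest neighbors that carry a different class label. The function names, the toy data, and the choice of k are illustrative assumptions, not the authors' implementation.

```python
from math import dist

def k_disagreeing_neighbors(X, y, k=2):
    """kDN hardness: share of each instance's k nearest neighbors
    (Euclidean distance) whose label differs from its own.
    Assumed stand-in measure; the abstract does not name one."""
    scores = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Rank every other instance by distance to instance i.
        others = sorted((j for j in range(len(X)) if j != i),
                        key=lambda j: dist(xi, X[j]))
        neighbors = others[:k]
        disagreeing = sum(1 for j in neighbors if y[j] != yi)
        scores.append(disagreeing / len(neighbors))
    return scores

def select_extremes(scores, n=1):
    """Return (hardest, easiest) instance indices by hardness score."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    return ranked[-n:][::-1], ranked[:n]

# Hypothetical toy data: two well-separated clusters, plus one
# class-1 instance (index 6) placed inside the class-0 cluster.
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
     (0.05, 0.05)]
y = [0, 0, 0, 1, 1, 1, 1]

scores = k_disagreeing_neighbors(X, y, k=2)
hardest, easiest = select_extremes(scores, n=1)
print(scores)
print("hardest:", hardest, "easiest:", easiest)
```

In this toy setting the misplaced instance receives the highest score, because all of its neighbors belong to the other class. Instances selected this way could then be inspected individually or have their feature distributions contrasted, in the spirit of the patient-level analysis the abstract describes.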

FAPESP's process:	21/06870-3 - Beyond algorithm selection: meta-learning for data and algorithm analysis and understanding
Grantee:	Ana Carolina Lorena
Support Opportunities:	Research Grants - Young Investigators Grants - Phase 2
