Data heterogeneity consideration in semi-supervised learning

Araujo, Bilza; Zhao, Liang

Author(s):	Araujo, Bilza [1, 2] ; Zhao, Liang [3] Total Authors: 2
Affiliation:	[1] Fed Univ Southern Bahia, Inst Humanities Arts & Sci, BR-45810000 Porto Seguro, BA - Brazil ; [2] Univ Sao Paulo, Inst Math & Comp Sci, Dept Comp Sci, BR-13560970 Sao Paulo - Brazil ; [3] Univ Sao Paulo, Sch Philosophy Sci & Literature Ribeirao Preto, Dept Computat & Math, BR-14090901 Sao Paulo - Brazil Total Affiliations: 3
Document type:	Journal article
Source:	EXPERT SYSTEMS WITH APPLICATIONS; v. 45, p. 234-247, MAR 1 2016.
Web of Science Citations:	4
Abstract
In the class (cluster) formation process of machine learning techniques, data instances are usually assumed to have equal relevance. However, this assumption frequently does not hold. The situation is more typical in semi-supervised learning, since the data structure of both labeled and unlabeled data must be understood at the same time. In this paper, we investigate the organizational heterogeneity of data in semi-supervised learning using a graph representation. A graph is a natural choice for characterizing the relationship between any pair of nodes or any pair of groups of nodes; consequently, the strategic location of each node or group of nodes can be determined by graph measures. Specifically, two issues are addressed: (1) We propose an adaptive graph construction method, which we call AdaRadius, that considers the heterogeneity of the local interacting structure among nodes. As a result, it presents several interesting properties, namely adaptability to data density variations, low dependency on parameter settings, and reasonable computational cost, for both pool-based and incremental data. (2) We also present heuristic criteria for selecting representative data samples to be labeled. An experimental study shows that selective labeling usually yields better classification results than random labeling. To our knowledge, both issues have lacked investigation until now; therefore, our approach represents an important step toward characterizing data heterogeneity not only in semi-supervised learning, but also in machine learning in general. (C) 2015 Elsevier Ltd. All rights reserved.
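
The abstract does not spell out how AdaRadius builds its graph or which heuristics pick the samples to label, so the following is only a generic illustration of the two ideas it names: a neighborhood graph whose radius adapts to local density, and a centrality-based choice of nodes to label. Everything here is an assumption made for illustration and is not the authors' method: the function names `adaptive_radius_graph` and `select_by_degree`, the parameter `k`, the use of the k-th nearest-neighbor distance as the local radius, and node degree as the selection criterion.

```python
import numpy as np


def adaptive_radius_graph(X, k=1):
    """Build an undirected graph with a per-point radius (illustrative only).

    Assumption: each point's radius is the distance to its k-th nearest
    neighbor, so dense regions get small radii and sparse regions large ones.
    Two points are linked when their distance is within either point's radius.
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the zero self-distance,
    # so column k holds the distance to the k-th nearest neighbor.
    radius = np.sort(dist, axis=1)[:, k]
    adj = (dist <= radius[:, None]) | (dist <= radius[None, :])
    np.fill_diagonal(adj, False)  # no self-loops
    return adj


def select_by_degree(adj, n_select):
    """Pick the n_select highest-degree nodes as candidates for labeling.

    Assumption: degree stands in for whichever graph measure is actually used.
    """
    degree = adj.sum(axis=1)
    order = np.argsort(-degree, kind="stable")
    return order[:n_select]


# A dense group (spacing 1) and a sparse group (spacing 20) on a line.
X = np.array([[0.0], [1.0], [2.0], [3.0], [100.0], [120.0], [140.0], [160.0]])
adj = adaptive_radius_graph(X, k=1)
to_label = select_by_degree(adj, n_select=2)
```

With a single global radius, a value small enough to suit the dense group would leave the sparse group disconnected; a per-point radius lets both groups form their own connected neighborhoods without linking them to each other.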

FAPESP's process:	13/07375-0 - CeMEAI - Center for Mathematical Sciences Applied to Industry
Grantee:	Francisco Louzada Neto
Support Opportunities:	Research Grants - Research, Innovation and Dissemination Centers - RIDC


FAPESP's process:	11/50151-0 - Dynamical phenomena in complex networks: fundamentals and applications
Grantee:	Elbert Einstein Nehrer Macau
Support Opportunities:	Research Projects - Thematic Grants
