Unsupervised context detection of streaming data for classification

Grant number: 17/22896-7
Support type: Scholarships in Brazil - Doctorate
Effective date (Start): August 01, 2018
Effective date (End): March 31, 2020
Field of knowledge: Physical Sciences and Mathematics - Computer Science
Principal Investigator: Elaine Parros Machado de Sousa
Grantee: Denis Moreira dos Reis
Home Institution: Instituto de Ciências Matemáticas e de Computação (ICMC). Universidade de São Paulo (USP). São Carlos, SP, Brazil

Abstract

Learning from data streams with extreme verification latency is a challenging endeavor. Extreme verification latency means that no labels are available after the classifier is deployed. Therefore, the classifier must detect and adapt to concept drifts without information about the correct classes of the examples. This perspective differs considerably from most supervised approaches to data stream learning. The literature frequently assumes that labeled data is fully available even in the deployment setting, so drift detectors can use actual performance data to flag changes in data distribution, and classifiers can update themselves with correctly labeled data.

However, many real-world applications fall into the extreme verification latency scenario. As a motivating example, consider a sensor that classifies insects into species using wing-beat data. Such a sensor is key to scalable real-time surveillance of flying insects such as agricultural pests and disease vectors, and it would have to deal with concept drifts in an extreme verification latency setting. Although it is possible to gather labeled data in the laboratory, obtaining actual classes after deployment is expensive and does not scale. Nevertheless, the classifier would have to face concept drifts, since ambient conditions such as temperature, humidity, air pressure and other factors influence the behavior of the insects.

We observe that, although a large set of latent factors can cause concept drifts, a smaller subset of variables is generally responsible for most of them. Once these variables are identified, data can be gathered offline while controlling them.

Also, drifts are frequently recurrent, meaning that we can group the latent factors into a discrete and relatively small number of contexts. Changes in the latent factors cause a back-and-forth switch among contexts, leading to recurrence.

There are real applications that fulfill these assumptions. For the previously mentioned sensor, although several conditions may influence the behavior of insects, including some factors that are difficult to measure, such as availability of water and food, temperature is the variable that most affects their wing-beat data. In the laboratory, specialized chambers can control the temperature artificially. Therefore, plenty of labeled insect data can be obtained for different temperatures.

In this research project, we seek to determine under which circumstances we can identify the currently ongoing context, among a finite set of well-defined contexts, with limited and unlabeled data, and to provide methods to do so. Additionally, we seek to tackle the appearance of novel contexts, i.e., contexts that are not minimally similar to any of the known contexts.
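
To make the goal of context identification more concrete, the following is a minimal illustrative sketch, not the method proposed in this project. It assumes that each known context (for example, a temperature range) can be summarized by a diagonal Gaussian fitted to feature vectors gathered offline, that the current context is the one whose model gives the highest average log-likelihood to a small unlabeled window, and that a window scoring below a hand-picked threshold under every model is treated as a novel context. The function names, the single wing-beat-frequency-like feature, the synthetic data, and the threshold value are all hypothetical.

```python
import numpy as np


def fit_contexts(data_by_context):
    """Fit one diagonal Gaussian per known context from offline data.

    data_by_context maps a context name to an array of shape (n, d).
    Class labels are not needed here: only the feature distribution is modeled.
    """
    models = {}
    for name, X in data_by_context.items():
        X = np.asarray(X, dtype=float)
        # A small constant keeps the variance strictly positive.
        models[name] = (X.mean(axis=0), X.var(axis=0) + 1e-9)
    return models


def mean_log_likelihood(X, mean, var):
    """Average per-example log-likelihood of X under a diagonal Gaussian."""
    X = np.asarray(X, dtype=float)
    ll = -0.5 * (np.log(2.0 * np.pi * var) + (X - mean) ** 2 / var)
    return ll.sum(axis=1).mean()


def detect_context(window, models, novelty_threshold):
    """Return (best_context_or_None, scores) for an unlabeled window.

    None signals that no known context explains the window well enough
    (assumed novelty criterion: best score below novelty_threshold).
    """
    scores = {
        name: mean_log_likelihood(window, mean, var)
        for name, (mean, var) in models.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] < novelty_threshold:
        return None, scores
    return best, scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for laboratory data collected under two controlled
    # conditions; the values are invented for illustration only.
    lab_data = {
        "cool": rng.normal(400.0, 20.0, size=(200, 1)),
        "warm": rng.normal(500.0, 20.0, size=(200, 1)),
    }
    models = fit_contexts(lab_data)

    # A short unlabeled window drawn from the "warm" condition.
    window = rng.normal(500.0, 20.0, size=(30, 1))
    context, scores = detect_context(window, models, novelty_threshold=-10.0)
    print(context, scores)
```

A real stream mixes several classes whose proportions may themselves change, so a single Gaussian per context is a strong simplification; the sketch only illustrates the idea of matching unlabeled data against a finite set of contexts modeled offline.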