Time series clustering for data streams

Author(s):
Cássio Martini Martins Pereira
Total Authors: 1
Document type: Doctoral Thesis
Press: São Carlos.
Institution: Universidade de São Paulo (USP). Instituto de Ciências Matemáticas e de Computação (ICMC/SB)
Defense date:
Examining board members:
Rodrigo Fernandes de Mello; Gustavo Enrique de Almeida Prado Alves Batista; Estevam Rafael Hruschka Júnior; Guilherme Pimentel Telles; Fernando José von Zuben
Advisor: Rodrigo Fernandes de Mello
Abstract

Data stream mining, which aims to extract useful information from massive and continuous data sources that evolve over time, has recently gained importance. One of the most popular techniques in this area is clustering, which structures large volumes of data into hierarchies or partitions such that similar objects are placed in the same group. Several algorithms have been proposed in this context; however, most of them focus on clustering streams composed of multidimensional points. Few studies have addressed the clustering of streaming time series, which are collections of observations sampled sequentially over time. Current techniques for clustering streaming time series are limited in their choice of similarity measure, as most rely on a simple correlation, such as Pearson's. This thesis shows that even for classic time series models, such as those of Box and Jenkins, the Pearson correlation is not capable of detecting similarity, even when the series originate from the same mathematical model with the same parametrization. This limitation motivated this work to consider time series generating models, i.e., generating equations, through the use of several descriptive measures, such as Auto Mutual Information, the Hurst Exponent and several others. The hypothesis is that using several descriptive measures yields a better characterization of time series generating models, which in turn leads to better clustering quality. In that context, several descriptive measures were evaluated and then used as input to a new tree-based clustering algorithm, named TS-Stream. Experiments were conducted with synthetic data sets composed of various time series models, confirming the superiority of TS-Stream over ODAC, the most successful technique in the literature for this task.
Experiments with real-world time series from NYSE and NASDAQ stock market data showed that using TS-Stream to select stocks for a diversified portfolio can increase investment returns by several orders of magnitude compared to trading strategies based solely on the Moving Average Convergence Divergence financial indicator. (AU)
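
The abstract's point about Pearson correlation can be illustrated with a minimal sketch. It is not code from the thesis: the AR(1) model, the coefficient 0.8, the series length and the function names are all assumptions chosen for illustration, and lag-1 autocorrelation is used only as a simple stand-in for a descriptive measure (the thesis names Auto Mutual Information and the Hurst Exponent, which are not implemented here). Two independent realizations of the same model are generated; their pointwise Pearson correlation is expected to be close to zero, while a descriptive measure computed on each series separately should give similar values.

```python
import math
import random


def simulate_ar1(phi, n, rng):
    """Simulate n observations of x[t] = phi * x[t-1] + e[t], with e[t] ~ N(0, 1)."""
    x = [0.0] * n
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.gauss(0.0, 1.0)
    return x


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def lag1_autocorrelation(x):
    """Correlation of a series with itself shifted by one step (illustrative descriptive measure)."""
    return pearson(x[:-1], x[1:])


# Hypothetical parameters, chosen only for this illustration.
PHI = 0.8
N = 5000
rng = random.Random(42)

# Two independent realizations of the same model with the same parametrization.
series_a = simulate_ar1(PHI, N, rng)
series_b = simulate_ar1(PHI, N, rng)

cross_corr = pearson(series_a, series_b)
acf_a = lag1_autocorrelation(series_a)
acf_b = lag1_autocorrelation(series_b)

print(f"Pearson correlation between the two series: {cross_corr:.3f}")
print(f"Lag-1 autocorrelation of series A: {acf_a:.3f}")
print(f"Lag-1 autocorrelation of series B: {acf_b:.3f}")
```

Under these assumptions, comparing the series point by point says little about their shared origin, whereas a measure computed per series places both near the same value, which is the kind of feature a clustering algorithm could then group on.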

FAPESP's process: 10/05062-6 - Wavelet-based clustering for data streams.
Grantee: Cássio Martini Martins Pereira
Support Opportunities: Scholarships in Brazil - Doctorate