Multimodal data fusion for sensitive scene localization

Moreira, Daniel; Avila, Sandra; Perez, Mauricio; Moraes, Daniel; Testoni, Vanessa; Valle, Eduardo; Goldenstein, Siome; Rocha, Anderson

Author(s):	Moreira, Daniel [1]; Avila, Sandra [1, 2]; Perez, Mauricio [1]; Moraes, Daniel [1]; Testoni, Vanessa [3]; Valle, Eduardo [2]; Goldenstein, Siome [1]; Rocha, Anderson [1]. Total number of authors: 8
Author affiliation(s):	[1] Univ Estadual Campinas, Inst Comp, Campinas, SP - Brazil; [2] Univ Estadual Campinas, Sch Elect & Comp Engn, Campinas, SP - Brazil; [3] Samsung Res Inst Brazil, Campinas, SP - Brazil. Total number of affiliations: 3
Document type:	Scientific article
Source:	Information Fusion; v. 45, p. 307-323, Jan. 2019.
Web of Science citations:	4
Abstract
The very idea of hiring humans to avoid the indiscriminate spread of inappropriate sensitive content online (e.g., child pornography and violence) is daunting. The inherent data deluge and the tediousness of the task call for more adequate approaches, and set the stage for computer-aided methods. If running in the background, such methods could readily cut the stream flow at the very moment of inadequate content exhibition, being invaluable for protecting unwary spectators. Except for the particular case of violence detection, related work on sensitive video analysis has mostly focused on deciding whether or not a given stream is sensitive, leaving the localization task largely untapped. Identifying when a stream starts and ceases to display inappropriate content is key for live streams and video on demand. In this work, we propose a novel multimodal fusion approach to sensitive scene localization. The solution is general purpose: it can be applied to diverse types of sensitive content without modifying any of its steps. We leverage the multimodal nature of video data (e.g., still frames, video space-time, audio stream) to effectively single out frames of interest. To validate the solution, we perform localization experiments on pornographic and violent video streams, two of the commonest types of sensitive content, and report quantitative and qualitative results. The results show, for instance, that the proposed method only misses about five minutes in every hour of streamed pornographic content. Finally, for the particular task of pornography localization, we also introduce the first frame-level annotated pornographic video dataset to date, which comprises 140 h of video, freely available for downloading.

FAPESP grant:	17/12646-3 - Déjà vu: temporal, spatial, and characterization coherence of heterogeneous data for integrity analysis and interpretation
Grantee:	Anderson de Rezende Rocha
Support type:	Research Grant - Thematic
