Determining relevance of social-media posts for forensic event analysis

José Dorivaldo Nascimento Souza Júnior

Author(s):	José Dorivaldo Nascimento Souza Júnior
Total Authors:	1
Document type:	Doctoral Thesis
Press:	Campinas, SP.
Institution:	Universidade Estadual de Campinas (UNICAMP). Instituto de Computação
Defense date:	2025-01-14
Examining board members:	Anderson de Rezende Rocha; Cristina Nader Vasconcelos; João Paulo Papa; Marcelo da Silva Reis; Levy Boccato
Advisor:	Anderson de Rezende Rocha
Abstract
When a large-scale forensic event occurs, related posts spread quickly on social networks. These posts can be valuable for later forensic investigation because they capture different perspectives at various moments of the event. However, analysis of social media data about an event is often hindered by an overwhelming volume of irrelevant items retrieved during collection, such as memes or images from previous events. Manually sanitizing these datasets is infeasible, as they may contain thousands of items. Therefore, we investigated machine learning techniques that rely on limited labeled data to expedite this process and reduce the required human effort. Our work followed three main directions.
The first direction focused on representing posts for later classification, experimenting with different pre-trained neural networks. A single descriptor may be insufficient for classifying social media posts, as they tend to be multi-modal. Additionally, even for uni-modal visual classification, distinct aspects of an image may vary in relevance for understanding the event. Thus, we explored combining various image and text models with fusion techniques to consolidate different feature vectors into a single descriptor for classification.
The second direction addressed classification with minimal annotated samples. Labeling hundreds or thousands of posts for a new event is costly and often impractical in real scenarios, so models for this problem should be able to learn from only a few annotations. To this end, we studied semi-supervised techniques, from graph-based methods to pseudo-labeling. Semi-supervised methods generally mitigate the scarcity of annotations by incorporating knowledge from unlabeled data into the model. Additionally, we explored using data from previous events to diversify the training set, based on the hypothesis that posts unrelated to an event may share similarities across different events in a way that related posts do not.
The third direction aimed to introduce interactivity into the pipeline. Another way to address the limited availability of labeled data is to focus on the key instances that provide the most value for training. Starting from a fully unlabeled dataset, the idea was to choose a subset of the data using instance selection and request labels for that subset from an oracle, who in a real-world scenario might be the forensic expert. After initial training on the acquired labels, we experimented with active learning, using the model's uncertainty about instances as an additional selection criterion.
Through a series of experiments, we demonstrate that these research directions significantly enhance the overall performance of automated methods for this task. Our findings suggest that the approaches taken in this work strengthen the analysis of large-scale social media datasets, making forensic investigations more feasible, efficient, and accurate. (AU)
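
The uncertainty-based selection mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes prediction entropy as the uncertainty measure and a two-class output (irrelevant, relevant), and the names `entropy`, `select_most_uncertain`, `predicted`, and `budget` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(probabilities, budget):
    """Return the indices of the `budget` posts with the highest entropy.

    `probabilities[i]` is the predicted class distribution for unlabeled
    post i, assumed here to be [P(irrelevant), P(relevant)].
    """
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: entropy(probabilities[i]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical classifier outputs for five unlabeled posts.
predicted = [
    [0.98, 0.02],
    [0.55, 0.45],
    [0.10, 0.90],
    [0.50, 0.50],
    [0.70, 0.30],
]

# Indices of the posts to send to the oracle (e.g. the forensic expert).
to_label = select_most_uncertain(predicted, budget=2)
print(to_label)  # [3, 1]
```

In this toy input, the posts at indices 3 and 1 are the ones whose predictions are closest to an even split, so they would be sent for annotation first. Other uncertainty measures (for example, the margin between the top two classes) would fit the same pattern.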

FAPESP's process:	20/02241-9 - Pattern discovery and event highlight from heterogenous sources
Grantee:	José Dorivaldo Nascimento Souza Júnior
Support Opportunities:	Scholarships in Brazil - Doctorate (Direct)
