A statistical analysis of intrinsic bias of network security datasets for training machine learning mechanisms

Silva, Joao Vitor V.; de Oliveira, Nicollas R.; Medeiros, Dianne S. V.; Lopez, Martin Andreoni; Mattos, Diogo M. F.

Author(s):	Silva, Joao Vitor V. ; de Oliveira, Nicollas R. ; Medeiros, Dianne S. V. ; Lopez, Martin Andreoni ; Mattos, Diogo M. F.
Total Authors:	5
Document type:	Journal article
Source:	ANNALS OF TELECOMMUNICATIONS; v. N/A, 17 pp., 2022-02-12.
Abstract
Machine learning mechanisms for network intrusion detection systems lack accurate evaluation, comparison, and deployment due to the scarcity of well-constructed datasets. In this paper, we propose a statistical analysis of the features contained in four widely used security datasets. We conclude that the analyzed datasets should not be used as benchmarks for creating novel anomaly-based mechanisms for intrusion detection systems. The analyzed datasets introduce a biased classification because their features are over-correlated and most of the features can, on their own, completely distinguish normal flows from attack flows. Our proposed methodology analyzes the correlation among features instead of checking for redundant values or data imbalance. The results align with the performance of three machine learning techniques. We show that biased classification occurs due to a significant difference between attack and normal data. The syntactically generated features are statistically different between normal and attack classes, which implies overfitting in the machine learning approaches. (AU)
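
The two checks named in the abstract, correlation among features and per-feature separation between normal and attack flows, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the toy data, and the 0.9 correlation threshold are hypothetical, and the two-sample Kolmogorov-Smirnov statistic is only one plausible way to measure how well a single feature separates the classes. The abstract does not say which statistical test the paper uses.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant feature: correlation is undefined, report 0
    return cov / (sx * sy)


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    gap = 0.0
    for p in sorted(set(a) | set(b)):
        cdf_a = sum(v <= p for v in a) / len(a)
        cdf_b = sum(v <= p for v in b) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def bias_report(features, labels, corr_threshold=0.9):
    """features: dict name -> values; labels: 0 (normal) or 1 (attack).

    corr_threshold is an assumed cut-off, not a value from the paper.
    """
    correlated = []
    for f, g in combinations(features, 2):
        r = pearson(features[f], features[g])
        if abs(r) >= corr_threshold:
            correlated.append((f, g, round(r, 3)))
    separation = {}
    for name, values in features.items():
        normal = [v for v, y in zip(values, labels) if y == 0]
        attack = [v for v, y in zip(values, labels) if y == 1]
        separation[name] = ks_statistic(normal, attack)
    return correlated, separation


# Toy flows (made up for illustration): four normal, four attack.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "duration": [1, 2, 3, 4, 10, 11, 12, 13],
    "bytes": [10, 20, 30, 40, 100, 110, 120, 130],
    "ttl": [64, 128, 64, 128, 64, 128, 64, 128],
}
correlated, separation = bias_report(features, labels)
print(correlated)
print(separation)
```

In this toy data, a KS statistic of 1.0 for a feature means its normal and attack values do not overlap at all, which is the kind of "complete distinction" the abstract warns about; a value near 0 means the feature carries little class information by itself.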

FAPESP's process:	18/23062-5 - MEGACHAIN: blockchain for integration, privacy and audit of megacity systems
Grantee:	Célio Vinicius Neves de Albuquerque
Support Opportunities:	Regular Research Grants
