Unsupervised Anomaly Detection for Industry Cybersecurity Operations

LEON CASTELL, ALEJANDRO
2022/2023

Abstract

In this thesis, we analyze an industry network traffic dataset covering hundreds of sensitive services that belong to the infrastructure of an oil and gas company. The main objective is to detect possible network and cybersecurity operations events that manifest as behavioural changes in unlabelled data. Because the dataset comes from a real-world deployment, it carries no labels, so we work in an unsupervised learning framework. We implement an automatic detection system for behavioural changes in server and user traffic. The system proactively detects long-term, subtle events, using an observation window that spans a complete eight-hour user work shift. We begin by grounding our intuition in the comparatively simpler context of server data: we cluster the data in a feature space and attempt to characterize them with a Hidden Markov Model. Then, after exploring the difficulty of automatically learning a discrete Markov chain representation for the user data, we resort to state thresholds estimated by field experts. In this setting, we analyze two cases: independent univariate state representations, one for each observed metric, and a single multivariate state representation. The univariate approach detects uncharacteristic path probabilities for each metric independently, whereas the multivariate approach considers all metrics simultaneously, so that it can detect not only unlikely state transitions but also rare multivariate states. Finally, the system ranks user IP addresses by behavioural change score, allowing network administrators to plan their work capacity more efficiently.
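The univariate approach described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the threshold values, the additive smoothing, the use of the negative mean log-probability as a change score, and all function names and IP addresses below are assumptions made for illustration only. Each metric value is mapped to a discrete state by fixed thresholds, a first-order Markov transition matrix is estimated from a baseline period, and each observation window is scored by how unlikely its state path is under that matrix.

```python
import math
from bisect import bisect_right
from collections import Counter


def discretize(values, thresholds):
    """Map each metric value to a state index using sorted thresholds."""
    return [bisect_right(thresholds, v) for v in values]


def fit_transitions(states, n_states, alpha=1.0):
    """Estimate a first-order transition matrix with additive smoothing."""
    counts = Counter(zip(states, states[1:]))
    matrix = []
    for i in range(n_states):
        total = sum(counts[(i, j)] for j in range(n_states)) + alpha * n_states
        matrix.append([(counts[(i, j)] + alpha) / total for j in range(n_states)])
    return matrix


def mean_log_prob(states, matrix):
    """Average log-probability per transition along a state path."""
    steps = list(zip(states, states[1:]))
    if not steps:
        raise ValueError("a path needs at least two observations")
    return sum(math.log(matrix[i][j]) for i, j in steps) / len(steps)


def rank_by_change(baseline, windows, thresholds):
    """Rank IP addresses from most to least unusual observation window."""
    n_states = len(thresholds) + 1
    scores = {}
    for ip, values in windows.items():
        matrix = fit_transitions(discretize(baseline[ip], thresholds), n_states)
        # Higher score = the window's path is less likely under the baseline.
        scores[ip] = -mean_log_prob(discretize(values, thresholds), matrix)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical data: one metric per IP, thresholds chosen arbitrarily.
thresholds = [10.0, 100.0]
baseline = {
    "10.0.0.1": [1, 2, 3, 2, 1, 4, 3, 2, 1, 2],
    "10.0.0.2": [5, 50, 5, 50, 5, 50, 5, 50, 5, 50],
}
windows = {
    "10.0.0.1": [2, 500, 3, 700, 1, 900],   # jumps into a state never seen before
    "10.0.0.2": [5, 50, 5, 50, 5, 50],      # same pattern as the baseline
}
ranking = rank_by_change(baseline, windows, thresholds)
for ip, score in ranking:
    print(f"{ip}: {score:.3f}")
```

In this toy example the first address ranks highest, because its window visits a state that never occurs in its baseline. A multivariate variant along the lines of the abstract would replace the single state index with a tuple of per-metric states, which additionally allows rare joint states to be flagged.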
2022
Data Acquisition
Machine Learning
Data Analysis
Files in this item:
File: Leon_Alejandro.pdf (restricted access)
Size: 3.64 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are released under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/54843