STIXnet: entity and relation extraction from unstructured CTI reports

The increased frequency of cyber attacks against organizations, and their potentially devastating effects, have raised awareness of the severity of these threats. To proactively harden their defenses, organizations have started to invest in Cyber Threat Intelligence (CTI), the field of cybersecurity that deals with the collection, analysis, and organization of intelligence on attackers and their techniques. By profiling the activity of a particular threat actor, and thus knowing the types of organizations it targets and the kinds of vulnerabilities it exploits, it is possible not only to mitigate its attacks but also to prevent them. Although the sharing of this type of intelligence is facilitated by several standards such as STIX (Structured Threat Information eXpression), most of the data still consists of reports written in natural language. This format can be highly time-consuming for CTI analysts, who may need to read an entire report and label entities and relations in order to generate an interconnected graph from which the intelligence can be extracted. In this thesis, done in collaboration with Leonardo S.p.A., we present STIXnet, a modular and extensible system for the extraction of entities and relations from natural language CTI reports. The tool is embedded in a larger platform developed by Leonardo, called Cyber Threat Intelligence System (CTIS), and therefore inherits some of its features, such as an extensible knowledge base that also acts as a database of the entities to extract. STIXnet uses techniques from Natural Language Processing (NLP), the branch of computer science that studies the ability of a computer program to process and analyze natural language data. This field has recently been revolutionized by the increasing popularity of Machine Learning, which allows for more efficient algorithms and better results.
After looking for known entities retrieved from the knowledge base, STIXnet analyzes the semantic structure of the sentences in order to extract possible new entities and predicts the Tactics, Techniques, and Procedures (TTPs) used by the attacker. Finally, an NLP model extracts relations between these entities and converts them to comply with the STIX 2.1 standard, thus generating an interconnected graph that can be exported and shared. STIXnet can also be improved continuously and automatically through feedback from a human analyst who, by highlighting false positives and false negatives in the processing of a report, can trigger a fine-tuning process that increases the tool's overall accuracy and precision. This framework can help defenders see at a glance all the intelligence gathered on a particular threat actor and thus deploy effective threat detection, perform attack simulations, and strengthen their defenses. Together with the CTIS platform, it can help organizations stay one step ahead of attackers and better protect themselves against Advanced Persistent Threats (APTs).
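
The pipeline described in the abstract (knowledge-base lookup, relation extraction, STIX 2.1 export) can be illustrated with a minimal sketch. This is not the STIXnet implementation: the knowledge base contents, the entity names ExampleGroup and ExampleRAT, the function names, and the verb-matching rule that stands in for the NLP relation model are all assumptions made for illustration. Only the Python standard library is used, and the STIX objects carry only a minimal set of properties.

```python
import json
import re
import uuid
from datetime import datetime, timezone

# Hypothetical knowledge base: entity name -> STIX 2.1 object type.
KNOWLEDGE_BASE = {
    "ExampleGroup": "threat-actor",
    "ExampleRAT": "malware",
}

# Toy stand-in for the NLP relation model: a "use" verb between two entities.
USES_VERB = re.compile(r"\buse[sd]?\b", re.IGNORECASE)


def find_known_entities(text, knowledge_base):
    """Return (name, stix_type, offset) for every knowledge-base name found in text."""
    hits = []
    for name, stix_type in knowledge_base.items():
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            hits.append((name, stix_type, match.start()))
    return sorted(hits, key=lambda hit: hit[2])


def _common_fields(stix_type):
    """Properties shared by the STIX 2.1 objects built in this sketch."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    return {
        "type": stix_type,
        "spec_version": "2.1",
        "id": f"{stix_type}--{uuid.uuid4()}",
        "created": now,
        "modified": now,
    }


def make_sdo(name, stix_type):
    """Build a minimal STIX domain object for a named entity."""
    obj = _common_fields(stix_type)
    obj["name"] = name
    if stix_type == "malware":
        obj["is_family"] = False  # STIX 2.1 requires this property on malware
    return obj


def make_relationship(source, target, relationship_type):
    """Build a STIX relationship object linking two domain objects."""
    rel = _common_fields("relationship")
    rel["relationship_type"] = relationship_type
    rel["source_ref"] = source["id"]
    rel["target_ref"] = target["id"]
    return rel


def build_bundle(text, knowledge_base):
    """Extract known entities and toy 'uses' relations, then wrap them in a bundle."""
    objects = {}  # entity name -> STIX object, so each entity is created once
    relationships = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        hits = find_known_entities(sentence, knowledge_base)
        for name, stix_type, _ in hits:
            if name not in objects:
                objects[name] = make_sdo(name, stix_type)
        # Look at consecutive entity pairs within the same sentence.
        for (src, src_type, src_pos), (dst, dst_type, dst_pos) in zip(hits, hits[1:]):
            between = sentence[src_pos + len(src):dst_pos]
            if (src_type, dst_type) == ("threat-actor", "malware") and USES_VERB.search(between):
                relationships.append(make_relationship(objects[src], objects[dst], "uses"))
    return {
        "type": "bundle",
        "id": f"bundle--{uuid.uuid4()}",
        "objects": list(objects.values()) + relationships,
    }


report = "ExampleGroup used ExampleRAT against several banks. The campaign started in spring."
bundle = build_bundle(report, KNOWLEDGE_BASE)
print(json.dumps(bundle, indent=2))
```

The resulting bundle is a small graph: two nodes (the threat actor and the malware) joined by one "uses" edge. In a real system the regular-expression rule would be replaced by a trained relation extraction model, and analyst feedback on false positives and false negatives would feed back into that model.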

MARCHIORI, FRANCESCO

2021/2022

Record

	Faculty/Department
	
				Dipartimento di Matematica "Tullio Levi-Civita" - DM
			
	Degree programme
	
				CYBERSECURITY Master's degree (Laurea Magistrale, D.M. 270/2004)
			
	Academic year
	
				2021
			
	English title
	
				STIXnet: entity and relation extraction from unstructured CTI reports
			
	Keywords
	
				CTI
NLP
Machine Learning
			
	Supervisor
	
				CONTI, MAURO
			
	Appears in collections:
	
				Master's theses

Files in this item:

File	Size	Format
Marchiori_Francesco.pdf (open access)	2.99 MB	Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/33779