Unifying subgraph matching and subsequence joins in hybrid graphs

The Property Graph Model (PGM) describes a multigraph consisting of a set of nodes V and a set of edges E, in which each object (node or edge) can carry labels and a set of key-value pairs called properties. The PGM is a powerful data model for describing real-world scenarios such as fraud detection, social networks, where users are modelled as nodes and the connections between them as edges, and IP networks, where routers are nodes and the links between them are edges. Graphs are also employed in medicine to describe protein-to-protein relationships and to track how a given protein reacts to a given drug. Thanks to their versatility and expressive power, graphs have become a natural data model for relationships, because they store connections as first-class citizens. Graph Database Management Systems (GDBMSs) have accordingly been favoured for their ability to natively store concepts (nodes) and their relationships (edges). However, in the era of Big Data, where every component of the graph changes continuously, existing data models are unable to handle the dynamic nature of the data. New requirements are emerging: handling Temporal Graphs within the PGM, i.e. timestamped properties, nodes and edges, and storing flows of data in the graph in the form of Time Series, so as to reason about data that evolves in terms of both graph connections and graph properties, e.g. finding all the nodes of a graph that exhibit a similar pattern with respect to a changing property. In this work, the Hygraph data model is formalized as a hybrid model that combines Temporal Property Graphs with Time Series. The work then defines the subsequence join for discovering recurrent patterns within one or more time series, and studies the combination of the subsequence join with subgraph matching to discover new insights about the data managed with the newly defined model.
The thesis concludes with an evaluation of prototypes of the Hygraph model, leveraging both single-storage and polyglot-storage systems, using a Time Series Database (TSDB) and a GDBMS, in order to understand which implementation is best suited to realizing the Hygraph model and to performing subsequence joins together with subgraph matching.
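
To make the combination of subgraph matching and subsequence join more concrete, the following is a minimal illustrative sketch, not the formal Hygraph definition given in the thesis. It assumes that time series are stored as plain lists attached to nodes, that the graph pattern is a single labelled edge, and that two subsequences "join" when their Euclidean distance is at most a threshold `eps`. All names (`Node`, `Edge`, `subsequence_join`, `match_edges_with_similar_series`) are hypothetical and chosen only for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import sqrt


@dataclass
class Node:
    id: str
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)  # static key-value pairs
    series: dict = field(default_factory=dict)      # assumed: key -> list of floats


@dataclass
class Edge:
    src: str
    dst: str
    label: str


def distance(a: list, b: list) -> float:
    """Euclidean distance between two equal-length windows (an assumed measure)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def subsequence_join(s: list, t: list, m: int, eps: float) -> list:
    """Naive join: offsets (i, j) whose length-m windows are within eps."""
    return [
        (i, j)
        for i in range(len(s) - m + 1)
        for j in range(len(t) - m + 1)
        if distance(s[i:i + m], t[j:j + m]) <= eps
    ]


def match_edges_with_similar_series(nodes, edges, label, key, m, eps):
    """Match the one-edge pattern (a)-[label]->(b), keeping only matches
    whose endpoint series for `key` share at least one similar subsequence."""
    result = []
    for e in edges:
        if e.label != label:
            continue
        pairs = subsequence_join(
            nodes[e.src].series[key], nodes[e.dst].series[key], m, eps
        )
        if pairs:
            result.append((e.src, e.dst, pairs))
    return result


# Toy IP network: routers as nodes, links as edges, "load" as a time series.
nodes = {
    "r1": Node("r1", {"Router"}, series={"load": [1, 2, 3, 9, 9]}),
    "r2": Node("r2", {"Router"}, series={"load": [0, 1, 2, 3, 0]}),
    "r3": Node("r3", {"Router"}, series={"load": [5, 5, 5, 5, 5]}),
}
edges = [Edge("r1", "r2", "LINK"), Edge("r2", "r3", "LINK")]

matches = match_edges_with_similar_series(nodes, edges, "LINK", "load", m=3, eps=0.5)
print(matches)
```

In this toy network only the link between `r1` and `r2` is returned, because the window `[1, 2, 3]` occurs in both load series. The nested-loop join is quadratic in the series lengths and is meant only to show the idea; a real prototype would presumably delegate the pattern match to a GDBMS and the similarity search to a TSDB or an indexed method.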
