Design and implementation of an Apache Kafka cluster

This thesis reports on the work carried out during an internship at Acciaierie Valbruna S.p.A. The internship aimed to deploy and manage an Apache Kafka environment for testing purposes. Apache Kafka is an open-source platform for handling real-time data feeds between different data stores in a reliable and scalable way. The environment was first deployed with Ansible, an infrastructure-as-code tool that automates the deployment and management of Linux and Windows systems; the same environment was then deployed as a cluster of containers using Podman, an alternative to Docker. The Apache Kafka cluster replicates a MySQL database onto an Oracle DB 10 database (acting as an Oracle DB 9 through emulation) by leveraging a suite of plugins developed for Kafka Connect, a tool distributed with Apache Kafka that facilitates connecting the cluster to several kinds of data systems. A schema registry is connected to the cluster to reduce space consumption and to make changes to the structure of the data sources easier to handle. This registry is a repository that manages and validates the schemas used in the system, in this case the schemas of the source database.
The first part of this document analyzes the requirements put forward by the company and the design decisions that arose from the constraints and challenges of the test environment. The second part describes the main components of the infrastructure in detail. The third part covers the actual deployment of the environment with Ansible and shows how this approach led to a much more efficient deployment process that is easier to manage than a manual one. The document then describes the deployment of the same environment as a set of containers, highlighting the challenges and constraints of that approach and comparing the pros and cons of each method. The final part covers the feedback received from the company and the improvements that might be needed before the cluster can be used in production.
