
Industrial applications of Data Science

NAGABANDI, MANOJ KUMAR
2022/2023

Abstract

This thesis has two objectives. The first is to detect duplicate addresses in a dataset containing addresses from 142 countries, written in various languages. The primary challenge is the lack of training data, which is addressed through two approaches: active learning, designed to learn efficiently from minimal training data, and unsupervised learning techniques, such as clustering and dimensionality reduction, applied to vector embeddings generated by large language models. The second objective is to forecast the alarm and production times of machines over a 14-day horizon, using estimated daily production volumes. Because the two target variables are interdependent, experiments are conducted with statistical methods, machine learning, and deep learning models to find the best way to forecast them. The Temporal Fusion Transformer is a key component of this research, offering promising capabilities for multi-horizon forecasting across multiple time series, with a strong emphasis on interpretability. In addition, XGBoost, a gradient-boosting algorithm, is used for its strength in forecasting multiple time series, along with the Croston and TSB models, which work well with intermittent data. Notably, the transformer model leverages additional features during training that may not be available during the forecasting phase.
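
To illustrate why the Croston and TSB models suit intermittent data, the following is a minimal sketch of both methods in plain Python. It is not the thesis's implementation, which is not shown on this page. The function names `croston` and `tsb`, the default smoothing parameters, and the initialisation choices (first non-zero demand as the starting size, a starting demand probability of 0 for TSB) are assumptions made for illustration.

```python
def croston(demand, alpha=0.1):
    """One-step-ahead Croston forecast after observing `demand`.

    Smooths the non-zero demand size and the interval between
    demands separately, updating both only when demand occurs.
    """
    size = None        # smoothed non-zero demand size
    interval = None    # smoothed number of periods between demands
    periods_since = 1  # periods counted up to and including the current one
    for y in demand:
        if y > 0:
            if size is None:
                # Assumed initialisation: first observed demand and interval.
                size, interval = float(y), float(periods_since)
            else:
                size += alpha * (y - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    return 0.0 if size is None else size / interval


def tsb(demand, alpha=0.1, beta=0.1):
    """One-step-ahead TSB (Teunter-Syntetos-Babai) forecast.

    Smooths the demand size on non-zero periods and the demand
    probability on every period, so the forecast decays during
    long runs of zeros.
    """
    size = None
    prob = 0.0  # assumed starting demand probability
    for y in demand:
        if y > 0:
            size = float(y) if size is None else size + alpha * (y - size)
            prob += beta * (1.0 - prob)
        else:
            prob += beta * (0.0 - prob)
    return 0.0 if size is None else prob * size
```

In this sketch, a trailing run of zeros leaves the Croston forecast unchanged, because it only updates when demand occurs, while the TSB forecast shrinks with each zero period.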
2022
Industrial applications of Data Science
Forecasting
Deduplication
Temporal Fusion
Transformers
Files in this item:
File: Master_Thesis_Manoj_Kumar_Nagabandi.pdf
Access: restricted
Size: 4.57 MB
Format: Adobe PDF

The text of this website © Università degli studi di Padova. Full texts are published under a non-exclusive license. Metadata are under a CC0 license.

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.12608/52273