A performance evaluation of pixel-wise voting networks for object pose estimation

Computer vision is an interdisciplinary field that studies algorithms and techniques enabling computers to recognise subjects and extract useful information from an image or other visual input. In other words, it aims to make machines capable of reconstructing the context around an image, giving it real meaning. Among the most important tasks in this area is 6DoF pose estimation, i.e. estimating the pose (translation and rotation) of an object from an input image. This thesis uses PVNet, one of the most recent and best-known methods in the literature, to investigate several questions: the effectiveness of introducing the DSAC module, the influence of the PnP variant on performance, the validity of using synthetic datasets and the search for an effective strategy for generating them, the dependence of the network on the number of real images in the training set, and the search for optimal parameters for the score and loss functions. PVNet variants were trained on the LINEMOD dataset. The experiments showed that: 1) the best configuration is the one with DSAC and EPnP; 2) the more varied and realistic the data produced by the synthetic dataset generation strategy, the more effective it is; 3) the network is highly dependent on real data during training; 4) careful calibration of the parameters of the DSAC module and the loss function allows the network to achieve very good results.

Computer vision is an interdisciplinary field of study concerned with algorithms and techniques that allow computers to recognise subjects and extract useful information from an image or other visual input. In other words, it aims to make machines able to reconstruct a context around the image, giving it genuine meaning. One of the most important tasks in this field is pose estimation with 6 degrees of freedom, that is, determining the pose (translation and rotation) of an object given an input image. This thesis uses PVNet, one of the most recent and best-known methods in the literature, to run several tests: the effectiveness of introducing the DSAC module, the influence of the PnP type on performance, the validity of using synthetic datasets and the search for an effective strategy for generating them, the dependence of the network on the number of real images in the training set, and the search for optimal parameters for the score and loss functions. The PVNet variants were trained using the LINEMOD dataset. The experiments showed that: 1) the best configuration is the one with DSAC and EPnP; 2) the more varied and close to reality the data produced by the synthetic dataset generation strategy, the more effective it is; 3) the network is highly dependent on real data in the training phase; 4) the right tuning of the parameters of the DSAC module and of the loss function can lead the network to very good results.