Recovering Occlusion Aware Depth and Image using Rotating Point Spread Function

Depth estimation is a key challenge in computer vision, particularly given the limitations of traditional techniques. In monocular depth estimation, occlusion boundaries are one of the most critical issues: they appear as depth discontinuities along object edges, where a foreground object occludes the objects in the background. They lead to incomplete or inaccurate depth predictions, making it difficult to extract accurate geometric information from the scene. Recent studies have shown that coded-aperture methods using phase and/or amplitude masks can encode strong depth cues in 2D images through depth-dependent point spread functions (PSFs). In this thesis, we propose a new approach to the problem of occlusion boundaries, with the aim of improving depth estimation. In our case, the depth dependency is obtained with a phase mask that is jointly optimized with the weights of a convolutional neural network in an end-to-end manner. A complete camera model simulates the imaging system, which can reliably estimate the depth map from a single RGB image. Unlike the most common methods for handling occlusion boundaries in monocular depth estimation, our final pipeline includes a preconditioning step that aims to reduce the effort required of the neural network, shortening training time and improving the accuracy of the final estimate. This preconditioning step computes a rough depth estimate using a well-known deblurring strategy that reconstructs the image details in the regions corresponding to a specific depth level. In this way a layered image is produced in advance, and the final neural network only has to perform an association between the layers.
Moreover, to address the image quality degradation caused by PSF blurring, our network recovers the all-in-focus image along with the depth estimate, starting from the output of the preconditioning step.

Depth estimation is a challenge in every area of computer vision applications, particularly given the limitations of traditional techniques in photography. In monocular depth estimation, occlusion boundaries are one of the most critical problems; they can be identified as depth discontinuities along object edges, where the foreground object covers the objects in the background. This leads to incomplete or inaccurate depth estimates, making it difficult to extract meaningful information from the scenes under examination. Recent studies have shown that coded-aperture methods, using phase and/or amplitude masks, can encode depth cues in 2D images through depth-dependent point spread functions (PSFs). In this thesis, we propose a new approach to the problem of occlusion boundaries, with the aim of improving the final depth estimate. In our case, the depth dependency is introduced by a phase mask that is optimized together with the weights of a neural network in an end-to-end manner. During development, a camera model is used to simulate the entire system of a real camera, which can reliably estimate the depth map from a single RGB image. Compared with the most common methods for handling occlusion boundaries in monocular depth estimation, our final pipeline uses a preconditioning step, introduced to reduce the total effort required of the network, thereby shortening the total training time and achieving better accuracy in the final estimate.
This preconditioning step computes a rough depth estimate by modifying a well-known noise-reduction algorithm that reconstructs the image details in the regions corresponding to specific depth levels. In this way we obtain a layered image, and the final neural network only has to perform an association between the layers. Moreover, to address the image quality degradation caused by PSF blurring, our network can recover the all-in-focus image together with the depth estimate, starting from the output of the preconditioning step.
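The preconditioning idea described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the thesis: the unnamed "well-known" restoration strategy is replaced here by plain Wiener deconvolution, the optimized phase-mask PSF is replaced by a toy two-lobe PSF whose lobes rotate with depth, the whole scene sits at a single depth (so occlusion is not modeled), and every function name and parameter (`rotating_psf`, `precondition`, `noise_to_signal`, and so on) is hypothetical.

```python
import numpy as np


def rotating_psf(size, angle, separation=3.0, sigma=1.0):
    """Toy two-lobe PSF whose lobes rotate with `angle` (radians); sums to 1."""
    ys, xs = np.mgrid[:size, :size] - size // 2
    dx, dy = separation * np.cos(angle), separation * np.sin(angle)
    psf = np.zeros((size, size))
    for sign in (1, -1):  # two Gaussian lobes placed symmetrically about the centre
        psf += np.exp(-((xs - sign * dx) ** 2 + (ys - sign * dy) ** 2) / (2 * sigma ** 2))
    return psf / psf.sum()


def psf_to_otf(psf, shape):
    """Zero-pad the PSF to `shape`, move its centre to the origin, and take the FFT."""
    padded = np.zeros(shape)
    h, w = psf.shape
    padded[:h, :w] = psf
    padded = np.roll(padded, (-(h // 2), -(w // 2)), axis=(0, 1))
    return np.fft.fft2(padded)


def blur(image, psf):
    """Circular convolution of `image` with `psf`, computed in the Fourier domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(image) * psf_to_otf(psf, image.shape)))


def wiener_deconvolve(image, psf, noise_to_signal=1e-3):
    """Wiener deconvolution with a constant noise-to-signal ratio (an assumed choice)."""
    otf = psf_to_otf(psf, image.shape)
    wiener_filter = np.conj(otf) / (np.abs(otf) ** 2 + noise_to_signal)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * wiener_filter))


def precondition(image, psfs, noise_to_signal=1e-3):
    """Deconvolve with the PSF of each candidate depth; returns a (D, H, W) stack."""
    return np.stack([wiener_deconvolve(image, p, noise_to_signal) for p in psfs])


# Demo: a random scene placed entirely at depth level 2 (hypothetical setup).
rng = np.random.default_rng(0)
sharp = rng.random((64, 64))
angles = np.linspace(0.0, np.pi / 2, 4)      # one assumed rotation angle per depth level
psfs = [rotating_psf(15, a) for a in angles]
observed = blur(sharp, psfs[2])
stack = precondition(observed, psfs)         # layered image, one layer per depth

# Sanity check only: the ground-truth sharp image is not available in practice.
errors = ((stack - sharp) ** 2).mean(axis=(1, 2))
```

Each layer of `stack` is sharp only where the assumed depth matches the true one, which is the kind of cue a downstream network could use to associate pixels with depth levels. In the pipeline described above that association is learned; the comparison against `sharp` here is just a way to check the sketch.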