GPU acceleration of a high-resolution fully compressible Navier-Stokes solver using OpenACC

This work presents the GPU (Graphics Processing Unit) acceleration of URANOS, a finite-difference solver for the unsteady, fully compressible Navier-Stokes equations (Unsteady Robust All-around Navier-Stokes Solver). The code targets modern HPC (High Performance Computing) platforms through MPI (Message Passing Interface) parallelization and the ability to run on multi-GPU architectures. The port was carried out with the OpenACC programming model, a high-level, directive-based paradigm that allows a code to run on GPUs independently of the available architecture. OpenACC is therefore simpler and faster to implement than lower-level strategies such as CUDA (Compute Unified Device Architecture) for NVIDIA hardware; however, this portability and simplicity come at the cost of slightly lower parallel performance. The port was developed on the Marconi100 and Galileo100 clusters at CINECA, both accelerated by NVIDIA Tesla V100 GPUs. The GPU version of the code was validated through a three-dimensional DNS (Direct Numerical Simulation) of a pressure-driven turbulent channel flow, using a published result for the same test case as reference. A later chapter compares the computational power of a single NVIDIA V100 GPU with that of a single IBM POWER9 AC922 CPU core: a 151x speed-up was achieved. Finally, this work shows that a single NVIDIA V100 GPU can be up to 6.2x faster than 20 Intel Xeon E5-2698 v4 CPU cores, that is, an entire CPU node, which is in line with the expected result.
