CPU-GPU Design Space Exploration
In this tutorial, you will learn how to use the Scaling-up of Cluster of Actors on Processing Element (SCAPE) method to deploy a dataflow application on a heterogeneous CPU-GPU architecture.
The following topics are covered in this tutorial:
- Implementation of a Distributed FX (DiFX) correlator with PREESM
- Heterogeneous CPU-GPU Partitioning
- Heterogeneous CPU-GPU Code Generation
- Execution on laptop or High-Performance Computing (HPC) systems
Prerequisite:
- Install PREESM (see getting PREESM)
- This tutorial is designed for Unix systems
Tutorial created on 8.5.2024 by O. Renaud & E. Michel
Introduction
Principle
SCAPE is a clustering method integrated into PREESM, designed to partition GPU-compatible computations onto the GPU and CPU-compatible computations onto the CPU, evaluating offloading gains and adjusting granularity [1]. The method happens upstream of the standard resource allocation process. It takes as input the GPU-oriented System-Level Architecture Model (GSLA) Model of Architecture (MoA) and the dataflow Model of Computation (MoC), and partially replicates the standard steps, avoiding the need for a complete flattening of the graph:
- Extraction:
- GPU-friendly pattern identification: This task identifies data-parallel patterns, such as URC and SRV, that are well suited to GPU execution [1].
- Subgraph generation: The identified actors are then isolated into a subgraph where rates are adjusted to contain all data parallelism.
- Scheduling: The chosen scheduling strategy for the cluster is the APGAN method.
- Timing: This step estimates the execution time of the subgraph running on a GPU, taking into account the memory transfers between the CPU and the GPU during execution.
- Mapping: This task estimates the parallelism gain and the transfer loss based on GSLA information. If GPU offloading is beneficial, the process proceeds; otherwise, the original SCAPE method generates a cluster of actors mapped on the CPU.
- Translation: The subgraph is translated into a CUDA file and sent to the rest of the resource allocation process after transformation, simplification, and optimization.
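To give a rough idea of what the Translation step targets, the sketch below shows how a simple data-parallel (URC-like) cluster of actors could map onto a single CUDA kernel with explicit host/device transfers. It is a minimal illustration with a placeholder actor computation and placeholder sizes; the code actually emitted by the PREESM code generator is more elaborate and application-specific.

```cuda
// Minimal illustration only: a data-parallel cluster of actors collapsed
// into one CUDA kernel launched over the repetition count.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Placeholder element-wise actor fused into a single kernel.
__global__ void cluster_kernel(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = 2.0f * in[i] + 1.0f;  // stand-in for the actor's computation
    }
}

int main(void) {
    const int n = 1 << 20;                 // repetition count of the cluster
    const size_t bytes = n * sizeof(float);

    float *h_in = (float *)malloc(bytes);
    float *h_out = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h_in[i] = (float)i;

    // Dedicated GPU memory: explicit host<->device transfers, the cost the
    // Mapping step weighs against the expected parallelism gain.
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cluster_kernel<<<blocks, threads>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("out[42] = %f\n", h_out[42]);
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```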
GSLA MoA
The System-Level Architecture Model (S-LAM) provides a structured framework for describing system architectures, including subsets such as the Linear System-Level Architecture Model (LSLA) and the GPU-oriented System-Level Architecture Model (GSLA). GSLA allows the internal parallelism of Processing Elements (PEs) to be modeled, while LSLA requires every PE to be modeled explicitly to remain reliable, leading to differences in how costs are defined. Modeling a GPU with LSLA would require representing each kernel element as a separate processing element, making mapping and scheduling complex. GSLA is chosen for this method because it handles this complexity better while preserving data parallelism.
Below is a GSLA representation composed of one CPU core, one GPU kernel, one main communication node and one GPU communication node:
Use-case: DiFX correlator
The principle of the core of DiFX [2][3] is the following:
- Data Alignment: Given the vast distances separating telescopes, signals they record may encounter delays due to varied travel paths. The correlator aligns these signals in time to ensure synchronicity with the observation moment.
- Correlation: Once aligned, the correlator compares signals from each telescope, performing mathematical correlation by multiplying and summing received signals over a specific time interval.
- Image Formation: Correlated data provide insights into the brightness and structure of observed astronomical objects.
Below is a simplified dataflow representation:
Data Acquisition: Telescope data is retrieved from disk in binary format, devoid of headers. Each binary file (xx.bin) is packed into a 2D array, with one file per antenna. Subsequently, delays between telescopes are computed based on a polynomial from the configuration file (xx.conf).
Floating Point Conversion: Raw integer-encoded data is converted to floating-point numbers. This stage also creates complex numbers with zero imaginary values and divides data into independent channels, stored in a 3D array per polarization per antenna. An “offset” correction is applied to account for geometric signal reception time effects.
Fringe Rotation: Each sample undergoes ‘fringe rotation’ to adjust for telescopes’ relative speeds (Doppler shift). This involves applying a time-varying phase shift to each sample.
FFT of Samples: “N” time samples are transformed into frequency-domain data, making it easier to extract information later.
Cross-correlation (X): Individual frequency channels of each telescope are multiplied and accumulated to form “visibilities” for each FFT block. This process generates unique baseline combinations and integrates sub-integration data over milliseconds, which are averaged over approximately one second to form the final visibility integration.
Accumulation: Cross-correlation values for each FFT block are added to previous iterations in the “visibilities” table, resulting in final visibility products. Phase and amplitude data for each frequency channel and baseline are stored in a vis.out file.
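To make the fringe rotation and cross-correlation stages more concrete, here is a heavily simplified CUDA sketch of those two operations for a single baseline and polarization (no integration windows, no cuFFT step). The kernel names and the phaseRate parameter are illustrative assumptions, not code taken from DiFX, FxCorr or Gcorr.

```cuda
// Simplified sketch of two core correlator computations (single baseline,
// single polarization); the real DiFX kernels are considerably richer.
#include <cuComplex.h>
#include <math.h>

// Fringe rotation: apply a time-varying phase shift to each sample to
// compensate for the relative motion (Doppler shift) of a telescope.
// phaseRate (radians per sample) is an assumed, precomputed parameter.
__global__ void fringe_rotate(cuFloatComplex *samples, int n, float phaseRate) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float phase = phaseRate * (float)i;                   // time-varying phase
        cuFloatComplex rot = make_cuFloatComplex(cosf(phase), sinf(phase));
        samples[i] = cuCmulf(samples[i], rot);                // rotate the sample
    }
}

// Cross-correlation: multiply the spectrum of antenna A by the conjugate of
// antenna B and accumulate the product into the per-channel visibilities.
__global__ void cross_correlate(const cuFloatComplex *specA,
                                const cuFloatComplex *specB,
                                cuFloatComplex *vis, int nChannels) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c < nChannels) {
        cuFloatComplex prod = cuCmulf(specA[c], cuConjf(specB[c]));
        vis[c] = cuCaddf(vis[c], prod);                       // accumulate over FFT blocks
    }
}
```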
(More details on running the different DiFX implementations, notably FxCorr and Gcorr, can be found in the README of the DiFX dataflow model.)
Project Setup
- Download the DiFX project from Preesm-apps
- Launch Preesm and open the project: Click on “File / Import …”, then “General / Existing Projects into Workspace” then locate and import the “org.ietr.preesm.difx” project
- Create a GPU-accelerated architecture, based on a classic CPU architecture: “right click on architecture > generate a custom x86 architecture > select 1 core”. Select the GPU vertex from the palette and drop it on your design. Select a parallelComNode and drop it on your design. Connect each element with an undirectedDataLink. Take the figure below as an example.
- Customize your GPU node: select the node and adjust the parameters to fit your target.
| Property | Value | Comment |
|---|---|---|
| dedicatedMemSpeed | 20000 | Speed of the dedicated memory (in MB/s), typically higher than unified memory. |
| definition | defaultGPU | |
| hardwareId | 1 | Unique identifier for the GPU hardware. |
| id | GPU | Identifier name for the GPU in the system configuration. |
| memoryToUse | dedicated | Memory allocation type: “dedicated” for GPU-exclusive memory or “unified” for shared CPU-GPU memory. |
| memSize | 4000 | Size of the GPU’s memory (in MB), determining the amount of data it can handle. |
| unifiedMemSpeed | 500 | Speed of the unified memory (in MB/s), typically slower but shared between CPU and GPU. |
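To build intuition for how these parameters feed the offload decision taken in the Mapping step, the toy calculation below compares an assumed GPU execution time plus the transfer time implied by dedicatedMemSpeed against an assumed CPU execution time. The numbers and the formula are illustrative assumptions only, not PREESM's internal cost model.

```cuda
// Back-of-the-envelope offload check (illustrative only).
#include <stdio.h>

int main(void) {
    const double dedicatedMemSpeed_MBps = 20000.0; // from the GSLA GPU node
    const double dataVolume_MB = 512.0;            // assumed data moved to + from the GPU
    const double cpuTime_s = 2.0;                  // assumed CPU execution time of the cluster
    const double gpuTime_s = 0.25;                 // assumed GPU execution time of the cluster

    const double transfer_s = dataVolume_MB / dedicatedMemSpeed_MBps;
    const double gpuTotal_s = gpuTime_s + transfer_s;

    printf("CPU: %.3f s, GPU incl. transfers: %.3f s -> offloading is %s\n",
           cpuTime_s, gpuTotal_s,
           gpuTotal_s < cpuTime_s ? "beneficial" : "not beneficial");
    return 0;
}
```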
- Generate your scenario: right-click on your project and select “Preesm > Generate all scenarios”.
- Customize the Codegen.workflow to generate the appropriate CPU-GPU code:
- Open the Codegen.workflow
- Add a new Task vertex to your workflow and name it SCAPE. To do so, simply click on “Task” in the Palette on the right of the editor then click anywhere in the editor.
- Select the new task vertex. In the “Basic” tab of its “Properties”, set the value of the field “plugin identifier” to “scape.task.identifier”.
- In the “Task Variables” tab of its “Properties”, fill in the variables as follows:
| Parameter | Value | Comment |
|---|---|---|
| Level number | 0 | Corresponds to the hierarchical level to coarsely cluster; works with all SCAPE modes 0, 1 and 2. |
| SCAPE mode | 0 | 0: match data parallelism to the target on the specified level; 1: match data and pipeline parallelism to the target on the specified level; 2: match data and pipeline parallelism to the target on all admissible levels. |
| Stack size | 1000000 | Cluster-internal buffers are allocated statically up to this value, then dynamically. |
(For more details, see the Workflow Tasks Reference.)
- Connect the new task to the “scenario” task vertex and to the “PiMM2SrDAG” task vertex as shown in the figure below.
- Right-click on the workflow “/Workflows/Codegen.workflow” and select “Preesm > Run Workflow”.
The workflow execution generates intermediary dataflow graphs that can be found in the “/Algo/generated/” directory. The C code generated by the workflow is contained in the “/Code/generated/” directory.
Execution of the DiFX correlator dataflow model
- Download the IPP (Intel Integrated Performance Primitives for Linux, 2021.11.0, 19 MB, Online, Mar. 27, 2024):
chmod +x l_ipp_oneapi_p_2021.11.0.532.sh
./l_ipp_oneapi_p_2021.11.0.532.sh
Execution on a laptop equipped with a GPU
- Check if your laptop is equipped with an NVIDIA GPU:
lspci | grep -i nvidia
Output:
3D controller: NVIDIA Corporation GA107M [GeForce RTX 2050] (rev a1)
- Install the NVIDIA CUDA drivers & CUDA Toolkit for your system; see NVIDIA’s tutorial
- Check the CUDA compiler & CUDA installation with:
nvcc -V
Output:
→ nvcc: NVIDIA (R) Cuda compiler driver
→ Copyright (c) 2005-2024 NVIDIA Corporation
→ Built on Thu_Mar_28_02:18:24_PDT_2024
→ Cuda compilation tools, release 12.4, V12.4.131
→ Build cuda_12.4.r12.4/compiler.34097967_0
A Makefile is stored in the /Code folder.
- Correct the IPP path to point to your own IPP installation.
- Compile and run:
cmake .
make
./difx
Execution on Grid5000
- Open a Grid5000 account.
The examples here are for the Rennes site.
- Choose the node you need based on its characteristics (number of GPUs, memory …) presented in Rennes:Hardware - grid5000.
- Check the availability of your chosen node on Rennes:node or Rennes:node(production) (more status here).
# ssh connect
ssh orenaud@access.grid5000.fr
ssh rennes
# connect to 1 abacus node (they host NVIDIA GPUs)
oarsub -q production -p abacus1 -I
# copy the folder
scp -r ~/path/Code orenaud@access.grid5000.fr:rennes
A Makefile is stored in the /Code folder.
- Correct the IPP path to point to your own IPP installation (download IPP on the machine first).
- Compile and run:
cmake .
make
./difx
The DiFX execution generates a file, vis.out, containing the visibility information.
Display results
You can visualize your data with the notebook available here.
Import your vis.out file and run the code.
Display of the 4 averaged characteristics per baseline (detailed in the notebook).
To run the original versions of FxCorr and Gcorr, follow the README of the DiFX dataflow model stored in preesm-apps.
References
Acknowledgement
This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 873120.