CPU-GPU Design Space Exploration

In this tutorial, you will learn how to use the Scaling-up of Cluster of Actors on Processing Element (SCAPE) method to deploy a dataflow application on a heterogeneous CPU-GPU architecture.

The following topics are covered in this tutorial:

  • Implementation of a Distributed FX (DiFX) correlator with PREESM
  • Heterogeneous CPU-GPU Partitioning
  • Heterogeneous CPU-GPU Code Generation
  • Execution on laptop or High-Performance Computing (HPC) systems

Prerequisites:

  • Install PREESM: see Getting PREESM
  • This tutorial is designed for Unix systems

Tutorial created on 8.5.2024 by O. Renaud & E. Michel

Introduction

Principle

SCAPE is a clustering method integrated into PREESM, designed to partition GPU-compatible computations onto GPUs and CPU-compatible computations onto CPUs, evaluating offloading gains and adjusting granularity [1]. The method operates upstream of the standard resource allocation process. It takes as input the GPU-oriented System-Level Architecture Model (GSLA) Model of Architecture (MoA) and the dataflow MoC, and partially replicates the standard steps, avoiding the need for a complete flattening:

  • Extraction:
    • GPU-friendly pattern identification: This task identifies data-parallel patterns, such as URC and SRV, that are suitable for GPU offloading [1].
    • Subgraph generation: The identified actors are then isolated into a subgraph where rates are adjusted to contain all data parallelism.
  • Scheduling: The chosen scheduling strategy for the cluster is the APGAN method.
  • Timing: This step estimates the execution time of the subgraph running on a GPU, taking into account the memory transfers to and from the GPU during execution.
  • Mapping: This task estimates the parallelism gain and transfer loss based on GSLA information. If GPU offloading is beneficial, the process proceeds; otherwise, the original SCAPE method generates a cluster of actors mapped onto the CPU.
  • Translation: The subgraph is translated into a CUDA file and sent to the rest of the resource allocation process after transformation, simplification, and optimization.
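
For illustration, the sketch below shows what the CUDA translation of a simple data-parallel (URC-like) cluster could resemble: each firing of the repeated actor becomes one GPU thread, and a host wrapper performs the memory transfers accounted for in the Timing step. The actor name, signatures, and launch configuration are invented for this example and are not the exact output of the PREESM code generator.

    // Hypothetical sketch (not the actual generated code): a data-parallel
    // actor "scale", repeated n times, becomes one CUDA kernel where each
    // thread executes one firing.
    #include <cuda_runtime.h>

    __global__ void scale_kernel(const float *in, float *out, float coeff, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = coeff * in[i];   // one firing of the actor per thread
        }
    }

    // Host-side wrapper: copy the input to the GPU, launch, copy the result back.
    void scale_offload(const float *h_in, float *h_out, float coeff, int n) {
        float *d_in, *d_out;
        cudaMalloc((void **)&d_in,  n * sizeof(float));
        cudaMalloc((void **)&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scale_kernel<<<blocks, threads>>>(d_in, d_out, coeff, n);

        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_in);
        cudaFree(d_out);
    }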

GSLA MoA

The System-Level Architecture Model (S-LAM) provides a structured framework for describing system architectures, including subsets such as the Linear System-Level Architecture Model (LSLA) and the GPU-oriented System-Level Architecture Model (GSLA). GSLA allows internal parallelism to be modeled within Processing Elements (PEs), whereas LSLA requires every PE to be modeled explicitly to remain reliable, leading to differences in cost definition. Modeling a GPU with LSLA requires representing each kernel element as a processing element, which makes mapping and scheduling complex. We choose GSLA for this method because it handles these complexities better while preserving data parallelism.

Below is a GSLA representation composed of one CPU core, one GPU kernel, one main communication node, and one GPU communication node:

Use-case: DiFX correlator

The principle of the core of DiFX [2][3] is the following:

  1. Data Alignment: Given the vast distances separating telescopes, the signals they record arrive with different delays due to varied travel paths. The correlator aligns these signals in time to ensure synchronicity with the observation moment.
  2. Correlation: Once aligned, the correlator compares signals from each telescope, performing mathematical correlation by multiplying and summing received signals over a specific time interval.
  3. Image Formation: Correlated data provide insights into the brightness and structure of observed astronomical objects.

Below is a simplified dataflow representation:

Data Acquisition: Telescope data is retrieved from disk in binary format, devoid of headers. Each binary file (xx.bin) is packed into a 2D array, with one file per antenna. Subsequently, delays between telescopes are computed based on a polynomial from the configuration file (xx.conf).

Floating Point Conversion: Raw integer-encoded data is converted to floating-point numbers. This stage also creates complex numbers with zero imaginary values and divides data into independent channels, stored in a 3D array per polarization per antenna. An “offset” correction is applied to account for geometric signal reception time effects.

Fringe Rotation: Each sample undergoes ‘fringe rotation’ to adjust for telescopes’ relative speeds (Doppler shift). This involves applying a time-varying phase shift to each sample.
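
As a concrete illustration of this stage, the sketch below applies a time-varying phase shift to each complex sample on the GPU. The kernel name, its signature, and the way the phase is derived (a single rotation rate) are simplifying assumptions for this example; the real DiFX fringe rotation uses the delay polynomials from the configuration file.

    // Hypothetical sketch: fringe rotation as a per-sample complex phase shift.
    #include <cuda_runtime.h>
    #include <cuComplex.h>

    __global__ void fringe_rotate(cuFloatComplex *samples, int n,
                                  float rate, float dt) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // Time-varying phase: phi(t) = 2*pi * rate * t, with t = i * dt.
            float phi = 2.0f * 3.14159265f * rate * (i * dt);
            cuFloatComplex rot = make_cuFloatComplex(cosf(phi), -sinf(phi));
            samples[i] = cuCmulf(samples[i], rot);   // apply the shift in place
        }
    }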

FFT of Samples: “N” time samples are transformed into frequency-domain data, making it easier to extract information later.

Cross-correlation (X): Individual frequency channels of each telescope are multiplied and accumulated to form “visibilities” for each FFT block. This process generates unique baseline combinations and integrates sub-integration data over milliseconds, averaged over approximately one second to form the final visibility integration.

Accumulation: Cross-correlation values for each FFT block are added to previous iterations in the “visibilities” table, resulting in final visibility products. Phase and amplitude data for each frequency channel and baseline are stored in a vis.out file.
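
To make the X-step more concrete: for one baseline (antennas i and j) and one FFT block, each visibility channel accumulates X_i(c) * conj(X_j(c)). The sketch below is only an illustrative kernel with invented names; the actual FxCorr/Gcorr code also handles polarizations, normalization, and multiple baselines.

    // Hypothetical sketch: per-channel multiply-accumulate for one baseline.
    // Called once per FFT block; vis[] accumulates across blocks.
    #include <cuda_runtime.h>
    #include <cuComplex.h>

    __global__ void cross_correlate(const cuFloatComplex *fft_i,
                                    const cuFloatComplex *fft_j,
                                    cuFloatComplex *vis, int nchan) {
        int c = blockIdx.x * blockDim.x + threadIdx.x;
        if (c < nchan) {
            cuFloatComplex prod = cuCmulf(fft_i[c], cuConjf(fft_j[c]));
            vis[c] = cuCaddf(vis[c], prod);   // accumulate into the visibility
        }
    }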

(More details on running the different DiFX implementations, notably FxCorr and Gcorr, can be found in the readme of the DiFX dataflow model.)

Project Setup

  • Download the DiFX project from Preesm-apps
  • Launch Preesm and open the project: click on “File / Import…”, then “General / Existing Projects into Workspace”, then locate and import the “org.ietr.preesm.difx” project
  • Generate your architecture: right-click on your project, “Preesm > Generate custom CPU-GPU archi”
    • select the number of CPU cores, e.g. 2
    • select the number of GPU cores, e.g. 1
  • Customize your GPU node: select the node and adjust the parameters to fit your target.
  • Generate your scenarios: right-click on your project, “Preesm > Generate all scenarios”.
  • Customize the Codegen.workflow to generate the appropriate CPU-GPU code:
    • Open the Codegen.workflow
    • Add a new Task vertex to your workflow and name it SCAPE. To do so, simply click on “Task” in the Palette on the right of the editor then click anywhere in the editor.
    • Select the new task vertex. In the “Basic” tab of its “Properties”, set the value of the field “plugin identifier” to “scape.task.identifier”.
    • In the “Task Variables” tab of its “Properties”, fill in the variables as follows:
    Parameter     | Value   | Comment
    Level number  | 0       | Corresponds to the hierarchical level to coarsely cluster; works with SCAPE modes 0 and 1.
    SCAPE mode    | 0       | 0: match data parallelism to the target on the specified level; 1: match data and pipeline parallelism to the target on the specified level; 2: match data and pipeline parallelism to the target on all admissible levels.
    Stack size    | 1000000 | Cluster-internal buffers are allocated statically up to this value, then dynamically.

(For more detail, see the Workflow Tasks Reference.)

  • Connect the new task to the “scenario” task vertex and to the “PiMM2SrDAG” task vertex as shown in the figure below.
    [here insert a figure]
  • Right-click on the workflow “/Workflows/Codegen.workflow” and select “Preesm > Run Workflow”;

The workflow execution generates intermediary dataflow graphs that can be found in the “/Algo/generated/” directory. The C code generated by the workflow is contained in the “/Code/generated/” directory.

Execution of the DiFX correlator dataflow model

Execution on a laptop equipped with a GPU

  • Check whether your laptop is equipped with an NVIDIA GPU: lspci | grep -i nvidia
    Prompt:
    3D controller: NVIDIA Corporation GA107M [GeForce RTX 2050] (rev a1)
  • Install the NVIDIA CUDA drivers & CUDA Toolkit for your system, see NVIDIA’s tutorial
  • Check the CUDA compiler & CUDA installation with nvcc -V (a standalone runtime check is also sketched after this list)
    Prompt:
    → nvcc: NVIDIA (R) Cuda compiler driver
    → Copyright (c) 2005-2024 NVIDIA Corporation
    → Built on Thu_Mar_28_02:18:24_PDT_2024
    → Cuda compilation tools, release 12.4, V12.4.131
    → Build cuda_12.4.r12.4/compiler.34097967_0
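
Optionally, before building the generated code, a small standalone program can confirm that the CUDA runtime actually sees the GPU. The snippet below is a generic check, not part of the PREESM-generated sources; it can be compiled with, e.g., nvcc check.cu -o check.

    // Generic sanity check: list the CUDA devices visible to the runtime.
    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaError_t err = cudaGetDeviceCount(&count);
        if (err != cudaSuccess || count == 0) {
            printf("No CUDA device found: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s, %d SMs, %.1f GiB\n", d, prop.name,
                   prop.multiProcessorCount,
                   prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        }
        return 0;
    }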

A Makefile is stored in the /Code folder.

  • Correct the IPP path to match your local IPP installation
  • Compile and run
    cmake .
    make
    ./difx
    

Execution on Grid5000

The examples here are for the Rennes site.

  • Choose the node you need based on its characteristics (number of GPUs, memory …) presented in Rennes:Hardware - grid5000.
    Check the availability of your chosen node on Rennes:node or Rennes:node(production)
    (more status here)
    # ssh connect
    ssh orenaud@access.grid5000.fr
    ssh rennes
    # connect to 1 abacus node (they host NVIDIA GPUs)
    oarsub -q production -p abacus1 -I
    # copy the folder
    scp -r ~/path/Code orenaud@access.grid5000.fr:rennes
    

    A Makefile is stored in the /Code folder.

  • Correct the IPP path to match your local IPP installation (download IPP on the machine)
  • Compile and run
    cmake .
    make
    ./difx
    

    DiFX execution generates a file containing the visibility information, vis.out.

Display results

You can visualize your data with the notebook available here.
Import your vis.out file and run the code.

Display of the four averaged characteristics per baseline (detailed in the notebook).

Nota Bene:
To run the original versions of FxCorr and Gcorr, follow the readme of the DiFX dataflow model stored on preesm-apps.

References

[1] E. Michel, O. Renaud, A. Deller, K. Desnos, C. Phillips, J.-F. Nezan, Static Dataflow Synthesis for Heterogeneous CPU-GPU Systems, IETR, Swinburne, CSIRO, 202_.

[2] A.T. Deller, S.J. Tingay, M. Bailes, and C. West, DiFX: A software correlator for very long baseline interferometry using multi-processor computing environments, Swinburne, 2007.

[3] A.T. Deller, W.F. Brisken, C.J. Phillips, J. Morgan, W. Alef, R. Cappallo, E. Middelberg, J. Romney, H. Rottmann, S.J. Tingay, and R. Wayth, DiFX2: A more flexible, efficient, robust and powerful software correlator, Swinburne, 2011.


Acknowledgement

This project has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement No 873120.
