Supercomputing Modelling Final Exam Project. Download the report.

Check out the git repo!

Calculation of Density in the SPH Method

Figure: Comparison of the density distribution for different values of the $\alpha$ parameter.

Density formula:

$$\rho_{ij} = \sum_{k = 0}^{M - 1} \rho_k e^{-\alpha \sqrt{ (x_{ij} - x_k)^2 + (y_{ij} - y_k)^2 + (z_{ij} - z_k)^2}}$$

$\rho_k, x_k, y_k, z_k$ --- density and coordinates of the $k$-th input particle.

$M = 65536$ --- number of particles in the input data.

$x_{ij} = -R + 2R \frac{i}{n_x - 1}$

$y_{ij} = 0$

$z_{ij} = -R + 2R \frac{j}{n_z - 1}$
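For reference, a direct serial evaluation of this formula can be written as follows (a minimal sketch; the layout of 'data', with four values per particle, matches the implementations below, while 'R' and the function name are illustrative placeholders):

        #include <cmath>

        const int    M = 65536;    // number of particles in the input data
        const double R = 1.0;      // half-size of the grid domain (placeholder value)

        // data[k * 4 + 0..3] = x_k, y_k, z_k, rho_k for particle k
        void density_grid_serial(double *grid, const double *data,
                int n, double alpha) {
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    double xij = -R + 2 * R * (double)i / (double)(n - 1);
                    double zij = -R + 2 * R * (double)j / (double)(n - 1);

                    double rho = 0;
                    for (int k = 0; k < M; k++) {
                        double dx = data[k * 4 + 0] - xij;
                        double dy = data[k * 4 + 1];          // y_ij = 0
                        double dz = data[k * 4 + 2] - zij;
                        rho += data[k * 4 + 3] *
                               std::exp(-alpha * std::sqrt(dx * dx + dy * dy + dz * dz));
                    }
                    grid[i * n + j] = rho;
                }
            }
        }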

MPI Implementation

Code

Inside main:

        for (int k = 0; k < 8; k++) {
            // Size of total grid
            int n = n_arr[k];

            // Size of the partial grid for each process
            int pn = n / size;

            if (n % size != 0)
                pn++;

            double *grid, *partial_grid;

            // The full grid is padded to pn * size rows so that every
            // process contributes an equal-sized block to MPI_Gather
            grid         = new double [pn * size * n];
            partial_grid = new double [pn * n];

            for (int p = 0; p < 4; p++) {
                double alpha = alpha_arr[p];

                gettimeofday(&t1, 0);
                get_grid(grid, partial_grid, data, pn, n, alpha);
                gettimeofday(&t2, 0);
                
                if (rank == 0) {
                    // elapsed time in milliseconds
                    time = (1.0e6 * (t2.tv_sec - t1.tv_sec) +
                            t2.tv_usec - t1.tv_usec) / 1000.0;
                    cout << n << " " << time << endl;
                    print_output(grid, n, p);
                }
            }

            delete [] grid;
            delete [] partial_grid;
        }
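The excerpt assumes the usual MPI scaffolding around it, roughly as sketched below (variable names follow the excerpt; 'N' is the particle count, and the input-reading step is only indicated since it is not listed in the report):

        #include <mpi.h>
        #include <sys/time.h>
        #include <iostream>
        using namespace std;

        const int N = 65536;   // number of input particles (M in the formulas)

        int main(int argc, char **argv) {
            MPI_Init(&argc, &argv);

            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            int    n_arr[8]     = { /* grid sizes used in the experiments */ };
            double alpha_arr[4] = { /* values of the alpha parameter */ };

            // x, y, z, density for each particle
            double *data = new double [N * 4];
            // ... read the input particles into data ...

            struct timeval t1, t2;
            double time;

            // ... the loop over n_arr shown above goes here ...

            delete [] data;
            MPI_Finalize();
            return 0;
        }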

Function 'get_grid':

        void get_grid(double *grid, double *partial_grid, double *data,
            int pn, int n, double alpha) {
            int rank;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);

            // Zero the block of rows owned by this process
            for (int i = 0; i < pn; i++)
                for (int j = 0; j < n; j++)
                    partial_grid[i * n + j] = 0;

            // This process computes grid rows [rank * pn, rank * pn + pn)
            for (int i = 0; i < pn; i++) {
                for (int j = 0; j < n; j++) {
                    double xij = get_coord((double)(i + rank * pn),
                            (double)n);
                    double zij = get_coord((double)j, (double)n);

                    // Accumulate the contribution of every input particle
                    for (int k = 0; k < N; k++) {
                        double x       = data[k * 4 + 0];
                        double y       = data[k * 4 + 1];
                        double z       = data[k * 4 + 2];
                        double density = data[k * 4 + 3];

                        partial_grid[i * n + j] += density * exp(-alpha *
                                sqrt((x - xij) * (x - xij) +
                                     (z - zij) * (z - zij) +
                                     (y - 0)   * (y - 0)));
                    }
                }
            }

            // Collect all row blocks on rank 0
            MPI_Gather(partial_grid, pn * n, MPI_DOUBLE, grid, pn * n,
                    MPI_DOUBLE, 0, MPI_COMM_WORLD);
        }
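The helper 'get_coord' is not listed in the report; judging from the grid formulas above, it maps a row or column index to a physical coordinate in $[-R, R]$, roughly as follows (a sketch, assuming 'R' is a global constant):

        // Map a grid index in [0, n - 1] to a coordinate in [-R, R],
        // matching x_ij = -R + 2R * i / (n_x - 1) from the formulas above
        double get_coord(double index, double n) {
            return -R + 2.0 * R * index / (n - 1.0);
        }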

Time Analysis

$$T_{paral}(N, P) = T_{arithm}(N, P) + T_{comm}(N, P)$$

$$T_{arithm}(N, P) = N \cdot \frac{N}{P} \cdot M \cdot \underbrace{\tau \cdot T}_{\text{cost of arithmetic operations inside the loop}}$$

$M = 65536$ --- number of particles in the input data.

$T$ --- number of arithmetic operations in the body of the innermost loop.

$\tau$ --- time for one arithmetic operation.

$$T_{comm}(N, P) = \left( \alpha + \frac{1}{\beta} \frac{N^2}{P} \right) \underbrace{\sqrt{P}}_{\text{torus}}$$

$\alpha$, $\beta$ --- latency and bandwidth constants of the interconnect (not the SPH parameter $\alpha$); the $\sqrt{P}$ factor accounts for the torus topology.

$$S(N, P) = \frac{T_{arithm}(N, 1)}{T_{paral}(N, P)}$$

$$E(N, P) = \frac{S(N, P)}{P}$$
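Substituting the model terms (with the serial time $T_{arithm}(N, 1) = N^2 M \tau T$ in the numerator) gives

$$S(N, P) = \frac{N^2 M \tau T}{\frac{N^2}{P} M \tau T + \left( \alpha + \frac{1}{\beta} \frac{N^2}{P} \right) \sqrt{P}},$$

so for a fixed grid size the $\sqrt{P}$ communication term eventually dominates and limits the achievable speedup.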

Figure: Constants found through fitting.

Speedup and Efficiency

Figure: Speedup and efficiency of the MPI implementation.

OpenACC Implementation

Code

        // Copy the particle data to the GPU once; copy the resulting
        // grid back to the host when the region ends
        #pragma acc data copyin(data[0:N*4]) copyout(grid[0:n*n])
        {
            // Distribute the grid cells over the GPU threads
            #pragma acc parallel loop
            for (int i = 0; i < n; i++) {
                #pragma acc loop
                for (int j = 0; j < n; j++) {
                    double xij = -R + (double)i / (double)(n - 1) * 2 * R;
                    double zij = -R + (double)j / (double)(n - 1) * 2 * R;

                    double cell_ij = 0;

                    // The particle loop is a reduction into the cell value
                    #pragma acc loop reduction(+:cell_ij)
                    for (int k = 0; k < N; k++) {
                        double x       = data[k * 4 + 0];
                        double y       = data[k * 4 + 1];
                        double z       = data[k * 4 + 2];
                        double density = data[k * 4 + 3];
                        cell_ij += density * exp(-alpha *
                                sqrt((x - xij) * (x - xij) +
                                     (z - zij) * (z - zij) +
                                     (y - 0)   * (y - 0)));
                    }
                    grid[i * n + j] = cell_ij;
                }
            }
        }

Compiling: Unified Memory


        pgc++ -lstdc++ -O2 -Wall -std=c++11 -acc -ta=nvidia:managed -Minfo=accel

The 'managed' option lets us use Unified Memory.

Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.
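With the 'managed' target the explicit data clauses also become optional, since the compiler places dynamic allocations in managed memory. A minimal sketch of the idea (an illustration, not the project code):

        #include <iostream>

        // With -ta=nvidia:managed, memory obtained via new/malloc is backed
        // by managed (unified) memory, so no copyin/copyout clauses are needed
        void zero_grid_managed(int n) {
            double *grid = new double [n * n];

            #pragma acc parallel loop collapse(2)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    grid[i * n + j] = 0.0;        // written on the GPU

            std::cout << grid[0] << std::endl;    // read back on the CPU
            delete [] grid;
        }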

Time Metrics and Comparison between MPI and OpenACC Implementations

Figure: Run time at maximum optimization for different values of $n_x \cdot n_z$, for the MPI and OpenACC implementations.

Figure: Run time at maximum optimization for small values of $n_x \cdot n_z$, for the MPI and OpenACC implementations.