Supercomputing Modelling Final Exam Project.
Download the report.
Check out the git repo!
Calculation of Density in the SPH Method

Comparison of the density distribution for different values of the $\alpha$ parameter. Density formula:

$$\rho_{ij} = \sum_{k = 0}^{M} \rho_k \, e^{-\alpha \sqrt{(x_{ij} - x_k)^2 + (y_{ij} - y_k)^2 + (z_{ij} - z_k)^2}}$$

$\rho_k, x_k, y_k, z_k$ — density and coordinates of the particles in the input data.
$M = 65536$ — number of particles in the input data.

Grid points:

$x_{ij} = -R + 2R \frac{i}{n_x - 1}$
$y_{ij} = 0$
$z_{ij} = -R + 2R \frac{j}{n_z - 1}$
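As a point of reference, the formula can be evaluated directly on the $y = 0$ slice. The sketch below is a minimal serial version, assuming the particles are stored as four doubles per particle ({x, y, z, density}), as in the code listings that follow; the function name and signature are illustrative, not part of the project code.

#include <cmath>

// Minimal serial sketch of the density formula on the y = 0 slice.
// data holds M particles as {x, y, z, density} quadruples,
// grid has n * n cells, R is the half-width of the domain.
void density_slice(double *grid, const double *data,
                   int n, int M, double R, double alpha) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double xij = -R + 2.0 * R * i / (n - 1);   // x_ij
            double zij = -R + 2.0 * R * j / (n - 1);   // z_ij, with y_ij = 0
            double rho = 0;
            for (int k = 0; k < M; k++) {
                double dx = data[k * 4 + 0] - xij;
                double dy = data[k * 4 + 1];           // y_ij = 0
                double dz = data[k * 4 + 2] - zij;
                rho += data[k * 4 + 3] *
                       std::exp(-alpha * std::sqrt(dx * dx + dy * dy + dz * dz));
            }
            grid[i * n + j] = rho;
        }
    }
}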
MPI Implementation

Code

Inside main:
for (int k = 0; k < 8; k++) {
    // Size of the total grid
    int n = n_arr[k];
    // Size of the partial grid for each process
    int pn = n / size;
    if (n % size != 0)
        pn++;
    double *grid, *partial_grid;
    // grid is padded to pn * size rows so that MPI_Gather receives
    // equal-sized blocks from every process
    grid = new double [pn * size * n];
    partial_grid = new double [pn * n];
    for (int p = 0; p < 4; p++) {
        double alpha = alpha_arr[p];
        gettimeofday(&t1, 0);
        get_grid(grid, partial_grid, data, pn, n, alpha);
        gettimeofday(&t2, 0);
        if (rank == 0) {
            // Elapsed time in milliseconds
            time = (1.0e6 * (t2.tv_sec - t1.tv_sec) +
                    t2.tv_usec - t1.tv_usec) / 1000.0;
            cout << n << " " << time << endl;
            print_output(grid, n, p);
        }
    }
    delete [] grid;
    delete [] partial_grid;
}
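The loop above uses rank, size, n_arr, alpha_arr, data, t1, t2 and time, which are set up elsewhere in main. A minimal sketch of that surrounding frame, assuming every process receives the full particle array via a broadcast (the actual project code may read or distribute the input differently), could look like this; the grid sizes and $\alpha$ values below are placeholders:

#include <mpi.h>
#include <sys/time.h>

const int N = 65536;   // number of particles in the input data

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n_arr[8] = {32, 64, 128, 256, 512, 1024, 2048, 4096};  // placeholder grid sizes
    double alpha_arr[4] = {0.5, 1.0, 2.0, 4.0};                // placeholder alpha values

    double *data = new double [N * 4];   // {x, y, z, density} per particle
    // ... rank 0 reads the input file into data, then shares it ...
    MPI_Bcast(data, N * 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    struct timeval t1, t2;
    double time;

    // ... the loop over k shown above goes here ...

    delete [] data;
    MPI_Finalize();
    return 0;
}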
Function ‘get_grid’:
void get_grid(double *grid, double *partial_grid, double *data,
              int pn, int n, double alpha) {
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    // Clear this process's slice of the grid
    for (int i = 0; i < pn; i++)
        for (int j = 0; j < n; j++)
            partial_grid[i * n + j] = 0;
    // Each process computes pn rows of the grid, offset by rank * pn
    for (int i = 0; i < pn; i++) {
        for (int j = 0; j < n; j++) {
            double xij = get_coord((double)(i + rank * pn),
                                   (double)n);
            double zij = get_coord((double)j, (double)n);
            // Sum the contribution of every input particle
            for (int k = 0; k < N; k++) {
                double x = data[k * 4 + 0];
                double y = data[k * 4 + 1];
                double z = data[k * 4 + 2];
                double density = data[k * 4 + 3];
                partial_grid[i * n + j] += density * exp(-alpha *
                    sqrt((x - xij) * (x - xij) +
                         (z - zij) * (z - zij) +
                         (y - 0) * (y - 0)));
            }
        }
    }
    // Collect all partial slices on rank 0
    MPI_Gather(partial_grid, pn * n, MPI_DOUBLE, grid, pn * n,
               MPI_DOUBLE, 0, MPI_COMM_WORLD);
}
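get_coord is not shown in the excerpt. Judging from the grid-point formula above and the coordinate expressions in the OpenACC version below, a plausible implementation is the following sketch, where R is assumed to be the project's global half-width of the grid:

// Assumed implementation of get_coord: maps a grid index i in [0, n-1]
// onto [-R, R], matching x_ij = -R + 2R * i / (n_x - 1).
double get_coord(double i, double n) {
    return -R + 2.0 * R * i / (n - 1.0);
}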
Time Analysis

$$T_{paral}(N, P) = T_{arithm}(N, P) + T_{comm}(N, P)$$

$$T_{arithm}(N, P) = N \cdot \frac{N}{P} \cdot M \cdot \underbrace{\tau \cdot T}_{\text{cost of arithmetic operations inside the loop}}$$

$M = 65536$ — number of particles in the input data.
$T$ — number of arithmetic operations.
$\tau$ — time for one arithmetic operation.

$$T_{comm}(N, P) = \left( \alpha + \frac{1}{\beta} \frac{N^2}{P} \right) \underbrace{\sqrt{P}}_{\text{torus}}$$

$$S(N, P) = \frac{T_{arithm}(N, 1)}{T_{paral}(N, P)}$$

$$E(N, P) = \frac{S(N, P)}{P}$$

The constants $\tau$, $T$, $\alpha$ and $\beta$ were found through fitting.

Speedup and Efficiency

Speedup and efficiency of the MPI implementation.
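To illustrate how the model is used once the constants are fitted, the sketch below evaluates $S(N, P)$ and $E(N, P)$ for a range of $P$. The constant values ($\tau$, $T$, $\alpha$, $\beta$) and the grid size are placeholders chosen for illustration, not the fitted values from the report.

#include <cmath>
#include <cstdio>

// Placeholder model constants; the fitted values are not reproduced here.
const double tau   = 1e-9;   // time for one arithmetic operation, s
const double T_ops = 20;     // arithmetic operations inside the inner loop
const double lat   = 1e-6;   // latency term (alpha in the model), s
const double beta  = 1e9;    // bandwidth term, values per second
const int    M     = 65536;  // particles in the input data

int main() {
    double N = 1024;  // example grid size N = n_x = n_z
    for (int P = 1; P <= 64; P *= 2) {
        double t_serial = N * N * M * tau * T_ops;            // T_arithm(N, 1)
        double t_arithm = t_serial / P;                       // T_arithm(N, P)
        double t_comm   = (lat + N * N / (P * beta)) * std::sqrt(P); // T_comm(N, P)
        double t_paral  = t_arithm + t_comm;                  // T_paral(N, P)
        double S = t_serial / t_paral;                        // speedup
        double E = S / P;                                     // efficiency
        std::printf("P = %2d  S = %.3f  E = %.3f\n", P, S, E);
    }
    return 0;
}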
OpenACC Implementation

Code
#pragma acc data copyin(data[0:N*4]) copyout(grid[0:n*n])
{
    #pragma acc parallel loop
    for (int i = 0; i < n; i++) {
        #pragma acc loop
        for (int j = 0; j < n; j++) {
            double xij = -R + (double)i / (double)(n - 1) * 2 * R;
            double zij = -R + (double)j / (double)(n - 1) * 2 * R;
            double cell_ij = 0;
            #pragma acc loop reduction(+:cell_ij)
            for (int k = 0; k < N; k++) {
                double x = data[k * 4 + 0];
                double y = data[k * 4 + 1];
                double z = data[k * 4 + 2];
                double density = data[k * 4 + 3];
                cell_ij += density * exp(-alpha *
                    sqrt((x - xij) * (x - xij) +
                         (z - zij) * (z - zij) +
                         (y - 0) * (y - 0)));
            }
            grid[i * n + j] = cell_ij;
        }
    }
}
Compiling:

pgc++ -lstdc++ -O2 -Wall -std=c++11 -acc -ta=nvidia:managed -Minfo=accel

Unified Memory
The ‘managed’ option lets us use Unified Memory.
Unified Memory creates a pool of managed memory that is shared between
the CPU and GPU, bridging the CPU-GPU divide. Managed memory is
accessible to both the CPU and GPU using a single pointer. The key is
that the system automatically migrates data allocated in Unified Memory
between host and device so that it looks like CPU memory to code running
on the CPU, and like GPU memory to code running on the GPU.
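A toy example (not taken from the project code) of what managed memory buys in practice: arrays obtained with new are migrated automatically, so the kernel below needs no explicit copyin/copyout clauses when compiled with -ta=nvidia:managed.

#include <iostream>

// Toy example: with managed memory, a and b (heap-allocated) are visible
// to both CPU and GPU through the same pointers, so the parallel loop
// below carries no data clauses; the runtime migrates the arrays on demand.
int main() {
    const int n = 1 << 20;
    double *a = new double [n];
    double *b = new double [n];
    for (int i = 0; i < n; i++)
        a[i] = i;

    #pragma acc parallel loop   // no copyin/copyout needed with managed memory
    for (int i = 0; i < n; i++)
        b[i] = 2.0 * a[i];

    std::cout << b[n - 1] << std::endl;   // result accessed again on the CPU
    delete [] a;
    delete [] b;
    return 0;
}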
Time Metrics and Comparison between MPI and OpenACC Implementations

Dependence of the run time at maximum optimization on different values of $n_x \cdot n_z$ for the MPI and OpenACC implementations.

Dependence of the run time at maximum optimization on small values of $n_x \cdot n_z$ for the MPI and OpenACC implementations.