Free AI web copilot to create summaries, insights and extended knowledge, download it at here

19932

Abstract

R.mtx nlpkkt80.mtx Cube_Coup_dt0.mtx roadNet-PA.mtx ML_Laplace.mtx bcsstk17.mtx mhda416.mtx af_1_k101.mtx thermal1.mtx thermomech_TK.mtx rma10.mtx cage4.mtx cant.mtx dc1.mtx raefsky2.mtx lpi_itest6.mtx rdist2.mtx mcfe.mtx olm1000.mtx lung2.mtx webbase-1M.mtx mhd4800a.mtx west2021.mtx thermal2.mtx adder_dcop_32.mtx mac_econ_fwd500.mtx FEM_3D_thermal1.mtx amazon0302.mtx cop20k_A.mtx olafu.mtx can_24.mtx af23560.mtx -->161070 8185136 50.8 92 False
1062400 28704672 27.0 28 True
2164760 127206144 58.8 68 True
1090920 3083796 2.8 9 True
377002 27689972 73.4 74 False
10974 428650 39.1 150 True
416 8562 20.6 33 False
503625 17550675 34.8 35 True
82654 574458 7.0 11 True
102158 711558 7.0 10 True
46835 2374001 50.7 145 False
9 49 5.4 6 False
62451 4007383 64.2 78 True
116835 766396 6.6 114190 False
3242 294276 90.8 108 False
11 29 2.6 3 False
3198 56934 17.8 61 False
765 24382 31.9 81 False
1000 3996 4.0 6 False
109460 492564 4.5 8 False
1000005 3105536 3.1 4700 False
4800 102252 21.3 33 False
2021 7353 3.6 12 False
1228045 8580313 7.0 11 True
1813 11246 6.2 100 False
206500 1273389 6.2 44 False
17880 430740 24.1 27 False
262111 1234877 4.7 5 False
121192 2624331 21.7 81 True
16146 1015156 62.9 89 True
24 160 6.7 9 True
23560 484256 20.6 21 False </pre></div>There is a lot of diversity and sparsity pattern in the matrix which will be helpful to compare and understand the behavior and benefit of each storage forma in CUDA and OpenMP. The collection of matrices also includes some large matrices such as Cube_Coup_dt0.mtx which has more than 127 million non-zero elements and more than one million rows. In the result part for clarity result several matrices are selected and will be presented as the result; the choice was made to have a diversity of sparsity patterns.To understand the sparsity difference and specificity of each matrix we can visualize some of them.<figure id="a0e8"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*0AGdZIA24Zkjuf6QyuxjGg.png"><figcaption>Figure 8: Sparsity of Bcsstk17.mtx</figcaption></figure><figure id="e2ac"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*x6IIBgk2Cvo1Rp7KNg8cqQ.png"><figcaption>Figure 9: Sparsity of cavity10.mtx, Figure 10: Sparsity of can_24.mtx, Figure 11: Sparsity of cage4.mtx</figcaption></figure><h2 id="e4ee">Hardware description</h2>The HPC framework employed is Crescent, a research establishment located at Cranfield University. Our study will rely on the Tesla K40m as the primary device. The device comprises 15 multiprocessors, each featuring 192 CUDA cores, totaling 2,880 CUDA cores. The device possesses a 384-bit memory bus width, rendering a total of 12,207 MBytes of global memory accessible and 32 registers. To process large datasets, the Tesla K40m can handle different dimensions of BlockSize. Furthermore, the device can support a maximum number of threads per block of 1024 and has a warp size of 32.<h2 id="2b7f">Experiments</h2>Numerous configurations, loop unrolling techniques, and parallelization methods will be carried out in the following result part to analyze the effect and understand the impact on memory, cache, and efficiency of each of them. The CSR and ELLPACK format will be compared during the runs. The computation speed will be computed using GFlops, which is a performance metric in term of floating points operations, the GFlops are calculated as follow:<figure id="8b7e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*a0Y0VVRt81Y5lbi-Incuyg.png"><figcaption></figcaption></figure>On CPU the SpMV will be computed using OpenMP on 1,2,4,8 and 16 threads for several unrolling methods: horizontal unrolling — where one row is processed at a time by accessing several consecutive elements at the same time — and vertical unrolling — where several rows are processed at the same time in parallel. Each unrolling method will be tested with different levels of parallelism, 2,4,8, and 16.On GPUs the SpMV will be computed using CUDA on a single GPU for several parallelization methods: 1D Block and 1D grid, 2D Block and 1D grid where each row of the block computes a single line in parallel. For ELLPACK format, additional analysis will be carried out where JA and AS 2D arrays are transposed, and each row is computed using a single thread using a 1D Block and 1D grid. The transposed array will allow the parallelization to have a place without any memory access conflict between threads — because C languages are row access majors.<figure id="41f3"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*uY27lm5_6uM8LELJRa7BXA.png"><figcaption>Figure 12: Row and Column major access to memory</figcaption></figure><h1 id="ae26">Results</h1><h2 id="4eb9">OpenMP</h2><h2 id="ed04">Algorithms and storage formats</h2>As described in the previous part, several methods of unrolling were implemented — horizontal and vertical unrolling — using different levels of parallelization — levels 2, 4, 8, and 16. The graphs Figure 13–15 and the heatmap Figure 14 show the implementation result using OpenMP.<figure id="6ae3"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*mkrp9x-BXSKNuBH07szLfQ.png"><figcaption>Figure 13: OpenMP, the effect of Algorithms on GFlops for 16 Threads</figcaption></figure><figure id="4bfb"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*UGroV98eN7iTVtM90crfgQ.png"><figcaption>Figure 15: OpenMP, Average GFlops for 16 Threads, Figure 14: OpenMP, Heatmap of GFlops per Algorithm</figcaption></figure>These results clearly show us that the CSR format is more efficient, and the matrices have higher GFlops using CSR than ELLPACK. Moreover, there is no significant improvement when increasing the level of unrolling. The best algorithm found is with an unrolling of level 2. This can be explained by the limit of registers inside the CPU, the more unrolling we want to implement the more register we will use to store the temporary variable to execute the computation. If the limit of the register is hit, there will not be any significant improvement in time, and it can even lead to a decrease in the efficiency of the kernel by using less efficient memory allocation and access.Secondly, horizontal unrolling has better results than vertical unrolling where several rows are accessed at the same iteration. This can be explained by the behavior of the cache inside the hardware and the access of elements in C languages.One interesting notice is the fact that with some matrices there is a significant difference between the ELLPACK and CSR formats. For example, with dc1.mtx the average CSR GFlops is 6.01 whereas with ELLPACK it is 0.83. These significant differences came from the sparsity of this matrix with an NZ_max of 114190 and an Avg-N of 6.6 as described in Table 1, making the storage in ELLPACK format less efficient as it needs to store not essential elements.<h2 id="f984">Threads and NZ elements</h2>The effect of NZ elements and the number of threads used with OpenMP were analyzed for a second time, as represented by the result shown in Figure 16. The result as expected shows us that the GFlops grows as the number of threads is growing. The computation performs more efficient results with more threads. There is also a light correlation between the GFlops and the number of NZ elements independently of the sparsity of the matrix.Finally, the best algorithm for unrolling is a level 2 horizontal unrolling followed by a level two vertical unrolling as represented in Table 2. Note that all the source code was compiled using the flag -O3 which indicates to the compiler to optimize the code as much as possible. The compiler can then execute additional unrolling methods when compiling the source code, this can also explain the performance of the code without unrolling or the homogeneous result with different unrolling techniques as the compiler rearranges the source code at compile time to optimize it.<figure id="5f00"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*OE5o-3VpPhDfxrWGtTwaZA.png"><figcaption>Figure 16: OpenMP, Effect of Threads and NZ Elements</figcaption></figure><div id="77d6"><pre>Table 2: OpenMP, Best unrolling method per number of threads

Nbr Thread 1 2 4 8 16
------------ ---- ---- ---- --- ---- C_2H 12 11 11 8 7
C_2V 2 4 3 3 2
C_4H 2 2 3 2 2
C_4V 1 3 0 0 2 </pre></div>Finally, the matrices such as Cube_Coup_dt0, cavity10, ML_Laplace, and af_1_k101 achieved around 5 speedups for 16 threads using OpenMP and an unrolling horizontal of level 2, as represented in Figure 17.<figure id="2b3b"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*cfa15QK5v1buE_yLNvbHIQ.png"><figcaption>Figure 17: OpenMP Speedups</figcaption></figure><h2 id="f57e">CUDA</h2><h2 id="52c8">Algorithms and storage formats</h2>For CUDA several methods of parallelization were tested, in the first place the matrices were computed using 1D and 2D Block inside a 1D Grid where each thread in the row of the block computes the same row in the matrix and then reduces the result into the final vector. Each method was tested with different methods of unrolling and using CSR and ELLPACK format. The result of the following tests is shown in Figures 18 & 20 and on the heatmap Figure 19.These results clearly show us the efficiency of using a 2D block compared to the 1D Block and that for all types of matrix sparsity, symmetric or not. As explained in the methodology part unrolling is not efficient in CUDA and we didn’t achieve a better result by using an unrolling methodology. One of the interesting distinctions is that the ELLPACK format is more efficient than the CSR format, which wasn’t the case for OpenMP<figure id="5566"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*ME0kiXB28vgkhYvbDvhEUw.png"><figcaption>Figure 18: CUDA, the effect of Algorithms on GFlops for a BlockSize of 16</figcaption></figure><figure id="8ba6"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*2t912fOiLBSu7vC7JWd5ng.png"><figcaption>Figure 20: CUDA, Average GFlops for a BlockSize of 16, Figure 19: CUDA, Heatmap of GFlops per Algorithm</figcaption></figure><h2 id="fed5">BlockSize and NZ elements</h2>The effect of NZ elements is the same as with OpenMP, there is a correlation between the NZ elements and the GFlops as represented in Figure 21. We can also observe the difference between several sizes of a block. The most efficient BlockSize is 16 and the worst 64, which is not initially the first thing in mind because naturally, we might think that with more threads, we will be able to distribute the work more evenly and be more efficient. But the reality here is that we need to reduce the result into a single vector so the reduction needs synchronization between threads which might lead to a less efficient algorithm for too many threads. Another thing to get in mind is the fact that with more threads the shared memory between threads is bigger.<figure id="859e"><img sr

Options

c="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*xe-VVKn6qSGQGKSggQ_3eQ.png"><figcaption>Figure 21: CUDA, Effect of BlockSize and NZ Elements</figcaption></figure><h2 id="f4b3">ELLPACK Transposed</h2>Finally, the last experiment was to transpose the JA and AS arrays of the ELLPACK format and try a different parallelization methodology using CUDA where only one thread will compute each row of the matrix. In this way, we can take advantage of the row access major of C++.<div id="964b"><pre>Table 3: CUDA, Comparison between ELLPACK transposed or not

                <span class="hljs-operator">|</span>           Not Transposed       <span class="hljs-operator">|</span> Transposed <span class="hljs-operator">|</span>

------------------- ------------ --------- --------- ----------- ----------- Matrix 1Thread Best 1D Best 2D 1Thread T Best
------------------- ------------ --------- --------- ----------- ----------- rdist2.mtx 0.76 2.65 4.95 1.39 Best 2D
cavity10.mtx 0.71 4.02 6.36 1.84 Best 2D
raefsky2.mtx 0.91 9.34 14.01 4.33 Best 2D
bcsstk17.mtx 1.52 5.79 12.80 4.96 Best 2D
lung2.mtx 4.19 1.17 7.64 11.87 1Thread T
thermal1.mtx 3.10 1.69 8.09 11.61 1Thread T
thermomech_TK.mtx 3.23 1.68 6.44 7.95 1Thread T
rma10.mtx 1.56 7.51 17.33 13.41 Best 2D
roadNet-PA.mtx 3.72 0.71 4.17 12.69 1Thread T
cant.mtx 1.51 8.74 18.90 18.38 Best 2D
PR02R.mtx 1.54 7.32 17.85 18.44 1Thread T
af_1_k101.mtx 1.78 6.65 20.40 25.34 1Thread T
ML_Laplace.mtx 1.59 10.17 23.08 24.82 1Thread T
nlpkkt80.mtx 2.01 5.96 16.81 30.97 1Thread T
Cube_Coup_dt0.mtx 1.59 9.84 21.24 26.59 1Thread T </pre></div>The results shown in Table 3 and Figure 22 are interesting because it shows that using only 1 Thread per row is not efficient at all, but when the arrays are transposed it became suddenly one of the most efficient methods. The difference in efficiently using the same amount of thread is impressive. By transposing the arrays, it’s in most cases more efficient to use only 1 thread by row than several.This drastic difference in efficiency, while using the same amount of threads, is because when the matrix is stored in the ELLPACK format, coefficients within the same column are stored adjacent to each other in memory. Therefore, to ensure consecutive threads access consecutive entries, the matrix should be transposed in memory. By transposing the arrays, the memory access pattern is optimized, leading to better cache utilization and memory coalescing. As a result, in most cases, it is more efficient to use only one thread per row when the matrix is transposed.In C++, memory allocation for 2D arrays is conventionally arranged in row-major order, such that each row of the array is stored in contiguous memory locations. However, when performing matrix-vector multiplication in CUDA, where each thread block is typically assigned to a set of rows in the matrix, conflicts may come if multiple threads attempt to access memory locations that are closely situated in memory. Such conflicts may arise if the non-zero elements in a row are scattered throughout the array instead of being located in a contiguous block.To arrange this issue, transposing the matrix can be beneficial. By transposing the matrix, the non-zero elements in each column are brought together into a contiguous block. This reduces the likelihood of thread conflicts, thereby enabling threads to access data with greater efficiency and minimizing the risk of stalling due to memory access. The reduced contention for memory access can lead to improved performance in CUDA, as thread conflicts can result in delays that can slow down the overall computation.<figure id="ef6a"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*m1-cpYwAheCQ73mup5M3zw.png"><figcaption>Figure 22: CUDA, Comparison of parallelization methods</figcaption></figure><h2 id="dcd3">CUDA vs OpenMP</h2>To end this result part the difference between CUDA and OpenMP was analyzed as presented in Table 4 and Figures 24 & 23. The result in the table and graphs are sorted by NZ elements because the result indicates that OpenMP is more efficient for smaller matrices and CUDA for big matrices with a large number of NZ as Cube_Coup_dt0.This difference between CUDA and OpenMP can be explained by the limited number of threads inside a CPU, so OpenMP is limited in its scaling methodology. However, the CPU inside a CPU is more powerful and more efficient. The paradigm in CUDA is the opposite as in a CPU where the goal is to parallelize between a huge number of threads less powerful than the CPU. In addition, the split between CUDA and OpenMP seems to be around 2 500 000 NZ elements, lower than that OpenMP is better, otherwise it is CUDA.<div id="403d"><pre>Table 4: CUDA vs OpenMP

   Matrix            NZ       CUDA    OpenMP    Best

------------------- ----------- ------- -------- -------- rdist2.mtx 56934 4.95 11.37 OpenMP
cavity10.mtx 76367 6.36 13.93 OpenMP
raefsky2.mtx 294276 14.01 16.79 OpenMP
bcsstk17.mtx 428650 12.80 21.40 OpenMP
lung2.mtx 492564 11.87 20.56 OpenMP
thermal1.mtx 574458 11.61 16.91 OpenMP
thermomech_TK.mtx 711558 7.95 11.57 OpenMP
rma10.mtx 2374001 17.33 27.93 OpenMP
roadNet-PA.mtx 3083796 12.69 2.09 CUDA
cant.mtx 4007383 18.90 8.16 CUDA
PR02R.mtx 8185136 18.44 10.03 CUDA
af_1_k101.mtx 17550675 25.34 9.72 CUDA
ML_Laplace.mtx 27689972 24.82 9.11 CUDA
nlpkkt80.mtx 28704672 30.97 6.93 CUDA
Cube_Coup_dt0.mtx 127206144 26.59 9.40 CUDA </pre></div><figure id="0a44"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*LQBqtCuGN4Dj5pmvgYnJ4A.png"><figcaption>Figure 24: CUDA vs OpenMP by matrix, Figure 23: CUDA vs OpenMP by NZ elements</figcaption></figure>One final observation is the logarithmic linear regression that can be observed with the CUDA result between the GFlops and the NZ elements in Figure 25. This relationship shows us the capability of CUDA to scale efficiently for larger matrices.<figure id="a3e9"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*8LXfoDRlhc9a2gC_3_zj0w.png"><figcaption>Figure 25: Linear regression on CUDA parallelization by NZ elements</figcaption></figure><h1 id="fac7">Conclusion</h1>To conclude, the results of the OpenMP implementation reveal that the Compressed Sparse Row (CSR) format outperforms the ELLPACK format, and that horizontal unrolling is more efficient than vertical unrolling. However, increasing the level of unrolling does not appear to have a significant impact on the efficiency of the kernel, as it may lead to suboptimal memory allocation and access. The GFlops exhibit a positive correlation with the number of threads utilized with OpenMP, and the impact of non-zero elements (NZ) is also evident. The optimal algorithm for unrolling is a level two horizontal unrolling, followed by a level two vertical unrolling, with some matrices achieving around 5 speedups for 16 threads using OpenMP and a level two horizontal unrolling.On the other hand, the CUDA implementation indicates that a two-dimensional (2D) block is more effective than a one-dimensional (1D) block and that the ELLPACK format is more efficient than the CSR format. However, unrolling is not efficient in CUDA, and the most effective BlockSize is 16, while the worst is 64. There is a correlation between the GFlops and the number of non-zero (NZ) elements, and the reduction necessitates synchronization between threads, which may result in an inefficient algorithm for too many threads. Lastly, by transposing the JA and AS arrays of the ELLPACK format and employing a distinct parallelization methodology using CUDA, a more efficient algorithm can be achieved.Find more of the code <a href="https://github.com/sferez/SSPP_Sparse_Matrix_CUDA">here</a>.<h1 id="d133">References</h1>[1] T. Davis. Wilkinson’s sparse matrix definition (2007)[2] Salvatore Filippone, Valeria Cardellini, Davide Barbieri Alessandro Fanfarillo (2017), Sparse Matrix-Vector Multiplication on GPGPUs[3] Nathan Bell, Michael Garland (2008), Efficient Sparse Matrix-Vector Multiplication on CUDA[4] John Mellor-Crummey, John Garvin (2003), Optimizing Sparse Matrix-Vector Product Computations Using Unroll and Jam[5] Salvatore Filippone (2023), Small Scale Parallel Programming, Cranfield University<h1 id="32af">Learn more</h1><a href="https://medium.com/@simeon.ferez/solving-the-traveling-salesman-problem-with-parallel-computing-305f8324515d">Solving the Traveling Salesman Problem with Parallel Computing</a><figure id="f7d1"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*6UPad3BCSuxOCNaE.png"><figcaption></figcaption></figure><a href="https://readmedium.com/multiple-linear-regression-gradient-descent-python-a19d5c41aeae">Multiple Linear Regression, Gradient Descent /w Python</a><figure id="c14f"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/0*NQOMEt5dmQphhzUG.png"><figcaption></figcaption></figure>© All rights reserved, April 2023, Siméon FEREZ</article></body>

Sparse Matrix-Vector Product: OpenMP vs CUDA on Hybrid Platforms

SMALL-SCALE PARALLEL PROGRAMMING

Dive into the world of small-scale parallel programming as we explore the performance of OpenMP and CUDA in handling sparse matrix-vector multiplication, uncovering the intricate relationship between programming models and matrix characteristics.

Literature review
Methodology
Results
Conclusion

Sparse matrix-vector multiplication (SpMV), a crucial operation in scientific computing, is a complex and dynamic field with many challenges that must be addressed to achieve optimal performance.

In this study, we examined the performance of OpenMP and CUDA for SpMV on a hybrid CPU and GPU platform. We implemented both programming models on a set of sparse matrices with varying sizes and densities.

Our findings revealed a relationship between the choice of programming model and the characteristics of the sparse matrix being used. Specifically, OpenMP outperformed CUDA on smaller matrices with a lower number of nonzero elements, whereas CUDA was more efficient for larger matrices with a higher density of nonzero elements.

Literature review

Sparse Matrices

Sparse matrix-vector multiplication (SpMV) is a fundamental operation that is extensively utilized in scientific and engineering domains, such as numerical simulations, machine learning, and data analysis. This operation involves multiplying a matrix that consists of zero elements by a vector, resulting in a linear combination of the matrix’s columns, weighted by the corresponding vector entries. The principal challenge in sparse matrix-vector multiplication is that the matrix is predominantly sparse, indicating that it contains a large number of zero elements.

As a matter of fact, a sparse matrix is defined by James Wilkinson [1] as follows:

“Any matrix with enough zeros that it pays to take advantage of them.”

In a numerical way it is translated by the number of Non-Zero elements (NZ) is way lower than the number of Rows by Columns: [2]

NZ ≪ M × N

Although this sparsity can be utilized to decrease computational cost, it also complicates the parallelization of the operation, as the workload is irregularly distributed across matrix elements.

In recent years, research has been devoted to the investigation and development of algorithms for sparse matrix-vector multiplication, with the SpMV kernel emerging as a preeminent and widely utilized approach. However, the growing scale of data sets in recent data-driven domains necessitates the exploration of more effective and scalable techniques. To address this challenge, parallel programming frameworks, notably OpenMP and CUDA, have been under research as a means of gaining superior performance in SpMV.

Sparse matrix computations are challenging, and the performance is often low on a computer cluster. In fact, sparse matrices necessitate an explicit representation of the coordinates of non-zero elements, which increases the memory bandwidth required to manipulate them during computation. Moreover, sparse matrix computations need less reuse of values than dense matrices, making it challenging to use registers and caches effectively. In addition, the loop structure of a sparse matrix-vector product computation is less regular than that of a dense matrix-vector product computation, leading to less efficient compiler-generated code. [4]

To address these challenges techniques have been explored, including the development of various sparse matrix representations, matrix reordering, the use of multiple representations in concert, cache and register blocking, and combinations of these approaches. By employing these techniques, it is possible to improve the performance of sparse matrix computations, thereby enabling their application in a broad range of scientific and engineering domains.

CUDA and OpenMP

Parallel programming frameworks, such as OpenMP and CUDA, have demonstrated significant potential in accelerating the performance of sparse matrix-vector multiplication. OpenMP is a shared-memory parallel programming model that provides an effective means of utilizing multicore processors in a shared memory environment, making it suitable for parallelism including sparse matrix-vector multiplication.

In contrast, CUDA is a parallel programming framework developed by NVIDIA, which is explicitly designed for use with GPUs, enabling the massive parallelism power of modern GPUs containing thousands of processing threads that can be efficiently used in a parallel manner. As Nathan Bell and Michael Garland [3] point out, the CUDA programming model can deliver high performance and scalability for SpMV, particularly for matrices of substantial size.

Previous studies have compared the performance of OpenMP and CUDA for sparse matrix-vector multiplication and concluded that CUDA is preferable for larger-scale problems, while OpenMP is more appropriate for smaller problems and for systems with relatively few cores.

The utilization of OpenMP and CUDA parallel programming frameworks proved to be effective in accelerating the performance of sparse matrix-vector multiplication. Selecting the optimal framework may depend on the specific requirements of the application, and a comprehensive understanding of the strengths and weaknesses of each framework is critical for achieving high performance and scalability.

Objectives

The objectives of the following reports are as follows:

Develop and test a parallel SpMV product kernel capable of computing y ← Ax, where A is a sparse matrix stored in either the CSR or ELLPACK format.
Parallelize the kernel to exploit available computing capabilities using both OpenMP and CUDA versions.
Develop auxiliary functions to preprocess the matrix data and represent it in the desired format.
Test the performance of the kernel on a specified set of matrices obtained from the Suite Sparse Matrix Collection.

Methodology

Matrix formats

During the project several format methods of the sparse matrix were used, notably to store the information on file, on data structure, and conduct the experiments. Note that from now each indexing will be C/C++ based which means that the first element will be indexed as 0 in an array.

COO

The Coordinate List (COO) format is a commonly used representation for sparse matrices. This format stores non-zero elements as a tuple of its row index, column index, and value as represented in Figure 1. The COO format is highly flexible and efficient for constructing and converting to other sparse matrix formats, which is why the matrices used in the following project are stored in COO format on files. When processing the matrices, the COO format will then be translated into CSR or ELLPACK which will be stored in different data structures inside the computer cluster. The memory used to store a sparse matrix in a COO format is three arrays of NZ elements.

CSR

The Compressed Storage by Row (CSR) format is a used technique for representing sparse matrices and improves computation capabilities. It relies on three one-dimensional arrays to store non-zero matrix values, column indices, and row pointers as represented in Figure 2. This format is optimized for parallel computation through OpenMP and CUDA frameworks, making it one of the preferred methods for sparse matrix-vector multiplication.

Despite its advantages, it’s important to note that the CSR format may not always be the most efficient representation for all types of sparse matrices. Irregular sparsity patterns may be problematic.

The matrix-vector product in CSR format is computed as follows with AS, JA, and IRP representing the sparse matrix, M the number of rows, x the input vector, y the output vector:

Figure 3: Pseudo-code of SpMV in CSR format:

function MatrixMultiply(AS: array of double, JA: array of int, IRP: array of int, M: int, x: array of double, y: array of double):
    for i from 0 to M-1 do
        t = 0
        for j from IRP[i] to IRP[i+1]-1 do
            t = t + x[JA[j]] * AS[j]
        end for
        y[i] = t
    end for
end function

ELLPACK

The ELLPACK sparse matrix representation has gained wide usage thanks to its storage and computational efficiency. Within the ELLPACK format, the matrix is stored in arrays that possess two dimensions. Each row of the array contains a set number of non-zero elements, with the non-zero values being stored in the values array, and the corresponding column indices being stored in the column index array as represented in Figure 4. The void entries in the matrix are filled with zeroes in AS and -1 in JA. This format is optimized for parallel computation, particularly for matrices with uniform sparsity patterns, and is well-suited for use with CUDA frameworks.

Compared to the CRS format, the ELLPACK format provides a fixed-length representation for each row, which can significantly improve computational efficiency by using caches and registers. Moreover, the ELLPACK format is more memory-efficient compared to the CSR format, particularly for matrices with uniform sparsity patterns.

However, it’s important to note that the ELLPACK format may not be the most efficient representation for matrices with irregular sparsity patterns. This may result in significant storage overhead and inefficient memory access. Additionally, the fixed-length representation for each row may not be optimal for matrices with variable sparsity patterns, where a lot of unnecessary data will be stored.

The matrix-vector product in ELLPACK format is computed as follows with AS and JA representing the sparse matrix, M the number of rows, max_nz the max number of non-zeros elements by row x the input vector, y the output vector:

Figure 5: Pseudo-code of SpMV in ELLPACK format

function MatrixMultiply(JA: array of int, AS: array of double, M: int, max_nz: int, x: array of double, y: array of double):
    for i from 0 to M-1 do
        t0 = 0.0
        for j from 0 to max_nz-1 do
            if JA[i][j] == -1 then
                break
            end if
            t0 = t0 + AS[i][j] * x[JA[i][j]]
        end for
        y[i] = t0
    end for
end function

Loop unrolling

Loop unrolling is a technique employed by compilers or directly inside the source code, that aims to reduce the execution time of programs by decreasing the number of loop executions. By replicating the loop body and increasing the loop counter by a larger value, it enables multiple iterations of the loop to be executed in one single cycle, mitigating the overhead of loop control instructions. Loop unrolling can be particularly advantageous on computer architectures that lack parallelism, such as CPUs with a limited number of threads.

The benefits of loop unrolling are numerous. By increasing the amount of work done per iteration and decreasing the number of iterations, loop unrolling can improve the utilization of the program’s instruction cache which contribute to faster execution times.

However, while loop unrolling can be a highly effective optimization technique for sequential code on CPUs with limited parallelism, it is not the optimal choice for parallel code operating on GPUs. In these situations, strategies like memory coalescing, thread blocking, and shared memory are utilized to improve performance on GPUs by exploiting parallelism on a larger scale. It is worth noting that loop unrolling increases the use of register on the architecture, which will probably create a rooftop of the max unrolling that is worth coding on specific hardware.

CUDA

The CUDA parallelization paradigm is founded on exploiting the large architecture of threads that the GPU offers. This is accomplished by launching several threads that operate simultaneously on the GPU. Each thread undertakes a specific computation on a small subset of data, which normally consists of a singular element or a diminutive block of elements derived from the input data. The paradigm of CUDA parallelization refers to parallelizing the problem on the most number of threads.

The CUDA programming paradigm provides a technique for organizing threads into a dual-tier hierarchy comprising thread blocks and grids. A block refers to a group of threads that can cooperate and synchronize through the usage of fast shared memory. A grid is an aggregation of blocks that operate independently and can only communicate through global memory. The most prevalent parallelization approach is 1D block and grid parallelization, where the data is divided into one-dimensional arrays, and a one-dimensional block and grid structure is utilized to parallelize computation, as represented in Figure 6.

Figure 6: 1D Grid of 1D Blocks parallelization in CUDA, Credit: Andreajeka

In certain circumstances, it may be more efficacious to employ a 2D block and grid structure, which could optimize memory access patterns and reduce thread divergence. 2D block and grid parallelization is especially useful in the field of image processing and other applications that operate on two-dimensional data structures, as represented in Figure 7.

Figure 7: 2D Grid of 2D Blocks parallelization in CUDA, Credit: Andreajeka

CUDA utilizes not only blocks and grids but also warps to execute threads in parallel on a singular processing unit. A warp comprises threads that are executed in unison, and the number of threads present in a warp typically stands at 32, based on the GPU architecture, it’s the smallest unit of work in CUDA. To ensure optimal performance, minimizing warp divergence is crucial. This phenomenon takes place when threads within a warp follow varying paths in the program flow, leading to some threads being idle while others are still executing. As a result, the execution time tends to prolong, thereby decreasing the overall performance.

Experimental Evaluation

The experiments will be carried out on several matrices of the SuiteSparse Matrix Collection of the University of Florida (https://sparse.tamu.edu/). The set of matrices analyzed is the following:

Table 1: Matrix analysis

      Matrix Name          M         NZ       Avg_NZ   Max_NZ   Symmetric  
 --------------------- --------- ----------- -------- -------- ----------- 
  cavity10.mtx             2597       76367     29.4       62   False      
  PR02R.mtx              161070     8185136     50.8       92   False      
  nlpkkt80.mtx          1062400    28704672     27.0       28   True       
  Cube_Coup_dt0.mtx     2164760   127206144     58.8       68   True       
  roadNet-PA.mtx        1090920     3083796      2.8        9   True       
  ML_Laplace.mtx         377002    27689972     73.4       74   False      
  bcsstk17.mtx            10974      428650     39.1      150   True       
  mhda416.mtx               416        8562     20.6       33   False      
  af_1_k101.mtx          503625    17550675     34.8       35   True       
  thermal1.mtx            82654      574458      7.0       11   True       
  thermomech_TK.mtx      102158      711558      7.0       10   True       
  rma10.mtx               46835     2374001     50.7      145   False      
  cage4.mtx                   9          49      5.4        6   False      
  cant.mtx                62451     4007383     64.2       78   True       
  dc1.mtx                116835      766396      6.6   114190   False      
  raefsky2.mtx             3242      294276     90.8      108   False      
  lpi_itest6.mtx             11          29      2.6        3   False      
  rdist2.mtx               3198       56934     17.8       61   False      
  mcfe.mtx                  765       24382     31.9       81   False      
  olm1000.mtx              1000        3996      4.0        6   False      
  lung2.mtx              109460      492564      4.5        8   False      
  webbase-1M.mtx        1000005     3105536      3.1     4700   False      
  mhd4800a.mtx             4800      102252     21.3       33   False      
  west2021.mtx             2021        7353      3.6       12   False      
  thermal2.mtx          1228045     8580313      7.0       11   True       
  adder_dcop_32.mtx        1813       11246      6.2      100   False      
  mac_econ_fwd500.mtx    206500     1273389      6.2       44   False      
  FEM_3D_thermal1.mtx     17880      430740     24.1       27   False      
  amazon0302.mtx         262111     1234877      4.7        5   False      
  cop20k_A.mtx           121192     2624331     21.7       81   True       
  olafu.mtx               16146     1015156     62.9       89   True       
  can_24.mtx                 24         160      6.7        9   True       
  af23560.mtx             23560      484256     20.6       21   False

There is a lot of diversity and sparsity pattern in the matrix which will be helpful to compare and understand the behavior and benefit of each storage forma in CUDA and OpenMP. The collection of matrices also includes some large matrices such as Cube_Coup_dt0.mtx which has more than 127 million non-zero elements and more than one million rows. In the result part for clarity result several matrices are selected and will be presented as the result; the choice was made to have a diversity of sparsity patterns.

To understand the sparsity difference and specificity of each matrix we can visualize some of them.

Figure 9: Sparsity of cavity10.mtx, Figure 10: Sparsity of can_24.mtx, Figure 11: Sparsity of cage4.mtx

Hardware description

The HPC framework employed is Crescent, a research establishment located at Cranfield University. Our study will rely on the Tesla K40m as the primary device. The device comprises 15 multiprocessors, each featuring 192 CUDA cores, totaling 2,880 CUDA cores. The device possesses a 384-bit memory bus width, rendering a total of 12,207 MBytes of global memory accessible and 32 registers. To process large datasets, the Tesla K40m can handle different dimensions of BlockSize. Furthermore, the device can support a maximum number of threads per block of 1024 and has a warp size of 32.

Experiments

Numerous configurations, loop unrolling techniques, and parallelization methods will be carried out in the following result part to analyze the effect and understand the impact on memory, cache, and efficiency of each of them. The CSR and ELLPACK format will be compared during the runs. The computation speed will be computed using GFlops, which is a performance metric in term of floating points operations, the GFlops are calculated as follow:

On CPU the SpMV will be computed using OpenMP on 1,2,4,8 and 16 threads for several unrolling methods: horizontal unrolling — where one row is processed at a time by accessing several consecutive elements at the same time — and vertical unrolling — where several rows are processed at the same time in parallel. Each unrolling method will be tested with different levels of parallelism, 2,4,8, and 16.

On GPUs the SpMV will be computed using CUDA on a single GPU for several parallelization methods: 1D Block and 1D grid, 2D Block and 1D grid where each row of the block computes a single line in parallel. For ELLPACK format, additional analysis will be carried out where JA and AS 2D arrays are transposed, and each row is computed using a single thread using a 1D Block and 1D grid. The transposed array will allow the parallelization to have a place without any memory access conflict between threads — because C languages are row access majors.

Figure 12: Row and Column major access to memory

Results

OpenMP

Algorithms and storage formats

As described in the previous part, several methods of unrolling were implemented — horizontal and vertical unrolling — using different levels of parallelization — levels 2, 4, 8, and 16. The graphs Figure 13–15 and the heatmap Figure 14 show the implementation result using OpenMP.

Figure 13: OpenMP, the effect of Algorithms on GFlops for 16 Threads

Figure 15: OpenMP, Average GFlops for 16 Threads, Figure 14: OpenMP, Heatmap of GFlops per Algorithm

These results clearly show us that the CSR format is more efficient, and the matrices have higher GFlops using CSR than ELLPACK. Moreover, there is no significant improvement when increasing the level of unrolling. The best algorithm found is with an unrolling of level 2. This can be explained by the limit of registers inside the CPU, the more unrolling we want to implement the more register we will use to store the temporary variable to execute the computation. If the limit of the register is hit, there will not be any significant improvement in time, and it can even lead to a decrease in the efficiency of the kernel by using less efficient memory allocation and access.

Secondly, horizontal unrolling has better results than vertical unrolling where several rows are accessed at the same iteration. This can be explained by the behavior of the cache inside the hardware and the access of elements in C languages.

One interesting notice is the fact that with some matrices there is a significant difference between the ELLPACK and CSR formats. For example, with dc1.mtx the average CSR GFlops is 6.01 whereas with ELLPACK it is 0.83. These significant differences came from the sparsity of this matrix with an NZ_max of 114190 and an Avg-N of 6.6 as described in Table 1, making the storage in ELLPACK format less efficient as it needs to store not essential elements.

Threads and NZ elements

The effect of NZ elements and the number of threads used with OpenMP were analyzed for a second time, as represented by the result shown in Figure 16. The result as expected shows us that the GFlops grows as the number of threads is growing. The computation performs more efficient results with more threads. There is also a light correlation between the GFlops and the number of NZ elements independently of the sparsity of the matrix.

Finally, the best algorithm for unrolling is a level 2 horizontal unrolling followed by a level two vertical unrolling as represented in Table 2. Note that all the source code was compiled using the flag -O3 which indicates to the compiler to optimize the code as much as possible. The compiler can then execute additional unrolling methods when compiling the source code, this can also explain the performance of the code without unrolling or the homogeneous result with different unrolling techniques as the compiler rearranges the source code at compile time to optimize it.

Figure 16: OpenMP, Effect of Threads and NZ Elements

Table 2: OpenMP, Best unrolling method per number of threads

  Nbr Thread   1    2    4    8   16  
 ------------ ---- ---- ---- --- ---- 
  C_2H         12   11   11   8    7  
  C_2V          2    4    3   3    2  
  C_4H          2    2    3   2    2  
  C_4V          1    3    0   0    2

Finally, the matrices such as Cube_Coup_dt0, cavity10, ML_Laplace, and af_1_k101 achieved around 5 speedups for 16 threads using OpenMP and an unrolling horizontal of level 2, as represented in Figure 17.

CUDA

Algorithms and storage formats

For CUDA several methods of parallelization were tested, in the first place the matrices were computed using 1D and 2D Block inside a 1D Grid where each thread in the row of the block computes the same row in the matrix and then reduces the result into the final vector. Each method was tested with different methods of unrolling and using CSR and ELLPACK format. The result of the following tests is shown in Figures 18 & 20 and on the heatmap Figure 19.

These results clearly show us the efficiency of using a 2D block compared to the 1D Block and that for all types of matrix sparsity, symmetric or not. As explained in the methodology part unrolling is not efficient in CUDA and we didn’t achieve a better result by using an unrolling methodology. One of the interesting distinctions is that the ELLPACK format is more efficient than the CSR format, which wasn’t the case for OpenMP

Figure 18: CUDA, the effect of Algorithms on GFlops for a BlockSize of 16

Figure 20: CUDA, Average GFlops for a BlockSize of 16, Figure 19: CUDA, Heatmap of GFlops per Algorithm

BlockSize and NZ elements

The effect of NZ elements is the same as with OpenMP, there is a correlation between the NZ elements and the GFlops as represented in Figure 21. We can also observe the difference between several sizes of a block. The most efficient BlockSize is 16 and the worst 64, which is not initially the first thing in mind because naturally, we might think that with more threads, we will be able to distribute the work more evenly and be more efficient. But the reality here is that we need to reduce the result into a single vector so the reduction needs synchronization between threads which might lead to a less efficient algorithm for too many threads. Another thing to get in mind is the fact that with more threads the shared memory between threads is bigger.

ELLPACK Transposed

Finally, the last experiment was to transpose the JA and AS arrays of the ELLPACK format and try a different parallelization methodology using CUDA where only one thread will compute each row of the matrix. In this way, we can take advantage of the row access major of C++.

Table 3: CUDA, Comparison between ELLPACK transposed or not

                    |           Not Transposed       | Transposed |                                             
 ------------------- ------------ --------- --------- ----------- ----------- 
  Matrix              1Thread      Best 1D   Best 2D   1Thread T   Best       
 ------------------- ------------ --------- --------- ----------- -----------
  rdist2.mtx          0.76         2.65      4.95      1.39        Best 2D    
  cavity10.mtx        0.71         4.02      6.36      1.84        Best 2D    
  raefsky2.mtx        0.91         9.34      14.01     4.33        Best 2D    
  bcsstk17.mtx        1.52         5.79      12.80     4.96        Best 2D    
  lung2.mtx           4.19         1.17      7.64      11.87       1Thread T  
  thermal1.mtx        3.10         1.69      8.09      11.61       1Thread T  
  thermomech_TK.mtx   3.23         1.68      6.44      7.95        1Thread T  
  rma10.mtx           1.56         7.51      17.33     13.41       Best 2D    
  roadNet-PA.mtx      3.72         0.71      4.17      12.69       1Thread T  
  cant.mtx            1.51         8.74      18.90     18.38       Best 2D    
  PR02R.mtx           1.54         7.32      17.85     18.44       1Thread T  
  af_1_k101.mtx       1.78         6.65      20.40     25.34       1Thread T  
  ML_Laplace.mtx      1.59         10.17     23.08     24.82       1Thread T  
  nlpkkt80.mtx        2.01         5.96      16.81     30.97       1Thread T  
  Cube_Coup_dt0.mtx   1.59         9.84      21.24     26.59       1Thread T

The results shown in Table 3 and Figure 22 are interesting because it shows that using only 1 Thread per row is not efficient at all, but when the arrays are transposed it became suddenly one of the most efficient methods. The difference in efficiently using the same amount of thread is impressive. By transposing the arrays, it’s in most cases more efficient to use only 1 thread by row than several.

This drastic difference in efficiency, while using the same amount of threads, is because when the matrix is stored in the ELLPACK format, coefficients within the same column are stored adjacent to each other in memory. Therefore, to ensure consecutive threads access consecutive entries, the matrix should be transposed in memory. By transposing the arrays, the memory access pattern is optimized, leading to better cache utilization and memory coalescing. As a result, in most cases, it is more efficient to use only one thread per row when the matrix is transposed.

In C++, memory allocation for 2D arrays is conventionally arranged in row-major order, such that each row of the array is stored in contiguous memory locations. However, when performing matrix-vector multiplication in CUDA, where each thread block is typically assigned to a set of rows in the matrix, conflicts may come if multiple threads attempt to access memory locations that are closely situated in memory. Such conflicts may arise if the non-zero elements in a row are scattered throughout the array instead of being located in a contiguous block.

To arrange this issue, transposing the matrix can be beneficial. By transposing the matrix, the non-zero elements in each column are brought together into a contiguous block. This reduces the likelihood of thread conflicts, thereby enabling threads to access data with greater efficiency and minimizing the risk of stalling due to memory access. The reduced contention for memory access can lead to improved performance in CUDA, as thread conflicts can result in delays that can slow down the overall computation.

Figure 22: CUDA, Comparison of parallelization methods

CUDA vs OpenMP

To end this result part the difference between CUDA and OpenMP was analyzed as presented in Table 4 and Figures 24 & 23. The result in the table and graphs are sorted by NZ elements because the result indicates that OpenMP is more efficient for smaller matrices and CUDA for big matrices with a large number of NZ as Cube_Coup_dt0.

This difference between CUDA and OpenMP can be explained by the limited number of threads inside a CPU, so OpenMP is limited in its scaling methodology. However, the CPU inside a CPU is more powerful and more efficient. The paradigm in CUDA is the opposite as in a CPU where the goal is to parallelize between a huge number of threads less powerful than the CPU. In addition, the split between CUDA and OpenMP seems to be around 2 500 000 NZ elements, lower than that OpenMP is better, otherwise it is CUDA.

Table 4: CUDA vs OpenMP

       Matrix            NZ       CUDA    OpenMP    Best   
 ------------------- ----------- ------- -------- -------- 
  rdist2.mtx              56934    4.95    11.37   OpenMP  
  cavity10.mtx            76367    6.36    13.93   OpenMP  
  raefsky2.mtx           294276   14.01    16.79   OpenMP  
  bcsstk17.mtx           428650   12.80    21.40   OpenMP  
  lung2.mtx              492564   11.87    20.56   OpenMP  
  thermal1.mtx           574458   11.61    16.91   OpenMP  
  thermomech_TK.mtx      711558    7.95    11.57   OpenMP  
  rma10.mtx             2374001   17.33    27.93   OpenMP  
  roadNet-PA.mtx        3083796   12.69     2.09   CUDA    
  cant.mtx              4007383   18.90     8.16   CUDA    
  PR02R.mtx             8185136   18.44    10.03   CUDA    
  af_1_k101.mtx        17550675   25.34     9.72   CUDA    
  ML_Laplace.mtx       27689972   24.82     9.11   CUDA    
  nlpkkt80.mtx         28704672   30.97     6.93   CUDA    
  Cube_Coup_dt0.mtx   127206144   26.59     9.40   CUDA

Figure 24: CUDA vs OpenMP by matrix, Figure 23: CUDA vs OpenMP by NZ elements

One final observation is the logarithmic linear regression that can be observed with the CUDA result between the GFlops and the NZ elements in Figure 25. This relationship shows us the capability of CUDA to scale efficiently for larger matrices.

Figure 25: Linear regression on CUDA parallelization by NZ elements

Conclusion

To conclude, the results of the OpenMP implementation reveal that the Compressed Sparse Row (CSR) format outperforms the ELLPACK format, and that horizontal unrolling is more efficient than vertical unrolling. However, increasing the level of unrolling does not appear to have a significant impact on the efficiency of the kernel, as it may lead to suboptimal memory allocation and access. The GFlops exhibit a positive correlation with the number of threads utilized with OpenMP, and the impact of non-zero elements (NZ) is also evident. The optimal algorithm for unrolling is a level two horizontal unrolling, followed by a level two vertical unrolling, with some matrices achieving around 5 speedups for 16 threads using OpenMP and a level two horizontal unrolling.

On the other hand, the CUDA implementation indicates that a two-dimensional (2D) block is more effective than a one-dimensional (1D) block and that the ELLPACK format is more efficient than the CSR format. However, unrolling is not efficient in CUDA, and the most effective BlockSize is 16, while the worst is 64. There is a correlation between the GFlops and the number of non-zero (NZ) elements, and the reduction necessitates synchronization between threads, which may result in an inefficient algorithm for too many threads. Lastly, by transposing the JA and AS arrays of the ELLPACK format and employing a distinct parallelization methodology using CUDA, a more efficient algorithm can be achieved.

Find more of the code here.

References

[1] T. Davis. Wilkinson’s sparse matrix definition (2007)

[2] Salvatore Filippone, Valeria Cardellini, Davide Barbieri Alessandro Fanfarillo (2017), Sparse Matrix-Vector Multiplication on GPGPUs

[3] Nathan Bell, Michael Garland (2008), Efficient Sparse Matrix-Vector Multiplication on CUDA

[4] John Mellor-Crummey, John Garvin (2003), Optimizing Sparse Matrix-Vector Product Computations Using Unroll and Jam

[5] Salvatore Filippone (2023), Small Scale Parallel Programming, Cranfield University

Learn more

Solving the Traveling Salesman Problem with Parallel Computing

Multiple Linear Regression, Gradient Descent /w Python