Consider the following code that adds two matrices A and B and stores the result in a matrix C:
for (i = 0 to 15) {
    for (j = 0 to 63) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
Suppose we have a quad-core multiprocessor and the elements of the matrices A, B, and C are stored in row-major order. Which of the following two parallelizations is better, and why? What about when the matrices are stored in column-major order?
(a) For each Pk in {0, 1, 2, 3}:
for (i = 0 to 15) {
    // Inner Loop Parallelization
    for (j = Pk*15 + Pk to (Pk+1)*15 + Pk) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
(b) For each Pk in {0, 1, 2, 3}:
// Outer Loop Parallelization
for (i = Pk*3 + Pk to (Pk+1)*3 + Pk) {
    for (j = 0 to 63) {
        C[i][j] = A[i][j] + B[i][j];
    }
}


Quizplus · Accepted Answer

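When the matrices are stored in row-major order, Parallelization (b) is better. Each processor works on four complete rows, which occupy one contiguous block of memory, so it gets the full benefit of spatial locality in its cache and, assuming the arrays are aligned to cache-line boundaries, shares no cache lines with the other processors. In (a), each processor touches only a 16-element segment of every row. Its segments are separated in memory by the row size (64 elements) and sit directly next to segments owned by other processors, so, depending on the cache-line size, several processors may write to the same line of C and cause extra coherence misses. When the elements are stored in column-major order the situation is reversed and Parallelization (a) is better: each processor's 16 columns form one contiguous block, whereas in (b) each processor owns only 4 consecutive elements of every column and is likely to share cache lines with the other processors.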
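
The reasoning about memory layout can be checked with a small address model. The sketch below is not part of the original question: the helper names (`offset`, `owned`, `contiguous_runs`, `shared_lines`) and the cache-line sizes passed in are assumptions chosen for illustration, and it models only which element offsets each core touches, not a real cache.

```python
ROWS, COLS, CORES = 16, 64, 4


def offset(i, j, layout):
    """Element offset of M[i][j] from the start of the matrix."""
    return i * COLS + j if layout == "row" else j * ROWS + i


def owned(pk, scheme):
    """Index pairs (i, j) that core pk handles under scheme 'a' or 'b'."""
    if scheme == "a":  # inner-loop split: columns 16*pk .. 16*pk + 15
        return [(i, j) for i in range(ROWS)
                for j in range(16 * pk, 16 * pk + 16)]
    # outer-loop split: rows 4*pk .. 4*pk + 3
    return [(i, j) for i in range(4 * pk, 4 * pk + 4) for j in range(COLS)]


def contiguous_runs(pk, scheme, layout):
    """Number of maximal contiguous memory runs holding core pk's elements."""
    offs = sorted(offset(i, j, layout) for i, j in owned(pk, scheme))
    return 1 + sum(1 for a, b in zip(offs, offs[1:]) if b != a + 1)


def shared_lines(scheme, layout, line_elems):
    """Cache lines touched by more than one core (false-sharing candidates).

    line_elems is an assumed number of matrix elements per cache line.
    """
    owners = {}
    for pk in range(CORES):
        for i, j in owned(pk, scheme):
            line = offset(i, j, layout) // line_elems
            owners.setdefault(line, set()).add(pk)
    return sum(1 for cores in owners.values() if len(cores) > 1)


for layout in ("row", "col"):
    for scheme in ("a", "b"):
        print(layout, scheme,
              "runs per core:", contiguous_runs(0, scheme, layout),
              "shared lines (16 elems/line):", shared_lines(scheme, layout, 16))
```

Under this model, each core's data forms a single contiguous run for (b) in row-major order and for (a) in column-major order, and is fragmented in the other two combinations. Whether the fragmented cases also share cache lines between cores depends on the assumed line size, which is why `line_elems` is a parameter rather than a fixed value.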
