Consider the following code that adds two matrices A and B and stores the result in a matrix C:
for (i = 0 to 15) {
    for (j = 0 to 63) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
Suppose we have a quad-core multiprocessor and the elements of the matrices A, B, and C are stored in row-major order. Which one of the following two parallelizations is better, and why? What about when the matrices are stored in column-major order?
(a) For each Pk in {0, 1, 2, 3}:
for (i = 0 to 15) {
    // Inner Loop Parallelization
    for (j = Pk*15 + Pk to (Pk+1)*15 + Pk) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
(b) For each Pk in {0, 1, 2, 3}:
// Outer Loop Parallelization
for (i = Pk*3 + Pk to (Pk+1)*3 + Pk) {
    for (j = 0 to 63) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
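One way to compare the two options is to look at which memory locations each core touches under each layout. The Python sketch below is only an illustration, not a definitive answer: the function names (inner_split, outer_split, offset, contiguous_runs) are hypothetical, the loop bounds are copied from (a) and (b) and treated as inclusive, and the number of contiguous runs of element offsets per core is used as a rough stand-in for spatial locality. The question gives no cache line size, so no real cache is modelled.

```python
ROWS, COLS, CORES = 16, 64, 4  # matrix shape and core count from the question


def inner_split(pk):
    """Option (a): core pk visits every row but only its own block of columns."""
    lo, hi = pk * 15 + pk, (pk + 1) * 15 + pk  # inclusive bounds, as in (a)
    return [(i, j) for i in range(ROWS) for j in range(lo, hi + 1)]


def outer_split(pk):
    """Option (b): core pk visits its own block of rows and every column."""
    lo, hi = pk * 3 + pk, (pk + 1) * 3 + pk  # inclusive bounds, as in (b)
    return [(i, j) for i in range(lo, hi + 1) for j in range(COLS)]


def offset(i, j, layout):
    """Element offset of [i][j] from the start of the matrix (assumed dense)."""
    if layout == "row":
        return i * COLS + j   # row-major: a row is contiguous
    return j * ROWS + i       # column-major: a column is contiguous


def contiguous_runs(cells, layout):
    """Count maximal runs of consecutive offsets touched by one core."""
    offs = sorted(offset(i, j, layout) for i, j in cells)
    return 1 + sum(1 for a, b in zip(offs, offs[1:]) if b != a + 1)


for layout in ("row", "column"):
    for name, split in (("(a) inner", inner_split), ("(b) outer", outer_split)):
        runs = [contiguous_runs(split(pk), layout) for pk in range(CORES)]
        print(layout, name, runs)
```

Under these assumptions, each core in (a) touches 16 separate runs of 16 elements in row-major order, while each core in (b) touches a single run of 256 elements. In column-major order the picture flips: (a) gives one run of 256 elements per core and (b) gives 64 runs of 4. If fewer, longer runs are taken to mean better locality and less sharing of cache lines between cores, this suggests (b) for row-major and (a) for column-major storage.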