Consider two multiprocessor machines with the following configurations:
(a) Machine 1: a NUMA machine with two processors, each with 512 MB of local memory. Local memory access latency is 20 cycles per word and remote memory access latency is 60 cycles per word.
(b) Machine 2: a UMA machine with two processors sharing 1 GB of memory, with an access latency of 40 cycles per word.
Suppose an application has two threads, one running on each processor, and each thread needs to access an entire array of 4096 words. Is it possible to partition this array across the local memories of the NUMA machine so that the application runs faster on it than on the UMA machine? If so, specify the partitioning. If not, by how many cycles would the UMA memory latency have to be worsened for some partitioning on the NUMA machine to run faster than the UMA machine? Assume that memory operations dominate the execution time.
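One way to approach the question is to try every possible split of the array between the two local memories and compare the result with the UMA time. The sketch below does that under assumptions the problem does not state explicitly: the two threads run in parallel, the run finishes when the slower thread finishes, and there is no caching or memory contention. The function and variable names are illustrative only.

```python
ARRAY_WORDS = 4096
NUMA_LOCAL, NUMA_REMOTE = 20, 60   # cycles per word, from the problem statement
UMA_LATENCY = 40                   # cycles per word, from the problem statement


def numa_time(words_on_node0):
    """Cycles until both threads finish when words_on_node0 words live in
    processor 0's local memory and the rest in processor 1's local memory.
    Assumption: the slower thread determines the run time."""
    words_on_node1 = ARRAY_WORDS - words_on_node0
    thread0 = NUMA_LOCAL * words_on_node0 + NUMA_REMOTE * words_on_node1
    thread1 = NUMA_REMOTE * words_on_node0 + NUMA_LOCAL * words_on_node1
    return max(thread0, thread1)


def uma_time(latency=UMA_LATENCY):
    """Cycles for one thread to read the whole array on the UMA machine.
    Assumption: both threads proceed in parallel without contention."""
    return latency * ARRAY_WORDS


# Exhaustively search all splits for the one that minimizes the NUMA run time.
best_split = min(range(ARRAY_WORDS + 1), key=numa_time)
best_numa = numa_time(best_split)

# Smallest whole-cycle increase in UMA latency that makes NUMA strictly faster.
extra = 0
while uma_time(UMA_LATENCY + extra) <= best_numa:
    extra += 1

print("best split (words on node 0):", best_split)
print("best NUMA time:", best_numa, "cycles")
print("UMA time:", uma_time(), "cycles")
print("extra UMA cycles per word needed:", extra)
```

Under these assumptions the best split is an even one, 2048 words per local memory, giving 2048 × 20 + 2048 × 60 = 163840 cycles per thread. That equals the UMA time of 4096 × 40 = 163840 cycles, so no partitioning makes the NUMA machine strictly faster. Any increase in the UMA latency tips the balance: with whole-cycle latencies, one additional cycle per word (41 cycles) is enough.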