Enhancing Performance of Shared Memory Distributed System


Proposed Abstract

Sparse matrix-vector multiplication is an important kernel that often runs inefficiently on superscalar RISC processors. This paper describes techniques that increase instruction-level parallelism and improve the performance of the kernel. The techniques include reordering to reduce cache misses, blocking to reduce load instructions, and prefetching to prevent multiple load/store units from stalling simultaneously (Agarwal, Gustavson, & Zubair, 2002). The techniques improve performance from about 40 MFLOPS (on a well-ordered matrix) to more than 100 MFLOPS on a machine with a peak rate of 266 MFLOPS. The techniques are applicable to other superscalar RISC processors as well; for example, they have improved performance on a Sun UltraSPARC I workstation.

Proposed Literature Review

The code assumes that the matrix is stored in a compressed-row format, but the same considerations apply to other storage formats that support general sparsity patterns. The inner loop of the code loads a(jp), colind(jp), and x(colind(jp)), and performs one multiply-add operation. While a and colind are loaded with a stride-1 access pattern, x(colind(jp)) may be any element of x (Agarwal, Gustavson, & Zubair, 2002).
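A minimal sketch in C of the kernel just described, assuming 0-based indexing and a row-pointer array rowptr whose entries delimit each row; the array names follow the text, but the signature is illustrative rather than the paper's exact code:

    /* Compressed-row (CRS) sparse matrix-vector product y = A*x. */
    void spmv_crs(int n, const int *rowptr, const int *colind,
                  const double *a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int jp = rowptr[i]; jp < rowptr[i + 1]; jp++) {
                /* Three loads per iteration: a[jp], colind[jp], and
                 * x[colind[jp]]; a and colind are stride-1, while the
                 * access to x is indirect and may touch any element. */
                sum += a[jp] * x[colind[jp]];
            }
            y[i] = sum;
        }
    }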

There are four potential performance problems in this code. First, the accesses to a and colind generate many cache misses, one per cache line (the stride-1 access ensures that an entire line is used before it is evicted from the cache). Depending on the number of nonzeros in A and on the details of the iterative algorithm in which the matrix-vector multiplication is used, these misses can occur in the first-level cache or in a cache farther from the processor. Second, the accesses to x can have poor spatial and temporal locality, and can therefore generate even more cache misses (Agarwal, Gustavson, & Zubair, 2002). Third, the inner loop issues three loads, a(jp), colind(jp), and x(colind(jp)), per multiply-add, that is, per two floating-point operations, so the ratio of values loaded per floating-point operation is 3/2 = 1.5; the code's performance is therefore limited by the processor's load/store units. Finally, the conversion of colind(jp) from an integer index to a byte offset from the beginning of x, which most processors require for indirect addressing, costs the integer ALUs an additional instruction in every iteration. This section describes four techniques that cope with these problems.
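As a concrete illustration of the blocking technique named in the abstract, the sketch below assumes the nonzeros have been packed into 1x2 blocks (two horizontally adjacent values sharing one stored column index, zero-padded where necessary); this is a hedged example of the general idea, not the paper's exact kernel. Each inner-loop iteration now issues five loads (one index, two matrix values, two elements of x) for four floating-point operations, cutting the load-to-flop ratio from 1.5 to 1.25:

    /* CRS-like product with 1x2 value blocks; browptr, bcolind, and ba
     * are hypothetical names for the blocked structure. Assumes x has
     * one element of padding so that x[c + 1] is always a valid access. */
    void spmv_crs_1x2(int n, const int *browptr, const int *bcolind,
                      const double *ba, const double *x, double *y)
    {
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int bp = browptr[i]; bp < browptr[i + 1]; bp++) {
                int c = bcolind[bp];          /* one index load per block */
                sum += ba[2 * bp]     * x[c];
                sum += ba[2 * bp + 1] * x[c + 1];
            }
            y[i] = sum;
        }
    }

Fewer index loads also mean fewer integer index-to-offset conversions, which mitigates the fourth problem as a side effect.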

Reducing cache misses through bandwidth reduction

The bandwidth of a sparse matrix is the maximum distance, in diagonals, between two nonzero elements of the matrix. Matrix-reordering algorithms that reduce the bandwidth of a matrix have been proposed since the late 1960s as a way to reduce fill and work in sparse-matrix factorizations. The first such technique, based on a breadth-first traversal of the graph underlying the matrix, was invented by Cuthill and McKee. A simple modification of their technique, which reverses the ordering produced by the Cuthill-McKee algorithm, was later found to be even more effective in sparse factorizations (Satish, Gropp, & Barry, 2006).
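The Cuthill-McKee idea can be sketched as a breadth-first traversal that emits vertices in visitation order; reversing that order yields the reverse Cuthill-McKee (RCM) permutation. The sketch below is a simplified illustration (function and variable names are assumptions): it takes the matrix's symmetric adjacency structure in compressed-row form, assumes the graph is connected, and omits the degree-sorted neighbor visits and peripheral starting vertex that production implementations use:

    #include <stdlib.h>

    /* Compute an RCM ordering: perm[k] is the original index of the
     * vertex placed in position k of the new ordering. */
    void reverse_cuthill_mckee(int n, const int *rowptr, const int *colind,
                               int start, int *perm)
    {
        char *visited = calloc(n, 1);
        int head = 0, tail = 0;
        perm[tail++] = start;        /* perm doubles as the BFS queue */
        visited[start] = 1;
        while (head < tail) {
            int v = perm[head++];
            for (int jp = rowptr[v]; jp < rowptr[v + 1]; jp++) {
                int w = colind[jp];
                if (!visited[w]) {
                    visited[w] = 1;
                    perm[tail++] = w;
                }
            }
        }
        /* Reversing the Cuthill-McKee order gives RCM. */
        for (int i = 0; i < n / 2; i++) {
            int t = perm[i]; perm[i] = perm[n - 1 - i]; perm[n - 1 - i] = t;
        }
        free(visited);
    }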

Das et al. proposed the reordering of sparse matrices using a bandwidth-reducing technique in order to reduce the number of cache misses generated by accesses to x. Temam and Jalby analyzed the number ...