A Comparative Study of Automatic Vectorizing Compilers

David Levine * David Callahan † Jack Dongarra ‡

Abstract. We compare the capabilities of several commercially available, vectorizing Fortran compilers using a test suite of Fortran loops. We present the results of compiling and executing these loops on a variety of supercomputers, mini-supercomputers, and mainframes.

1 Introduction

This paper describes the use of a collection of Fortran loops to test the analysis capabilities of automatic vectorizing compilers. An automatic vectorizing compiler is one that takes code written in a serial language (usually Fortran) and translates it into vector instructions. The vector instructions may be machine specific, or they may be expressed in a source form such as the proposed Fortran 90 array extensions or subroutine calls to a vector library.

Most of the loops in the test suite were written by people involved in the development of vectorizing compilers, although we wrote several ourselves. Each loop tests a compiler for a specific feature. These loops reflect constructs whose vectorization ranges from easy to challenging to extremely difficult. We have collected the results from compiling and executing these loops using commercially available, vectorizing Fortran compilers.

The results reported here expand on our earlier work [3]. In that paper, we focused principally on analyzing each compiler’s output listing. For the present study, we ran the loops in both scalar and vector modes. In addition, the set of loops has been expanded.

The remainder of this paper is organized into eight sections. Section 2 describes our classification scheme for the loops used in the test. In Section 3 we describe the structure of the test program. In Section 4 we describe the methodology used to perform the test. Section 5 reports on the number of loops that vectorized according to the compiler’s output listing. Section 6 presents two aspects of the speedup results. In Section 7 we discuss our model of optimal vector performance and present the results of comparing the actual performance with the model. Section 8 discusses several aspects of the test. In Section 9 we make some remarks about future work.

*Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL 60439-4801. This work was supported by the Applied Mathematical Sciences subprogram of the Office of Energy Research, U. S. Department of Energy, under Contract W-31-109-Eng-38.
†Tera Computer, 400 North 34th Street, Suite 300, Seattle, WA 98103
‡Computer Science Department, University of Tennessee, Knoxville, TN 37996-1301. This work was supported in part by the Applied Mathematical Sciences Research Program, Office of Energy Research, U. S. Department of Energy, under Contract DE-AC05-84OR21400 and in part by NSF cooperative agreement CCR-8809615.

2 Classification of Loops

The objective of the test suite is to test four broad areas of a vectorizing compiler: dependence analysis, vectorization, idiom recognition, and language completeness. All of the loops in this suite are classified into one of these categories.

We define all terms and transformation names but discuss dependence analysis and program transformation only briefly. Recent discussions of these topics can be found in Allen and Kennedy [2], Padua and Wolfe [11], and Wolfe [14]. For a practical exposition of the application of these techniques, see Levesque and Williamson [6].

2.1 Dependence Analysis

Dependence analysis comprises two areas: global data-flow analysis and dependence testing. Global data-flow analysis refers to the process of collecting information about array subscripts. Dependence testing refers to the process of testing for memory overlaps between pairs of variables in the context of the global data-flow information.

Dependence analysis is the heart of vectorization, but it can be done with very different levels of sophistication ranging from simple pattern matching to complicated procedures that solve systems of linear equations. Many of the loops in this section test the aggressiveness of the compiler in normalizing subscript expressions into linear form for the purpose of enhanced dependence testing.

1. Linear Dependence Testing. Given a pair of array references whose subscripts are linear functions of the loop control variables that enclose the references, decide whether the two references ever access the same memory location. When the references do interact, additional information can be derived to establish the safety of loop restructuring transformations.

2. Induction Variable Recognition. Recognize auxiliary induction variables (e.g., variables defined by statements such as K = K + 1 inside the loop). Once recognized, occurrences of the induction variable can be replaced with expressions involving loop control variables and loop invariant expressions.

3. Global Data-Flow Analysis. Collect global (entire subroutine) data-flow information, such as constant propagation or linear relationships among variables, to improve the precision of dependence testing.

4. Nonlinear Dependence Testing. Given a pair of array references whose subscripts are not linear functions, test for the existence of data dependencies and other information.

5. Interprocedural Data-Flow Analysis. Use the context of a subroutine in a particular program to improve vectorization. Possibilities include in-line expansion, summary information (e.g., which variables may or must be modified by an external routine), and interprocedural constant propagation.

6. Control Flow. Test to see whether certain vectorization hazards exist and whether there are implied dependencies of a statement on statements that control its execution.

7. Symbolics. Test to see whether subscripts are linear after certain symbolic information is factored out or whether the results of dependence testing do not, in fact, depend on the value of symbolic variables.
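
As an illustration of induction variable recognition (item 2 above), the following sketch compares a loop that carries an auxiliary induction variable with the form obtained after substitution. The increment of 2, the array names, and the sizes are assumptions chosen for the example; they are not taken from the test suite. The substituted form is written with Fortran 90 array-section syntax.

```fortran
program induction_variable
  implicit none
  integer, parameter :: n = 10          ! hypothetical loop length
  real :: a(2*n), a_ref(2*n), b(n)
  integer :: i, k

  b = (/ (real(i), i = 1, n) /)
  a = 0.0
  a_ref = 0.0

  ! Original form: k is an auxiliary induction variable, so the
  ! subscript of a_ref is not written in terms of the loop index i.
  k = 0
  do i = 1, n
    k = k + 2
    a_ref(k) = b(i)
  end do

  ! After recognition, k = 2*i on every iteration, so the subscript is
  ! a linear function of i and the loop becomes a strided vector store.
  a(2:2*n:2) = b

  if (all(a == a_ref)) then
    print *, 'substituted form matches the original loop'
  else
    print *, 'mismatch'
  end if
end program induction_variable
```

Once the subscript is linear in the loop index, the linear dependence tests of item 1 can be applied to it.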

### 2.2 Vectorization

A simple vectorizer would recognize single-statement Fortran DO loops that are equivalent to hardware vector instructions. When this strict syntactic requirement is not satisfied, more sophisticated vectorizers can restructure programs so that it is. Here, program restructuring is divided into two categories: transformations to enhance vectorization and idiom recognition. The first is described here, and the other in the next section.

1. **Statement Reordering.** Reorder statements in a loop body to allow vectorization.
2. **Loop Distribution.** Split a loop into two or more loops to allow partial vectorization or more effective vectorization.
3. **Loop Interchange.** Change the order of loops in a loop nest to allow or improve vectorization. In particular, make a vectorizable outer loop the innermost in the loop nest.
4. **Node Splitting.** Break up a statement within a loop to allow (partial) vectorization.
5. **Scalar and Array Expansion.** Expand a scalar into an array or an array into a higher-dimensional array to allow vectorization and loop distribution.
6. **Scalar Renaming.** Rename instances of a scalar variable. Scalar renaming eliminates some interactions that exist only because of reuse of a temporary variable and allows more effective scalar expansion and loop distribution.
7. **Control Flow.** Convert forward branching in a loop into masked vector operations; recognize loop invariant IF’s (loop unswitching).
8. **Crossing Thresholds (Index Set Splitting).** Allow vectorization by blocking into two sets. For example, vectorize the statement \( A(I) = A(N-I) \) by splitting iterations of the I loop into iterations with \( I < N/2 \) and iterations with \( I > N/2 \).
9. **Loop Peeling.** Unroll the first or last iteration of a loop to eliminate anomalies in control flow or attributes of scalar variables.
10. **Diagonals.** Vectorize diagonal accesses (e.g., \( A(I,I) \)).
11. **Wavefronts.** Vectorize two-dimensional loops with dependencies in both dimensions by restructuring the loop for diagonal access.
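
The index set splitting example in item 8 can be sketched as follows. The program runs the serial loop for A(I) = A(N-I) and then the split version, written as two Fortran 90 array assignments, and checks that they agree. The array size and initial values are assumptions, and the sketch places the iteration I = N/2 (which occurs only when N is even) in the second set.

```fortran
program index_set_splitting
  implicit none
  integer, parameter :: n = 10          ! hypothetical array size
  real :: a(n), a_ref(n)
  integer :: i, h

  a_ref = (/ (real(i), i = 1, n) /)
  a = a_ref

  ! Serial reference: the loop as written.
  do i = 1, n - 1
    a_ref(i) = a_ref(n - i)
  end do

  ! h is the largest i with i < n/2 (that is, 2*i < n).
  h = (n - 1) / 2

  ! First set (i <= h): every element read lies above every element
  ! written, so the assignment is a single vector operation.
  a(1:h) = a(n-1:n-h:-1)

  ! Second set (i > h): the elements read were written only by the
  ! first set (or by the same iteration), so this is also one vector
  ! operation.
  a(h+1:n-1) = a(n-h-1:1:-1)

  if (all(a == a_ref)) then
    print *, 'split loops match the serial loop'
  else
    print *, 'mismatch'
  end if
end program index_set_splitting
```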

### 2.3 Idiom Recognition

Idiom recognition refers to the identification of particular program forms that have (presumably faster) special implementations.

1. **Reductions.** Compute a scalar value or values from a vector, such as sum reductions, min/max reductions, dot products, and product reductions.

2. **Recurrences.** Identify special first- and second-order recurrences that have logarithmically faster solutions or hardware support.

3. **Search Loops.** Search for the first or last instance of a condition, possibly saving index value(s).

4. **Packing.** Scatter or gather a sparse vector from or into a dense vector under the control of a bit mask or an indirection vector.

5. **Loop Rerolling.** Vectorize loops where the inner loop has been unrolled.
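
To make the reduction idiom (item 1) concrete, the sketch below computes a sum, a dot product, and a maximum with scalar loops and compares them with the Fortran 90 intrinsics SUM, DOT_PRODUCT, and MAXVAL, which stand in here for whatever special implementation a compiler might substitute. The data are assumptions: small integer values are used so that the comparison is exact, since reordering a floating-point sum can change its rounding in general.

```fortran
program reduction_idioms
  implicit none
  integer, parameter :: n = 8           ! hypothetical vector length
  real :: x(n), y(n), s, dot, big
  integer :: i

  x = (/ (real(i), i = 1, n) /)
  y = 2.0

  ! Scalar forms of three reductions.
  s = 0.0
  dot = 0.0
  big = x(1)
  do i = 1, n
    s = s + x(i)                        ! sum reduction
    dot = dot + x(i) * y(i)             ! dot product
    if (x(i) > big) big = x(i)          ! max reduction
  end do

  ! Idiom forms: each loop above corresponds to one intrinsic.
  if (s == sum(x) .and. dot == dot_product(x, y) &
      .and. big == maxval(x)) then
    print *, 'scalar loops match the intrinsic reductions'
  else
    print *, 'mismatch'
  end if
end program reduction_idioms
```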

### 2.4 Language Completeness

This section tests how effectively the compilers understand the complete Fortran language. Simple vectorizers might limit analysis to DO loops containing only floating point and integer assignments. More sophisticated compilers will analyze all loops and vectorize wherever possible.

1. **Loop Recognition.** Recognize and vectorize loops formed by backward GO TO’s.

2. **Storage Classes and Equivalencing.** Understand the scope of local vs. common storage; correctly handle equivalencing.

3. **Parameters.** Analyze symbolic named constants, and vectorize statements that refer to them.

4. **Nonlogical IF’s.** Vectorize loops containing computed GO TO’s and arithmetic IF’s.

5. **Intrinsic Functions.** Vectorize functions that have elemental (vector) versions such as SIN and COS or known side effects.

6. **Call Statements.** Vectorize statements in loops that contain CALL statements or external function invocations.

7. **Nonlocal GO TO’s.** Vectorize loops that contain branches out of the loop, or RETURN or STOP statements inside the loop.

8. **Vector Semantics.** Load before store, and preserve order of stores.

9. **Indirect Addressing.** Vectorize subscripted subscript references (e.g., A(INDEX(I))) as Gather/Scatter.

10. **Statement Functions.** Vectorize statements that refer to Fortran statement functions.
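
The gather and scatter forms mentioned under indirect addressing (item 9) can be written directly with Fortran 90 vector subscripts, as in the sketch below. The array names and the index vector are assumptions. The index vector is a permutation, because a vector subscript with repeated values may not appear on the left-hand side of an assignment.

```fortran
program gather_scatter
  implicit none
  integer, parameter :: n = 5           ! hypothetical vector length
  real :: a(n), g(n), g_ref(n), s(n), s_ref(n)
  integer :: idx(n), i

  a = (/ 10.0, 20.0, 30.0, 40.0, 50.0 /)
  idx = (/ 3, 1, 5, 2, 4 /)             ! a permutation: no repeats

  ! Scalar forms with subscripted subscripts.
  do i = 1, n
    g_ref(i) = a(idx(i))                ! gather
  end do
  do i = 1, n
    s_ref(idx(i)) = a(i)                ! scatter
  end do

  ! Vector forms.
  g = a(idx)                            ! gather
  s(idx) = a                            ! scatter

  if (all(g == g_ref) .and. all(s == s_ref)) then
    print *, 'vector-subscript forms match the scalar loops'
  else
    print *, 'mismatch'
  end if
end program gather_scatter
```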
3 Test Program Structure

The test program consists of 122 loops that represent different constructs intended to test the analysis capabilities of a vectorizing compiler. Using the classification scheme in Section 2, there are 29 loops in the Dependence Analysis category, 41 loops in the Vectorization category, 24 loops in the Idiom Recognition category, and 28 loops in the Language Completeness category. Also included are 13 additional “control” loops we expect all compilers to be able to vectorize. These allow us to measure the rates of certain basic operations for use with the model discussed in Section 7.

The majority of the test loops operate on one-dimensional arrays; a small number operate on two-dimensional arrays. Most of the loops in the test are fairly short; many are a single statement and others usually no more than several statements. Many of the loops access memory with a stride of one. Each loop is contained in a separate subroutine. A driver routine calls each subroutine with vector lengths of 10, 100, and 1000.

An example loop is shown in Figure 1. Relevant operands are initialized once at the start of the loop. An outer repetition loop is used to increase the granularity of the calculation, thereby avoiding problems with clock resolution. A call to a dummy subroutine is included in each iteration of the repetition loop so that, in cases where the inner loop calculation is invariant with respect to the repetition loop, the compiler is still required to execute each iteration rather than just recognizing that the calculation needs to be done only once.

After execution of the loop is complete, a checksum is computed by using the result array(s). The checksum and the time used are then passed to a check subroutine. The check subroutine verifies the checksum with a precomputed result and prints out the time to execute the loop. The time is calculated by calling a timer at the start of the loop and again at the end of the loop and taking the difference of these times minus the cost of the timing call and the cost of the multiple calls to the dummy subroutine.

4 Test Methodology

The test program is distributed in two files: a driver program in one file, and the test loops in the other. The files were distributed to interested vendors, who were asked to compile the loops without making any changes* using only the compiler options for automatic vectorization. Thus, the use of compiler directives or interactive compilation features to gain additional vectorization was not tested. Vendors were asked to make two separate runs of the test: one using scalar optimizations only, and the other using the same scalar optimizations and, in addition, all automatic vectorization options. Vendors with multiprocessor computers submitted uniprocessor results only. Appendix A contains details of the exact machine configurations and versions of the software used.

The rules require separate compilation of the two files. The rules for compilation of the driver file require that no compiler optimizations be used and that the file not be analyzed interprocedurally to gather information useful in optimizing the test loops.

* One vendor was allowed to (1) separate a 135-way IF-THEN-ELSEIF-ELSE construct in order to overcome a self-imposed limit, and (2) include the array declarations in a common block in the driver program (only) in order to overcome a self-imposed limit on memory allocation size. Neither modification had any impact on performance.

subroutine s111 (ntimes, ld, n, ctime, dtime, a, b, c, d, e, aa, bb, cc)
integer ntimes, ld, n, i, nl
real a(n), b(n), c(n), d(n), e(n), aa(ld, n), bb(ld, n), cc(ld, n)
real t1, t2, second, checksum, ctime, dtime, csid
call init(ld, n, a, b, c, d, e, aa, bb, cc, 's111 ')
t1 = second()
do 1 nl = 1, 2*ntimes
  do 10 i = 2, n, 2
    a(i) = a(i-1) + b(i)
10 continue
  call dummy(ld, n, a, b, c, d, e, aa, bb, cc, 1.)
1 continue
t2 = second() - t1 - ctime - ( dtime * float(2*ntimes) )
checksum = csid(n, a)
call check (checksum, 2*ntimes*(n/2), n, t2, 's111 ')
return
end

Figure 1: Example loop

The file containing the loops was compiled twice—once for the scalar run and once for the vector run. For the scalar run, global (scalar) optimizations were used. For the vector run, in addition to the same global optimizations specified in the scalar run, vectorization and — if available — automatic call generation to optimized library routines, function inlining, and interprocedural analysis were used.

All files were compiled to use 64-bit arithmetic. Most runs were made on standalone systems.* For virtual memory computers, the runs were made with a physical memory and working-set size large enough that any performance degradation from page faults was negligible. In all cases the times reported to us were user CPU time.

After compiling and executing the loops, the vendors sent back the compiler’s output listing (source echo, diagnostics, and messages) and the output of both the scalar and vector runs. We then examined the compiler’s output listings to see which loops had been vectorized, and analyzed the scalar and vector results. In addition to measuring the execution time of the loops, we checked the numerical result in order to verify correctness. However, the check was strictly for correctness of the numerical result; no attempt was made to see whether possibly unsafe transformations had been used.

---

*The Cray Computer and Hitachi runs were not.

5 Number of Loops Vectorized

In this section we discuss the number of loops that were vectorized, as reported by the compiler’s output listing. All of the loops in our test are amenable to some degree of vectorization. For some loops, this may only be partial vectorization; for others, vectorization may require the use of optimized library routines or special hardware.

5.1 Definition of Vectorization

We define a statement as vectorizable if one or more of the expressions in the statement involve array references or may be converted to that form. We define three possible results for a compiler attempting to vectorize a loop. A loop is vectorized if the compiler generates vector instructions for all vectorizable statements in the loop. A loop is partially vectorized if the compiler generates vector instructions for some, but not all, vectorizable statements in the loop. No threshold is defined for what percentage of a loop needs to be vectorized to be listed in this category, only that some expression in a statement in the loop is vectorized. A loop is not vectorized if the compiler does not generate vector instructions for any vectorizable statements within the loop.
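
A minimal sketch of what partial vectorization can look like under these definitions is given below. The loop body, array names, and sizes are hypothetical. The first statement has no loop-carried dependence, while the second is a first-order recurrence that a compiler without recurrence recognition would leave in scalar form. Distributing the loop lets the first statement run as a vector operation while the second stays serial.

```fortran
program partial_vectorization
  implicit none
  integer, parameter :: n = 8           ! hypothetical loop length
  real :: a(n), b(n), c(n), d(n), a_ref(n), d_ref(n)
  integer :: i

  b = (/ (real(i), i = 1, n) /)
  c = 1.0
  a = 0.0
  a_ref = 0.0
  d = 1.0
  d_ref = 1.0

  ! Original loop: two vectorizable statements in one body.
  do i = 2, n
    a_ref(i) = b(i) + c(i)              ! independent iterations
    d_ref(i) = d_ref(i-1) + a_ref(i)    ! carries a dependence on d
  end do

  ! Partially vectorized form after loop distribution.
  a(2:n) = b(2:n) + c(2:n)              ! vector statement
  do i = 2, n
    d(i) = d(i-1) + a(i)                ! remains a scalar loop
  end do

  if (all(a == a_ref) .and. all(d == d_ref)) then
    print *, 'partially vectorized form matches the original loop'
  else
    print *, 'mismatch'
  end if
end program partial_vectorization
```

Under the scoring rules above, a loop compiled this way would be counted as partially vectorized.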

For some loops the Cray Research, FPS Computing, IBM, and NEC compilers generated a runtime IF-THEN-ELSE test which executed either a scalar loop or a vectorized loop. These loops have been scored as either vectorized or not vectorized according to whether or not vectorized code was actually executed at runtime.

The Cray Computer compiler “conditionally vectorized” certain loops. That is, for loops with ambiguous subscripts, a runtime test was compiled that selected a safe vector length.* These loops have been scored as vectorized if the safe vector length was greater than one, and as not vectorized otherwise.
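
The notion of a safe vector length can be checked with a small experiment on the statement A(I) = A(I-7), the example used in the footnote to this paragraph. The sketch below is an assumption about how strip-by-strip vector execution behaves, not a description of the Cray Computer compiler's generated code. Each strip is modeled by a Fortran 90 array assignment, which reads all of its operands before storing any result; the routine name `strip_mined` and the array size are hypothetical.

```fortran
program safe_vector_length
  implicit none
  integer, parameter :: n = 30          ! hypothetical array size
  integer, parameter :: dist = 7        ! dependence distance in a(i) = a(i-7)
  real :: a(n), a_ref(n)
  integer :: i

  ! Serial reference.
  a_ref = (/ (real(i), i = 1, n) /)
  do i = dist + 1, n
    a_ref(i) = a_ref(i - dist)
  end do

  call strip_mined(a, 7)
  print *, 'vector length 7 matches serial: ', all(a == a_ref)

  call strip_mined(a, 8)
  print *, 'vector length 8 matches serial: ', all(a == a_ref)

contains

  ! Execute the loop in strips of at most vl elements. Each strip is
  ! one array assignment, modeling a vector load followed by a store.
  subroutine strip_mined(x, vl)
    real, intent(out) :: x(n)
    integer, intent(in) :: vl
    integer :: is, ie, j

    x = (/ (real(j), j = 1, n) /)
    do is = dist + 1, n, vl
      ie = min(is + vl - 1, n)
      x(is:ie) = x(is-dist:ie-dist)
    end do
  end subroutine strip_mined

end program safe_vector_length
```

With a strip length of 7 every element read was produced by an earlier strip, so the result matches the serial loop. With a strip length of 8 the last element of each full strip reads a value that the same strip has not yet stored, and the comparison fails.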

For a number of loops, the Fujitsu compiler generated scalar code even though the compiler indicated that partial vector code could be generated. In these cases, the compiler listing contained the message “Partial vectorization overhead is too large,” indicating that although partial vectorization was possible, for these loops the compiler considered scalar code more efficient. These loops have been scored as partially vectorized.

Our definition of vectorization counts as vectorized those loops that are recognized by the compiler and automatically replaced by calls to optimized library routines. In some cases a compiler may generate a call to an optimized library routine rather than explicitly generating vector code. Typical examples are certain reduction and recurrence loops. Often the library routines use a mix of scalar and vector instructions; while perhaps not as fast as pure vector loops, since the construct itself is not fully parallel, they are usually faster than scalar execution. In all cases where the compiler automatically generated a call to a library routine, we have scored the loop as vectorized.

*A safe vector length is one that allows the compiler to execute vector instructions and still produce the correct result. For example, the statement A(I)=A(I-7) with loop increment one may be executed in vector mode with any vector length less than or equal to 7.

Table 1: Full Vectorization (122 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V</th>
<th>P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>68.0 (83)</td>
<td>10.7 (13)</td>
<td>21.3 (26)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>60.7 (74)</td>
<td>1.6 (2)</td>
<td>37.7 (46)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>77.9 (95)</td>
<td>8.2 (10)</td>
<td>13.9 (17)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>60.7 (74)</td>
<td>3.3 (4)</td>
<td>36.1 (44)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>72.1 (88)</td>
<td>4.9 (6)</td>
<td>23.0 (28)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>71.3 (87)</td>
<td>16.4 (20)</td>
<td>12.3 (15)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>71.3 (87)</td>
<td>7.4 (9)</td>
<td>21.3 (26)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>77.9 (95)</td>
<td>4.9 (6)</td>
<td>17.2 (21)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>72.1 (88)</td>
<td>5.7 (7)</td>
<td>22.1 (27)</td>
</tr>
<tr>
<td>Average</td>
<td>70.2 (85)</td>
<td>7.0 (8)</td>
<td>22.8 (27)</td>
</tr>
</tbody>
</table>

V – vectorized  
P – partially vectorized  
N – not vectorized  
V/P – fully or partially vectorized

Figure 2: Key to symbols for Tables 1–8, 12–13

5.2 Results

Tables 1–6 list the results of analyzing the compilers’ listings. Each table contains the percentage of loops in each column, followed by the actual number in parentheses. Table 1 summarizes the results for all 122 loops. Table 2 is also a summary of all the loops; here, however, the column V/P counts the loops that were either fully or partially vectorized. Tables 3–6 contain results by category as defined in Section 2.

Table 2: Full and Partial Vectorization (122 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V/P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>78.7 (96)</td>
<td>21.3 (26)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>62.3 (76)</td>
<td>37.7 (46)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>86.1 (105)</td>
<td>13.9 (17)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>63.9 (78)</td>
<td>36.1 (44)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>77.0 (94)</td>
<td>23.0 (28)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>87.7 (107)</td>
<td>12.3 (15)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>78.7 (96)</td>
<td>21.3 (26)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>82.8 (101)</td>
<td>17.2 (21)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>77.9 (95)</td>
<td>22.1 (27)</td>
</tr>
<tr>
<td>Average</td>
<td>77.2 (94)</td>
<td>22.8 (27)</td>
</tr>
</tbody>
</table>

Table 3: Dependence Analysis (29 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V</th>
<th>P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>65.5 (19)</td>
<td>17.2 (5)</td>
<td>17.2 (5)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>69.0 (20)</td>
<td>0.0 (0)</td>
<td>31.0 (9)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>86.2 (25)</td>
<td>0.0 (0)</td>
<td>13.8 (4)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>69.0 (20)</td>
<td>0.0 (0)</td>
<td>31.0 (9)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>82.8 (24)</td>
<td>0.0 (0)</td>
<td>17.2 (5)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>65.5 (19)</td>
<td>24.1 (7)</td>
<td>10.3 (3)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>55.2 (16)</td>
<td>10.3 (3)</td>
<td>34.5 (10)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>86.2 (25)</td>
<td>0.0 (0)</td>
<td>13.8 (4)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>75.9 (22)</td>
<td>6.9 (2)</td>
<td>17.2 (5)</td>
</tr>
<tr>
<td>Average</td>
<td>72.8 (21)</td>
<td>6.5 (1)</td>
<td>20.7 (6)</td>
</tr>
</tbody>
</table>

Table 4: Vectorization (41 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V</th>
<th>P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>73.2 (30)</td>
<td>14.6 (6)</td>
<td>12.2 (5)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>34.1 (14)</td>
<td>4.9 (2)</td>
<td>61.0 (25)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>56.1 (23)</td>
<td>22.0 (9)</td>
<td>22.0 (9)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>58.5 (24)</td>
<td>7.3 (3)</td>
<td>34.1 (14)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>61.0 (25)</td>
<td>14.6 (6)</td>
<td>24.4 (10)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>68.3 (28)</td>
<td>24.4 (10)</td>
<td>7.3 (3)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>78.0 (32)</td>
<td>9.8 (4)</td>
<td>12.2 (5)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>75.6 (31)</td>
<td>12.2 (5)</td>
<td>12.2 (5)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>65.9 (27)</td>
<td>12.2 (5)</td>
<td>22.0 (9)</td>
</tr>
<tr>
<td>Average</td>
<td>63.4 (26)</td>
<td>13.6 (5)</td>
<td>23.0 (9)</td>
</tr>
</tbody>
</table>

Table 5: Idiom Recognition (24 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V</th>
<th>P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>66.7 (16)</td>
<td>4.2 (1)</td>
<td>29.2 (7)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>70.8 (17)</td>
<td>0.0 (0)</td>
<td>29.2 (7)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>87.5 (21)</td>
<td>4.2 (1)</td>
<td>8.3 (2)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>54.2 (13)</td>
<td>4.2 (1)</td>
<td>41.7 (10)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>70.8 (17)</td>
<td>0.0 (0)</td>
<td>29.2 (7)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>87.5 (21)</td>
<td>8.3 (2)</td>
<td>4.2 (1)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>91.7 (22)</td>
<td>4.2 (1)</td>
<td>4.2 (1)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>58.3 (14)</td>
<td>0.0 (0)</td>
<td>41.7 (10)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>87.5 (21)</td>
<td>0.0 (0)</td>
<td>12.5 (3)</td>
</tr>
<tr>
<td>Average</td>
<td>75.0 (18)</td>
<td>2.8 (0)</td>
<td>22.2 (5)</td>
</tr>
</tbody>
</table>

Table 6: Language Completeness (28 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V</th>
<th>P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>64.3 (18)</td>
<td>3.6 (1)</td>
<td>32.1 (9)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>82.1 (23)</td>
<td>0.0 (0)</td>
<td>17.9 (5)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>92.9 (26)</td>
<td>0.0 (0)</td>
<td>7.1 (2)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>60.7 (17)</td>
<td>0.0 (0)</td>
<td>39.3 (11)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>78.6 (22)</td>
<td>0.0 (0)</td>
<td>21.4 (6)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>67.9 (19)</td>
<td>3.6 (1)</td>
<td>28.6 (8)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>60.7 (17)</td>
<td>3.6 (1)</td>
<td>35.7 (10)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>89.3 (25)</td>
<td>3.6 (1)</td>
<td>7.1 (2)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>64.3 (18)</td>
<td>0.0 (0)</td>
<td>35.7 (10)</td>
</tr>
<tr>
<td>Average</td>
<td>73.4 (20)</td>
<td>1.6 (0)</td>
<td>25.0 (7)</td>
</tr>
</tbody>
</table>

5.3 Analysis of Results

On average, 70% of the loops were vectorized, and 77% were vectorized or partially vectorized. The best results were 78% and 88%, respectively. Of the 122 loops, only two were not vectorized or partially vectorized by any of the compilers; both loops are vectorizable. There is probably no significant difference between compilers within a few percent of each other. Slight differences may be due to different hardware, the availability of special software libraries, the architecture of a machine being better suited to executing scalar or parallel code for certain constructs, or the makeup of the loops used in our test.

From Table 1 we see that the Cray Research and IBM compilers vectorized the most loops. A large number of other compilers are grouped closely together and only a few loops behind these two. Comparing Table 1 to Table 2, we see that counting partially vectorized loops in the totals allows the Fujitsu compiler to vectorize the most loops. It is interesting to note, however, that of the 20 loops we counted as partially vectorized by the Fujitsu compiler, only two actually resulted in (partial) vector code being executed at runtime. For the other 18 loops the Fujitsu compiler made the decision that it would not be cost effective to partially vectorize them. The Convex compiler also did a significant amount of partial vectorization.

Tables 3–6 show that some compilers did particularly well in certain categories. The Cray Research, FPS Computing, and IBM compilers had the best results in the Dependence Analysis category. The Convex, Hitachi, and IBM compilers had the best results in the Vectorization category. The Cray Research, Fujitsu, Hitachi, and NEC compilers had the best results in the Idiom Recognition category. In the Language Completeness category the Cray Research and IBM compilers had the best results. The Vectorization category seemed the most difficult, with approximately 10% fewer loops vectorized overall than for the other sections.

Certain sections seemed fairly easy, with most vendors vectorizing or partially vectorizing almost all of the loops. Using the classification scheme of Section 2 these sections were linear dependence testing, global data-flow analysis, statement reordering, loop distribution, node splitting, scalar renaming, control flow, diagonals, loop rerolling, parameters, intrinsic functions, indirect addressing, and statement functions.

In some sections, while many vendors vectorized or partially vectorized most loops, various individual vendors did not do particularly well. These sections were induction variable recognition, interprocedural data-flow analysis, symbolics, scalar and array expansion, reductions, search loops, packing, and nonlogical IF’s.

Some sections were difficult for many compilers. Typically, at least half the vendors missed at least some, and sometimes most, of the loops in these sections. These sections were control flow, loop interchange, index set splitting, loop peeling, recurrences, loop recognition, storage classes and equivalencing, and nonlocal GO TO’s.

A few sections were particularly difficult, with only one or two compilers doing any vectorization at all. These sections were nonlinear dependence testing, wavefronts, and call statements.

We found that some vendors with approximately equal results did much better in one section than another. Certain induction variable tests, interprocedural data-flow analysis, loop interchange, recurrences, loop recognition, storage classes and equivalence statements, and loops with exits were the sections that showed the greatest variation. We conclude that the compiler vendors have focused their efforts on particular subsets of the features tested by the suite. Possible reasons might include hardware differences or (self-imposed) limits on compilation time, compilation memory use, or the size of the generated code.

Individual results, on a loop-by-loop basis, may be found in Appendix B.

6 Speedup

The goal of vectorization is for the vectorized program to execute in less time than the unvectorized program. The metric used is the speedup, $s_p$, defined as $s_p = t_s/t_v$, where $t_s$ is the scalar time and $t_v$ is the vector time. In this section we look at two aspects of speedup. First, does the vector code run slower than the corresponding scalar code? Second, how large a speedup can be gained with vectorization?

6.1 Vectorized Loops Revisited

Ideally the speedup from vectorization (or partial vectorization) should be as large as possible. At a minimum, though, the vector code should run at least as fast as the scalar code. However, this minimum is not always achieved, particularly at short vector lengths where there may not be enough work in the loop to overcome the vector startup cost.

Tables 7 and 8 revisit the results in Table 1. The number of loops in each of the different categories is again taken from the compiler listing. In Tables 7 and 8, however, we have not counted as vectorized or partially vectorized any loops where $s_p < .95$.* The results in Table 7 are for vector length 100, and the results in Table 8 are for vector length 1000. We have not presented these results for vector length ten since almost all vendors suffer some performance degradation for short vectors.
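
The counting rule used for Tables 7 and 8 can be sketched as follows: compute $s_p = t_s/t_v$ for each loop and drop any loop with $s_p < .95$ from the vectorized counts. The timings below are hypothetical values chosen for illustration, not measurements from the study.

```fortran
program speedup_threshold
  implicit none
  integer, parameter :: nloops = 4      ! hypothetical number of loops
  real, parameter :: threshold = 0.95
  real :: ts(nloops), tv(nloops), sp(nloops)
  integer :: i

  ! Hypothetical scalar and vector times, in seconds.
  ts = (/ 4.0, 2.0, 1.0, 3.0 /)
  tv = (/ 1.0, 2.0, 1.25, 0.5 /)

  sp = ts / tv                          ! s_p = t_s / t_v for each loop

  do i = 1, nloops
    if (sp(i) < threshold) then
      print '(a,i2,a,f6.2,2x,a)', 'loop', i, ': s_p =', sp(i), 'not counted'
    else
      print '(a,i2,a,f6.2,2x,a)', 'loop', i, ': s_p =', sp(i), 'counted'
    end if
  end do

  print '(a,i2,a,i2)', 'loops counted: ', count(sp >= threshold), ' of ', nloops
end program speedup_threshold
```

In this sketch the third loop, whose vector run is slower than its scalar run, is the only one excluded.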

*We use .95 instead of 1 to allow for the possibility of measurement error.

Table 7: Loops Vectorized ($s_p > .95$, Vector length = 100, 122 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V</th>
<th>P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>68.0 (83)</td>
<td>9.0 (11)</td>
<td>23.0 (28)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>60.7 (74)</td>
<td>0.8 (1)</td>
<td>38.5 (47)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>77.9 (95)</td>
<td>7.4 (9)</td>
<td>14.8 (18)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>49.2 (60)</td>
<td>2.5 (3)</td>
<td>48.4 (59)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>71.3 (87)</td>
<td>3.3 (4)</td>
<td>25.4 (31)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>68.9 (84)</td>
<td>13.1 (16)</td>
<td>18.0 (22)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>69.7 (85)</td>
<td>1.6 (2)</td>
<td>28.7 (35)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>71.3 (87)</td>
<td>4.1 (5)</td>
<td>24.6 (30)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>72.1 (88)</td>
<td>2.5 (3)</td>
<td>25.4 (31)</td>
</tr>
<tr>
<td>Average</td>
<td>67.7 (82)</td>
<td>4.9 (6)</td>
<td>27.4 (33)</td>
</tr>
</tbody>
</table>

Table 8: Loops Vectorized ($s_p > .95$, Vector length = 1000, 122 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V</th>
<th>P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>68.0 (83)</td>
<td>8.2 (10)</td>
<td>23.8 (29)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>60.7 (74)</td>
<td>0.8 (1)</td>
<td>38.5 (47)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>77.9 (95)</td>
<td>7.4 (9)</td>
<td>14.8 (18)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>54.9 (67)</td>
<td>2.5 (3)</td>
<td>42.6 (52)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>71.3 (87)</td>
<td>4.1 (5)</td>
<td>24.6 (30)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>69.7 (85)</td>
<td>15.6 (19)</td>
<td>14.8 (18)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>70.5 (86)</td>
<td>1.6 (2)</td>
<td>27.9 (34)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>73.0 (89)</td>
<td>4.1 (5)</td>
<td>23.0 (28)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>72.1 (88)</td>
<td>2.5 (3)</td>
<td>25.4 (31)</td>
</tr>
<tr>
<td>Average</td>
<td>68.7 (83)</td>
<td>5.2 (6)</td>
<td>26.1 (31)</td>
</tr>
</tbody>
</table>

The results in Tables 7 and 8 are mostly consistent with Table 1. Four of the compilers show no degradation on any of the vectorized loops. Three others show a degradation on only one or two loops. Only two compilers show a degradation on any significant number of loops. The results for partial vectorization are also fairly consistent with Table 1, with only one compiler showing any serious number of loops being degraded. There is a large variance in the test suite as to which loops have degraded performance. No particular trend is obvious.

Two compilers also suffered noticeable performance degradations (below 90%) for a significant number of loops (10 or more) that were not vectorized. We believe that the attempt to vectorize somehow interfered with the generation of good scalar code. We view this as a performance bug and have advised the vendors. Other than these cases, the vectorizers rarely generated code that was inferior to the scalar code on vector lengths of 100 or more. An exception is the nine loops for which the CRAY-2 compiler generated vector code with a safe vector length of one. These loops, although scored as not vectorized, had vector execution times that were frequently twice the scalar execution times.

6.2 Aggregate Speedup Results

The speedup that can be achieved on a particular vector computer depends on several factors: the speed of the vector hardware relative to the speed of the scalar hardware, the inherent vector parallelism in the code of interest, and the sophistication of the compiler in detecting opportunities to generate code to run on the vector hardware. From the perspective of our test, we would like to measure the speedup achieved just from the compiler’s vectorization capabilities. However, speedups are too strongly influenced by architecture and implementation to be meaningful indicators of compiler performance. Therefore, we prefer not to give speedup results for individual vendors, which might be misinterpreted as representing compiler performance only. Instead, we present speedup statistics using the aggregate results from all vendors.

Table 9 presents a summary of the speedup results of all vendors. The first four rows present results according to the classification scheme in Section 2. Results are given for vector lengths of 10, 100, and 1000 and, in the last column, the sum over all three vector lengths. Each column contains the arithmetic and harmonic means of the speedups for the loops in that section. The results in the last row are summed over all four sections.

Table 10 contains aggregate statistics for three different levels of vectorization. The format of the table is similar to that of Table 9. The first row contains speedup statistics for the 771 loops scored as fully vectorized. The second row contains speedup statistics for the 848 loops scored as either fully or partially vectorized. The last row contains speedup statistics for the 77 loops that were partially vectorized.

6.3 Discussion of Speedup

As might be expected, at the relatively short vector length of 10, the speedups were not very large. This is particularly true of the Idiom Recognition section, where the methods used to vectorize some of the loops are not amenable to the full speedup that can be provided by the hardware. At vector length 100 most speedups were between three and six. At the longest vector length, 1000, the individual speedups were slightly higher for most vendors. Three vendors, however, had very large average speedups (29.4, 33.2, and 39.8) over the scalar speed.

The choice of mean clearly affects the results. In the Vectorization section, the arithmetic mean at vector length 1000 is 21.32, while the harmonic mean is only 1.91. These results show that a relatively small number of large speedups can greatly affect the arithmetic mean. McMahon [10] and Smith [12] discuss the different means.
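The gap between the two means can be illustrated with a short sketch. The speedup values below are hypothetical and are not taken from the test results; they only show how two very large speedups dominate the arithmetic mean while barely moving the harmonic mean.

```python
def arithmetic_mean(xs):
    """Sum of the values divided by their count."""
    return sum(xs) / len(xs)


def harmonic_mean(xs):
    """Count of the values divided by the sum of their reciprocals."""
    return len(xs) / sum(1.0 / x for x in xs)


# Hypothetical speedups: eight modest values and two very large ones.
speedups = [0.9, 1.1, 1.2, 1.5, 2.0, 2.5, 3.0, 4.0, 60.0, 90.0]

am = arithmetic_mean(speedups)
hm = harmonic_mean(speedups)
print(f"arithmetic mean = {am:.2f}, harmonic mean = {hm:.2f}")
```

The harmonic mean weights each loop by its execution time rather than by its speedup, which is why slow loops hold it down.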

If we compare the last row of Table 9 with the first two rows in Table 10, we see better speedups at all vector lengths when we consider only the loops fully or partially vectorized. Of course, which loops were included, and how many, varies for each vendor.

In several loops in the test suite, not all statements can be vectorized. A compiler can still improve performance by partial vectorization — vectorizing some, but not all, of the statements. As Table 10 shows, the speedups from partial vectorization are significantly less than those from full vectorization. There are several reasons for this result. First, since by definition partial vectorization vectorizes only some of the statements in a loop, others still run at scalar speeds. Second, our definition of partial vectorization classifies as such a loop that uses any vector instructions, no matter how much of the loop is executed in scalar mode. Finally, many techniques for partial vectorization introduce extra work, such as extra loads and stores and additional loop overhead, which is not required in the original loop.

Even with these caveats, we see from the last row in Table 10 that there is still a benefit to be gained from partial vectorization, but primarily at the longer vector lengths. Even more so than full vectorization, partial vectorization (at least on the test loops) degrades performance at a vector length of ten.

7 Percent Vectorization

In this section we focus on the performance of the compiler independent of the computer architecture. We do this by developing a machine-specific model of what optimal vector performance is for each of the loops in our test suite. We then compare the optimal performance predicted by this model with the actual vector execution results to determine the percent of the optimal vector performance actually achieved.

7.1 A Model of Compiler Performance

A simple model of vector performance as a function of vector length is given by the formula [8]

\[ t = t_o + n t_e, \]  (1)

where \( t \) is the time to execute a vector loop of length \( n \), \( t_o \) is the vector startup time, and \( t_e \) is the time to execute a particular vector element. Equivalent to (1) is the well-known model of Hockney (see Hockney and Jesshope [5]),

\[ t = r_\infty^{-1} (n + n_{1/2}), \]  (2)

where \( r_\infty \) is the asymptotic performance rate and \( n_{1/2} \) is the vector length necessary to achieve one half the asymptotic rate. Equations (1) and (2) can be shown to be equivalent if we use the definitions \( r_\infty = t_e^{-1} \) and \( n_{1/2} = t_o/t_e \) [5].
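The equivalence follows in one step by factoring $t_e$ out of (1) and substituting the two definitions:

$$t = t_o + n t_e = t_e \left( n + \frac{t_o}{t_e} \right) = r_\infty^{-1} \left( n + n_{1/2} \right).$$

As a check on the interpretation of $n_{1/2}$, setting $n = n_{1/2}$ gives $t = 2 n_{1/2} / r_\infty$, so the achieved rate $n/t$ is $r_\infty / 2$.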

As Lubeck [7] points out, neither equation models the stripmining process used by compilers on register-to-register vector computers. Also, (1) and (2) may not reflect the behavior of cache-based machines under increasing vector lengths (see, for example, [1]). Nevertheless, for the purposes of our model we believe (1) and (2) to be sufficient.

By analogy with $r_{\infty}$, for each loop, we define three rates: $r_s$ for the optimized scalar code, $r_v$ for the vector code, and $r_o$ for optimal vector code for the target machine. These rates are defined in units of the number of iterations per second of the loop. We assume $r_s < r_o$, and we expect $r_s \leq r_v \leq r_o$, although (as the previous section indicated) it is possible to have $r_v < r_s$.

Using the scalar and vector data collected, we can solve (2), for each loop, for $r_s$ and $r_v$, respectively. Since we cannot necessarily assume $r_v = r_o$, we must estimate $r_o$. To do this, we assume that the execution time of a loop is determined by the basic operations in the loop. To determine the rate at which basic operations (e.g., addition or load) can be performed, we use the control loops, which we assume can be optimally vectorized.
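One way to solve (2) for a rate is a two-point fit. The sketch below assumes that exactly two timings are used, one per vector length, which is our reading rather than a stated procedure; the function name and the timings are hypothetical.

```python
def solve_hockney(n1, t1, n2, t2):
    """Solve t = (n + n_half) / r_inf from two (length, time) measurements.

    Returns (r_inf, n_half). r_inf is in iterations per unit of time,
    in whatever time unit t1 and t2 are given.
    """
    if n1 == n2 or t1 == t2:
        raise ValueError("need two distinct lengths with distinct times")
    r_inf = (n2 - n1) / (t2 - t1)   # reciprocal of the slope of t against n
    n_half = r_inf * t1 - n1        # from t1 = (n1 + n_half) / r_inf
    return r_inf, n_half


# Hypothetical timings, in microseconds, at vector lengths 100 and 1000.
r_inf, n_half = solve_hockney(100, 1.5, 1000, 10.5)
print(r_inf, n_half)
```

Applied once to the scalar timings and once to the vector timings of a loop, the same solve would yield $r_s$ and $r_v$.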

We divide the basic operations into classes. Each class contains operations that utilize a specific functional unit. For example, Table 11 lists the basic operations in each class for a generic computer with separate load and store pipes.

The list of which operations belong to which classes varies by vendor, primarily with respect to the memory operations. For example, on a machine with separate load and store pipes, the load and gather operations are in one class (they compete for the load pipe), and the store and scatter operations are in another class (they compete for the store pipe). For machines with only one pipe for all memory accesses, the four memory operations are all in the same class. Even though these operations all have their own execution rates, when they compete for the same resources they are in the same class.

To model control flow, we assume an “execute under mask” model in which every operation is assumed to be executed in vector mode, and the results of various control paths are merged together. Alternative strategies are possible, such as using compress and expand to perform arithmetic only where selected, but we found that execute under mask was sufficient for our purposes.
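The execute-under-mask assumption can be sketched as follows. The conditional loop here is hypothetical (it is not one of the test loops), and Python lists stand in for vector registers; the point is that both arms are computed at full vector length and then merged, so the model charges for the operations of every control path.

```python
def scalar_conditional(a, b):
    """Reference version: one branch per element."""
    c = []
    for x, y in zip(a, b):
        if x > 0.0:
            c.append(x + y)
        else:
            c.append(x * y)
    return c


def execute_under_mask(a, b):
    """Masked version: both arms run over all elements, then merge."""
    mask = [x > 0.0 for x in a]                  # vector compare
    then_vals = [x + y for x, y in zip(a, b)]    # full-length vector add
    else_vals = [x * y for x, y in zip(a, b)]    # full-length vector multiply
    # Merge: take the "then" result where the mask is set.
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]


a = [1.0, -2.0, 3.0, 0.0, -5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
print(execute_under_mask(a, b))
```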

On each computer, and for each loop $L$, we estimate its optimal execution rate $r_o$, using the algorithm shown in Figure 3. Here $C$ represents the set of classes defined for a particular computer, $o$ the operations in a class, $N_o$ the number of instances of $o$ in $L$, and $R_o$ the rate for operation $o$ (in units of operations per second) measured with the control loops. The algorithm assumes that operations in different classes execute concurrently while operations in the same class execute sequentially.

Table 11: Basic Operations by Class (generic computer with separate load and store pipes)

<table>
<thead>
<tr>
<th>Class</th>
<th>Operation</th>
</tr>
</thead>
<tbody>
<tr>
<td>0</td>
<td>Load</td>
</tr>
<tr>
<td>0</td>
<td>Gather (Load indirect)</td>
</tr>
<tr>
<td>1</td>
<td>Store</td>
</tr>
<tr>
<td>1</td>
<td>Scatter (Store indirect)</td>
</tr>
<tr>
<td>2</td>
<td>Arithmetic (Add, Multiply)</td>
</tr>
<tr>
<td>2</td>
<td>Reductions</td>
</tr>
</tbody>
</table>

*This table could be extended by subdividing classes into special cases. For example, the arithmetic class could be divided into separate addition and multiplication classes. For machines that can execute adds and multiplies concurrently (all machines in this study), these multiple functional units are modeled as simply a higher arithmetic-processing rate. The difference in execution times between computing the elementwise sum of three vectors and the elementwise product was insignificant for all computers. This fact is not surprising, since the rate-limiting step for almost all loops in the suite is memory references, and so this distinction would not change our results significantly.

This model is based on the notion of a resource limit, similar to the model used to calculate performance bounds in [9, 13]. We assume that for each loop there exists a particular class of operations that use the same functional unit and that the time to execute these operations provides a lower bound on the time to execute the loop. The algorithm in Figure 3 calculates that bound, and we use its reciprocal as $r_o$.
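The resource-limit calculation described above can be sketched as follows. This is a reconstruction from the prose, not a copy of Figure 3: the function name, the data layout, and the rates in the usage example are all hypothetical.

```python
def optimal_rate(classes, counts, rates):
    """Estimate r_o, in loop iterations per second, for one loop.

    classes: list of lists of operation names that share a functional unit (C)
    counts:  instances of each operation per loop iteration (N_o)
    rates:   measured rate of each operation, operations per second (R_o)
    """
    class_times = []
    for ops in classes:
        # Operations in the same class execute sequentially: times add up.
        class_times.append(
            sum(counts[o] / rates[o] for o in ops if counts.get(o, 0))
        )
    # Different classes execute concurrently: the slowest class is the bound.
    bound = max(class_times)
    if bound == 0:
        raise ValueError("loop has no modeled operations")
    return 1.0 / bound


# Hypothetical machine rates (operations per second) and a loop with
# two loads, one store, and one arithmetic operation per iteration.
rates = {"load": 200e6, "store": 100e6, "arith": 400e6}
counts = {"load": 2, "store": 1, "arith": 1}

separate_pipes = [["load", "gather"], ["store", "scatter"], ["arith", "reduction"]]
shared_pipe = [["load", "gather", "store", "scatter"], ["arith", "reduction"]]

r_sep = optimal_rate(separate_pipes, counts, rates)
r_shared = optimal_rate(shared_pipe, counts, rates)
print(r_sep, r_shared)
```

With these made-up rates the same loop is bounded at about 100 million iterations per second with separate pipes and about 50 million with a shared pipe, which shows why the class assignment is machine specific.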

In addition to measuring the basic vector operation rates, we also measure the basic scalar operation rates. For each loop, we then determine which operations can be executed in vector mode and which must be executed in scalar mode. We then modify the algorithm in Figure 3 to use the appropriate rate (vector or scalar) for $R_o$ for each operation.

For each loop and each vendor, we have now determined the three execution rates: $r_o$ using the algorithm given in Figure 3, and $r_s$ and $r_v$ using (2). All three rates were computed by using the data for vector lengths of 100 and 1000. We now define percent vectorization, $p_v$, by the formula

$$p_v = \frac{r_v - r_s}{r_o - r_s}.$$  (3)

With this definition, if a loop’s vector execution rate is the same as the scalar rate, $p_v = 0\%$, and if a loop’s vector execution rate is the same as the optimal vector execution rate, $p_v = 100\%$. We can now classify a loop as vectorized, partially vectorized, or not vectorized according to the value of $p_v$. We do this according to the rule

$$\text{Result} = \begin{cases}
    n & p_v < 10\% \\
    p & 10\% \leq p_v < 50\% \\
    v & 50\% \leq p_v .
\end{cases}$$  (4)

Figure 3: Algorithm for Estimating Optimal Execution Rate

7.2 Example

In this section we show an example of the computation of $p_v$ for two computers, $C_1$ and $C_2$. We assume that $C_1$ has two load pipes and a store pipe and that $C_2$ has one pipe used for both loads and stores.

The example used is the loop shown in Figure 1. For this loop we have the following profile of basic operations,* $N_o$:

<table>
<thead>
<tr>
<th>Load</th>
<th>Store</th>
<th>Gather</th>
<th>Scatter</th>
<th>Arithmetic</th>
<th>Reductions</th>
</tr>
</thead>
<tbody>
<tr>
<td>0, 2</td>
<td>0, 1</td>
<td>0, 0</td>
<td>0, 0</td>
<td>0, 1</td>
<td>0, 0</td>
</tr>
</tbody>
</table>

The first number in each pair is the number of scalar operations, and the second is the number of vectorizable operations. In this example, executing the loop requires two vector loads, one vector store, and a vector addition. No scalar operations are required (our model takes into account scalar operations that occur within the loop body, but not scalar operations, such as incrementing the loop control variable or testing for loop termination, that have to do with the loop control itself).

Using the results of the control loops, we have calculated the following basic vector operation rates. The units are millions of operations per second.

<table>
<thead>
<tr>
<th>Computer</th>
<th>Load</th>
<th>Store</th>
<th>Arithmetic</th>
</tr>
</thead>
<tbody>
<tr>
<td>$C_1$</td>
<td>227</td>
<td>150</td>
<td>269</td>
</tr>
<tr>
<td>$C_2$</td>
<td>186</td>
<td>207</td>
<td>286</td>
</tr>
</tbody>
</table>

Using these values and the loop profile above, we can estimate $r_o$ with the algorithm shown in Figure 3. The result of these calculations is that, for $C_1$, the optimal vector execution rate is 114 million iterations per second, and for $C_2$ it is 64 million iterations per second.
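The arithmetic can be reproduced under the resource-limit reading of the model, in which times within a class add and the slowest class sets the bound. This reading is our assumption, but it matches the quoted rates. With two loads, one store, and one addition per iteration, and times in microseconds per iteration:

$$C_1:\quad \max\left(\frac{2}{227},\ \frac{1}{150},\ \frac{1}{269}\right) = \max(0.00881,\ 0.00667,\ 0.00372) = 0.00881, \qquad r_o \approx \frac{1}{0.00881} \approx 114,$$

$$C_2:\quad \max\left(\frac{2}{186} + \frac{1}{207},\ \frac{1}{286}\right) = \max(0.01558,\ 0.00350) = 0.01558, \qquad r_o \approx \frac{1}{0.01558} \approx 64.$$

On $C_1$ the loads are the limiting class; on $C_2$ the loads and the store share one pipe, so their times add.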

Using the scalar and vector results for vector lengths 100 and 1000, we determined the following results for $r_s$ and $r_v$ by solving (2):

<table>
<thead>
<tr>
<th>Computer</th>
<th>$r_s$</th>
<th>$r_v$</th>
</tr>
</thead>
<tbody>
<tr>
<td>$C_1$</td>
<td>12.3</td>
<td>115.</td>
</tr>
<tr>
<td>$C_2$</td>
<td>10.7</td>
<td>19.8</td>
</tr>
</tbody>
</table>

Substituting the appropriate values for $r_s$, $r_v$, and $r_o$ into (3), we calculated $p_v = 100\%$ for $C_1$ and $p_v = 17\%$ for $C_2$. Applying (4), we determined that $C_1$ fully vectorizes this loop and that $C_2$ partially vectorizes this loop.
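Equations (3) and (4) can be applied to the example with a few lines of code. The function names are hypothetical. For $C_1$ the raw value of (3) comes out slightly above 1 because the measured $r_v$ exceeds the estimated $r_o$; capping the printed value at 100% is an assumption made here to match the figure quoted above.

```python
def percent_vectorization(r_s, r_v, r_o):
    """Equation (3): share of the scalar-to-optimal gap closed by the vector code."""
    if r_o <= r_s:
        raise ValueError("the model assumes r_s < r_o")
    return (r_v - r_s) / (r_o - r_s)


def classify(p_v):
    """Rule (4): n below 10%, p from 10% up to 50%, v at 50% and above."""
    if p_v < 0.10:
        return "n"
    if p_v < 0.50:
        return "p"
    return "v"


# (r_s, r_v, r_o) in millions of iterations per second, from the example.
example = {"C1": (12.3, 115.0, 114.0), "C2": (10.7, 19.8, 64.0)}

for name, (r_s, r_v, r_o) in example.items():
    p_v = percent_vectorization(r_s, r_v, r_o)
    # Display cap at 100% is an assumption, see the note above.
    print(name, f"{min(p_v, 1.0):.0%}", classify(p_v))
```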

7.3 Results

Table 12 is similar to Table 1, except here the number of loops vectorized or partially vectorized has been determined by applying (3) and (4) as opposed to analyzing the compiler’s output listing.

*From Appendix C.

Table 12: Loops Vectorized (classified by $p_v$ using (3) and (4), 122 loops)

<table>
<thead>
<tr>
<th>Computer</th>
<th>V</th>
<th>P</th>
<th>N</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C210</td>
<td>51.6 (63)</td>
<td>14.8 (18)</td>
<td>33.6 (41)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>36.1 (44)</td>
<td>24.6 (30)</td>
<td>39.3 (48)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>54.1 (66)</td>
<td>25.4 (31)</td>
<td>20.5 (25)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>45.1 (55)</td>
<td>8.2 (10)</td>
<td>46.7 (57)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>52.5 (64)</td>
<td>18.0 (22)</td>
<td>29.5 (36)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>53.3 (65)</td>
<td>13.1 (16)</td>
<td>33.6 (41)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>52.5 (64)</td>
<td>15.6 (19)</td>
<td>32.0 (39)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>51.6 (63)</td>
<td>18.9 (23)</td>
<td>29.5 (36)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>46.7 (57)</td>
<td>21.3 (26)</td>
<td>32.0 (39)</td>
</tr>
<tr>
<td>Average</td>
<td>49.3 (60)</td>
<td>17.8 (21)</td>
<td>33.0 (40)</td>
</tr>
</tbody>
</table>

Comparing Table 12 with Table 1, we observe that the results are mostly consistent, with a somewhat tighter grouping among vendors with the most loops vectorized. Most compilers vectorized between 20 and 30 loops that did not reach the threshold for full vector performance ($p_v \geq 50\%$). Appendix B contains complete results.

Casual inspection of the data indicates that there are a number of loops for which at most one vendor successfully achieved vector performance and all other vendors that vectorized did not. Approximately 23 loops account for most of the differences between the two measures of vectorization. For the most part, these loops are scattered across categories, but they include most of the scalar expansion loops, the search loops, the packing loops, and the loops with multiway branching.

Factors other than simple detection of vectorizability are reflected in the computation of vectorization percentages. In particular, traditional optimizations such as common subexpression elimination, register allocation, and instruction scheduling will all influence the quality of the generated code and hence the percentage of vectorization. In this sense, the percentage is more a measure of the overall quality of the compiler-generated code.

Optimal code generation and, in particular, instruction scheduling for very simple loops are extremely difficult. For loops with large bodies, heuristic algorithms will usually get within a small number of instructions of what is optimal. When the loop body contains only five or ten instructions, however, being off by a “small” number could cost 25% of achievable performance. Thus, since almost all of the loops in the suite are very simple, the compilers may perform substantially better on “real” codes than is suggested by Table 12.

That the measured execution rates are lower than what might be expected from “vector” code may be due to model limitations. For example, the model treats unit and nonunit stride vector accesses as equal in cost: there was no convincing evidence that nonunit stride was a factor worth adding to the operation classes listed in Table 11. The other major factor not modeled is the presence of a data cache, its size and its organization. This is discussed in Section 8.3.1.

One issue that biases the results presented here is that we use the measured performance on simple loops to calibrate the model. Thus our “optimal rates” may be significantly below “machine peaks” since those peaks may be achievable only assuming optimal compilation. Further, if the code generation capabilities of one compiler are generally poor compared with another, then its ability to vectorize may appear inflated, since our estimate of optimal execution rate may be too low. This situation can be corrected by replacing the control loops with numbers derived from hand-crafted assembly routines that would provide estimates of “achievable peaks.” We did not have the resources to generate these numbers for each machine.

8 Discussion

8.1 Validity of the Test Suite

How good is this test suite? The question can be answered in several ways, but we will address three specific areas: coverage, stress, and accuracy.

8.1.1 Coverage

By “coverage” we refer to how well the test suite represents typical, common, or important Fortran programming practices. We would like to assert that high effectiveness on the test suite will correspond to high effectiveness in general. Unfortunately, there is no accepted suite of Fortran programs that can be called representative, and so we have no quantitative way of determining the coverage of our suite. We believe, however, that the method used to select the tests has yielded reasonable coverage. This method consisted of two phases.

In the first phase, a large number of loops were collected from several vendors and interested parties. This gave a diverse set of viewpoints, each with a different machine architecture and hence somewhat different priorities. In some cases the loops represented “real” code from programs that had been benchmarked. The majority, however, were specifically written to test a vectorizing compiler for a particular feature. Independently, the categorization scheme used in Section 2 was developed based on experience and on published literature about vectorization.

In the second phase, the test suite was culled from the collected loops by classifying each loop into one or more categories and then selecting a few representative loops from each category. Our interest was in coverage; and since “representative” is not well defined, we made no attempt to weight some of the subcategories more than others by changing the number of loops. Where we felt that testing a subcategory required a range of situations, we included several loops; in other cases we felt that one or two loops sufficed. There is significant weighting between major categories. For example, the test suite places greater emphasis on basic vectorization (41 loops) than on idiom recognition (24 loops). This weighting was an artifact of the selected categories and was reflected in the original collection of samples. We felt that this weighting was reasonable and made no attempt to adjust it.

Table 13: Loops Sorted by Difficulty

Table 14: Loops Sorted by Difficulty, from [3]

8.1.2 Stress

By “stress” we refer to how effectively the test suite tests the limits of the compilers. We wish the test to be difficult but not impossible. Again there is no absolute metric against which we can measure the test suite, but we can use the performance of the compilers as a measure. Table 13 lists the results for the various compilers. In this table, each row corresponds to a particular compiler. Rows are sorted in order of decreasing full and partial vectorization (see Table 2). Each column corresponds to a particular loop, and the columns are sorted in order of increasing difficulty.

The loop scores at the bottom of Table 13 are based on the number of compilers that vectorized or partially vectorized the loop. Many of the loops are inherently only partially vectorizable, and so we have not attempted to weight full versus partial vectorization. We interpret a low score as an indication of a difficult test. From the table we observe a skewed distribution of results, with many of the loops “easy” (everyone vectorizes) and only a few “difficult” (only one or two vendors even partially vectorize).

Viewed from a historical perspective, the test appears less stressful now than it did originally. We can see this qualitatively from Table 14, which is reprinted from [3]. Here there seems to be a more balanced distribution of tests between “easy” and “difficult” when compared to Table 13. Statistics also support this view. In [3] the average number of loops vectorized was 55%, and vectorized or partially vectorized was 61%. Even if we restrict ourselves to just the eight vendors also participating in this test, the previous results are still only 59% and 64%, respectively. In this test the average number of loops vectorized was 70%, and vectorized or partially vectorized was 77%, an improvement of about 15 percentage points over the earlier averages.

Several factors may be at work. First, compilers have evolved and improved over time. Second, specialized third-party compiler technology is now readily available to interested vendors. Third, for various reasons approximately half the vendors who participated in the previous test did not participate this time. While those who did not participate span the spectrum of previous results, most had results in the lower or middle part of the previous test. While we added new loops to this test (and also deleted a small number), this does not seem to have provided adequate stress. Since one valid use of this test suite is for compiler writers to diagnose system deficiencies, we expect over time that the test will lose its effectiveness to stress compilers.

8.1.3 Accuracy

By “accuracy” we refer to how well the test can measure the quality of a vectorizing compiler. Since the difficulty of the tests was determined by the performance of the compilers, it would be circular now to judge the absolute quality of the compilers by their performance on this suite. What about relative performance? It is tempting to distill the results for each compiler into a single number and use that to compare the systems. Such an approach, however, is clearly incorrect, since these compilers cannot be compared in isolation from the machine environment and target application area for which they were designed.

We conclude that the suite represents reasonable coverage, that the stress may no longer be adequate, and that we cannot determine the accuracy of the suite.

8.2 Beating the Test

Some of the loops were vectorized in ways that defeated the intent of the test. One example is the use of a runtime test. If the compiler cannot determine at compile time whether a loop is safe to vectorize, because of, say, an unknown parameter value, it must either not vectorize the loop or else generate alternative versions of the code guarded by a runtime test. At runtime, based on the value of the unknown parameter, the test selects either the scalar or the vector version of the loop, as appropriate. In general, we view runtime testing as a good thing to do. It allows vectorization of loops that would not otherwise be vectorized and allows cost-effectiveness decisions to be deferred until runtime. However, it has a negative side. First, the cost of the test is incurred each time the loop is executed. Second, for large loop nests, it is possible to have a combinatorial explosion in the number of tests generated. All of the loops in our test suite can be determined to be vectorizable at compile time, and thus runtime testing is not necessary. The Cray Research, FPS, IBM, and NEC compilers, however, can generate runtime tests and in a few cases were able to “beat the test” this way.

A technique similar to runtime testing is conditional vectorization, which was used by the Cray Computer compiler. With conditional vectorization, a safe vector length* is calculated at runtime. While conditional vectorization is also a good capability for a compiler to have, it too has a negative side. First, there is the overhead involved in calculating the safe vector length at runtime. Second, if the calculated safe vector length is one, it is more efficient to execute a scalar instruction rather than a vector instruction. None of the loops in our test requires conditional vectorization. Nevertheless, the Cray Computer compiler conditionally vectorized 20 loops, 11 of which resulted in a safe vector length greater than one.
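Conditional vectorization can be sketched on a hypothetical loop with a dependence distance `k` known only at run time. The sketch takes the safe vector length to be that distance, which is our assumption about the definition; the function names are hypothetical, and list slices stand in for vector stores.

```python
def scalar_loop(a, b, k, lo, hi):
    """Reference semantics: a[i + k] = a[i] + b[i], one element at a time."""
    for i in range(lo, hi):
        a[i + k] = a[i] + b[i]


def safe_vector_length(k, trip_count):
    """Longest strip that cannot violate a dependence of distance k."""
    if k <= 0:
        return trip_count        # no loop-carried flow dependence
    return min(k, trip_count)


def conditionally_vectorized_loop(a, b, k, lo, hi):
    vl = safe_vector_length(k, hi - lo)    # computed at run time
    if vl <= 1:
        # A vector of length one costs more than the scalar instruction.
        scalar_loop(a, b, k, lo, hi)
        return vl
    for start in range(lo, hi, vl):
        stop = min(start + vl, hi)
        strip = [a[i] + b[i] for i in range(start, stop)]   # vector load, add
        a[start + k:stop + k] = strip                        # vector store
    return vl


def run(loop, k):
    """Apply a loop version to fresh hypothetical data and return the result."""
    a = [float(i) for i in range(12)]
    b = [1.0] * 12
    loop(a, b, k, 3, 9)
    return a


print([safe_vector_length(k, 6) for k in (-1, 1, 2, 3)])
```

In this sketch the fallback to scalar code for a length of one is explicit; the nine CRAY-2 loops discussed in Section 6 show what happens when vector instructions are issued at that length instead.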

Another way compilers defeated the intent of the test was by their ability to vectorize recurrences, using either library routines or special hardware. Several of the tests call for the compiler to split up a loop (loop distribution, node splitting) or change the order of a loop nest (loop interchange) in order to vectorize a loop containing an “unvectorizable” recurrence. Several of the compilers – notably those from Fujitsu, Hitachi, and NEC – were able to directly vectorize some of these loops.

We emphasize that “beating the test” is not a bad thing. While there may be more efficient ways to vectorize the loops, the techniques above are beneficial.

8.3 Caveats

We caution that the results presented here test only one aspect of a compiler and should in no way be used to judge the overall performance of a vectorizing compiler or computer system. The results reflect only a limited spectrum of Fortran constructs. We do not claim these loops are representative of a “real” workload, just that they make an interesting test. Some additional factors are discussed below.

8.3.1 Cache Effects

Two issues may impact machines with data caches. First, to ensure a large enough granularity for timing purposes, we included a repetition loop around the loop of interest. While considered a necessary evil for test purposes, this artificial repetition raises an important question about data locality. The concern is that a cache machine will benefit from the reuse of data loaded into cache on the first trip through the repetition loop and that additional references to main memory will not be necessary.

The second issue concerns the data set size relative to the cache size. A small data set will always fit in the cache. A large data set may not fit in the cache and will cause many performance-degrading cache misses to occur. The paper by Abu-Sufah and Maloney [1] contains a discussion of this issue and its impact on performance. Their uniprocessor performance results on an Alliant FX/8 show that there is only a narrow range of vector lengths for which optimal performance was achieved. Our choice of 10, 100, and 1000 as the vector lengths was somewhat intuitive and was not made with any particular cache size in mind.

---

* See Section 5.1

Original loop (left):

do i = 2,n
  a(i) = a(i) + b(i)
  b(i) = b(i-1)*b(i-1)*a(i)
  a(i) = a(i) - b(i)
enddo

Transformed loops (right):

do vector i = 2,n
  a(i) = a(i) + b(i)
enddo

do i = 2,n
  b(i) = b(i-1)*b(i-1)*a(i)
enddo

do vector i = 2,n
  a(i) = a(i) - b(i)
enddo

Figure 4:

8.3.2 Loop Granularity

Because of the small granularity of our loops (at most a few statements), the speedups achievable with a certain technique may not be achievable on our particular loops. As an example, vectorizing the loop shown on the left in Figure 4 requires splitting the loop into two vectorizable loops and one scalar loop containing the nonlinear recurrence, as shown on the right.

For this transformation to be successful, there needs to be enough work in the loop to justify the two additional loop overheads introduced and the extra loads and stores which are not required in the original loop. For this loop, inspection of the compiler listing showed that eight of the nine compilers had partially vectorized the loop, but only three achieved more than 15% of the estimated optimal performance, and only one achieved more than 50%.
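The transformation in Figure 4 can be checked with a small sketch. Python lists stand in for the Fortran arrays, with index 0 playing the role of element 1, and the list comprehensions stand in for the two vector loops. The input values are hypothetical.

```python
def original(a, b):
    """The single loop on the left of Figure 4."""
    a, b = a[:], b[:]
    for i in range(1, len(a)):
        a[i] = a[i] + b[i]
        b[i] = b[i - 1] * b[i - 1] * a[i]
        a[i] = a[i] - b[i]
    return a, b


def distributed(a, b):
    """The three loops on the right of Figure 4."""
    a, b = a[:], b[:]
    n = len(a)
    # Vectorizable: each a[i] depends only on the old a[i] and b[i].
    a[1:] = [a[i] + b[i] for i in range(1, n)]
    # Scalar: nonlinear recurrence, b[i] depends on b[i - 1].
    for i in range(1, n):
        b[i] = b[i - 1] * b[i - 1] * a[i]
    # Vectorizable: uses the completed b.
    a[1:] = [a[i] - b[i] for i in range(1, n)]
    return a, b


a0 = [1.0, 2.0, 3.0, 4.0]
b0 = [0.5, 0.25, 0.125, 0.0625]
print(original(a0, b0) == distributed(a0, b0))
```

The distributed version traverses `a` three times and `b` three times where the original traverses each once, which is the extra load, store, and loop overhead that the short test loops cannot amortize.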

8.3.3 Hardware and Software

Some of the loops are really tests of the underlying hardware and may not accurately reflect the ability of the compiler itself. For example, in the statement A(I)=B(INDEX(I)) a compiler may detect the indirect addressing of array B but not generate vector instructions because the computer does not have hardware support for array references of this form. Other examples are loops containing IF tests that may require mask registers, or recurrences that require special library routines.

Several of the computers tested are multiprocessors whose compilers support the generation of both parallel and vector code. Our test involved strictly uniprocessors and may have penalized vendors who have put considerable effort into parallel execution. On some of these machines, parallel execution may be more efficient than vectorization for certain loops.

Another example where the computer architecture may influence the compiler is on machines that have a data cache. Compilers for such machines may concentrate on loop transformations that improve data locality at the expense of adding “simple” vectorization capabilities.

Several vendors have sophisticated tools to aid the user in vectorization. For example, both Fujitsu and NEC offer vectorization tools that interactively assist the user in vectorizing a program. Another example is an interprocedural analysis compiler from Convex, which analyzes an entire program at once. While all are very sophisticated tools, their use was against our rules.

9 Conclusions and Future Work

Our results indicate that most of the compilers tested are fairly sophisticated, able to use a wide variety of vectorization techniques and transformations. Loops that were considered challenging several years ago, such as indirect addressing or vectorizing loops containing multiple IF tests, now seem routine. While there are still various vectorization challenges left to be met, we are not sure how much they will be addressed in the future. Our perception is that most current compiler work is going into memory hierarchy management, parallel loop generation, highly pipelined scalar processors, and interactive and interprocedural tools. We may well be nearing a plateau as far as how much additional work vendors will put into vectorization techniques alone.

Our test suite continues to evolve from simple inspection of the compiler’s output listing to trying to judge the quality of the execution results. To make the test more meaningful, we plan to add the types of “real” loops found in applications. Real loops present combinations of vectorization problems rather than individual challenges. It will then be interesting to compare results on the “simple” loops with those on the real loops.

A copy of the source code used in the test is available from the NETLIB electronic mail facility [4] at Oak Ridge National Laboratory. To receive a copy of the code, send electronic mail to netlib@ornl.gov. In the mail message, type send vectors from benchmark or send vectord from benchmark to get either the REAL or DOUBLE PRECISION version, respectively.

Acknowledgments

We thank John Levesque, Murray Richman, Steve Wallach, Joel Williamson, and Michael Wolfe for providing many of the loops used in this test. Thanks also to the many people involved in running this test and in providing results and constructive feedback on earlier versions.

References


Appendix A

Table 15: Hardware and Software used in this test.

<table>
<thead>
<tr>
<th>Company</th>
<th>Computer</th>
<th>Compiler Version</th>
<th>Compiler Options</th>
<th>OS Version</th>
<th>Main/Cache Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX Computer Corp.</td>
<td>CONVEX C210</td>
<td>fc 6.1</td>
<td>-O2 -uo -is</td>
<td>OS 9.0</td>
<td>512MB/None</td>
</tr>
<tr>
<td>Cray Computer Corp.</td>
<td>CRAY-2</td>
<td>cft77 4.0.1.1</td>
<td>Defaults</td>
<td>UNICOS 6.0</td>
<td>1GB/None</td>
</tr>
<tr>
<td>Cray Research, Inc.</td>
<td>CRAY Y-MP</td>
<td>CF77 4.0</td>
<td>-Wd&quot;-e78b&quot;</td>
<td>UNICOS 5.1</td>
<td>1GB/None</td>
</tr>
<tr>
<td>Digital Equipment Corp.</td>
<td>VAXvector 9000-210</td>
<td>FORTRAN V5.5, HPO V1.0</td>
<td>/HPO/VECTOR/BLAS=(INLINE, MAPPED)/ASSUME=NOACC/OPT</td>
<td>VMS 5.4</td>
<td>512MB/128KB</td>
</tr>
<tr>
<td>FPS Computing</td>
<td>FPS M511EA-2</td>
<td>f77 4.3</td>
<td>-u -O -Oc inl+ -Oc vec+ -Oc pi+</td>
<td>FPK, 4.3.2</td>
<td>256MB/64KB</td>
</tr>
<tr>
<td>Fujitsu</td>
<td>VP2600/10</td>
<td>Fortran77EX/VP V11L10</td>
<td>VP(2600),OPT(F),INLINE(EXT(S151S)) VMSG(DETAIL)</td>
<td>OSIV/MSP AFII &amp; VPCF V10L10</td>
<td>1GB/None</td>
</tr>
<tr>
<td>Hitachi</td>
<td>S-820/80</td>
<td>for77/hap V24-0f</td>
<td>sopf,xfunc(xfr), hap(model80,vist),uinlined</td>
<td>vos3/as jss4 01-02</td>
<td>512MB/256KB</td>
</tr>
<tr>
<td>IBM Corp.</td>
<td>IBM 3090-600J</td>
<td>VS FORTRAN 2.4.0, VAST-2 R2</td>
<td>vopt(opton=r8 inline=s151s,s152s) copt(opt(3) vec(rep(xlist)))</td>
<td>MVS/ESA SP3.1.0E, JES2/SP3.1.1</td>
<td>256MB/256KB</td>
</tr>
<tr>
<td>NEC Corp.</td>
<td>SX-X/14</td>
<td>f77sx 010</td>
<td>-pi *:s151s *:s152s</td>
<td>SUPER-UX R1.11</td>
<td>1GB/64KB</td>
</tr>
</tbody>
</table>

Appendix B

The tables below contain the results, on a loop-by-loop basis, for the 122 loops in the test suite. For each loop three columns of data are given. The first column is the result according to the compiler listing: vectorized (v), partially vectorized (p), or not vectorized (n), as described in Section 5.

The second and third columns were calculated with the model described in Section 7. The second column was calculated by applying (4) to \( p_v \). The third column (enclosed in parentheses) is \( p_v \), the percentage of vectorization, calculated with equation (3). An entry of “(D)” indicates that the vector execution rate was less than the scalar execution rate. Occasionally, the \( p_v \) value will be higher than 100%. Some of these loops — notably s176, s352, and s4116 — seem to have uniformly higher values, while most of the others are scattered throughout the test suite. We have no general explanation for these cases. Two possibilities are limitations of the model or measurement error.

\( ^a \) These loops were conditionally vectorized. For loops with ambiguous subscripts, a runtime test was compiled which selected a safe vector length.

\( ^b \) These loops were executed in scalar mode. The compiler indicated that partial vectorization was possible but that the overhead was too large.

\( ^c \) For these loops a runtime IF-THEN-ELSE test was compiled which executed either a scalar loop or a vectorized loop.

\( ^* \) We estimate slightly less than two decimal digits of significance in the timing information collected. Hence, the percentage of vectorization calculations may have error terms of approximately 10% to 15%.
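The runtime tests mentioned in notes a and c can be illustrated with a loop whose subscript offset `k` is unknown at compile time. The sketch below is a hypothetical example of the general idea, not the code any of the tested compilers generated; the subroutine names and the loop itself are assumptions made for illustration. When `k >= 0` the loop carries no true dependence and can run as a single vector operation; when `k < 0` the dependence distance is `-k`, so strips of at most `-k` elements are safe.

```fortran
module runtime_dependence_sketch
  implicit none
contains

  ! Reference: the serial loop with an ambiguous subscript a(i+k).
  subroutine update_scalar(n, k, a, b)
    integer, intent(in)    :: n, k
    real,    intent(inout) :: a(n)
    real,    intent(in)    :: b(n)
    integer :: i, lo, hi
    lo = max(1, 1 - k)                  ! keep i+k inside 1..n
    hi = min(n, n - k)
    do i = lo, hi
      a(i) = a(i+k) + b(i)
    end do
  end subroutine update_scalar

  ! Versioned form: a runtime test on k selects a safe vector length.
  subroutine update_versioned(n, k, a, b)
    integer, intent(in)    :: n, k
    real,    intent(inout) :: a(n)
    real,    intent(in)    :: b(n)
    integer :: is, ie, lo, hi, vl
    lo = max(1, 1 - k)
    hi = min(n, n - k)
    if (k >= 0) then
      ! Reads are at or ahead of the writes: one full-length vector operation.
      a(lo:hi) = a(lo+k:hi+k) + b(lo:hi)
    else
      ! Recurrence of distance -k: each strip reads only earlier strips.
      vl = -k
      do is = lo, hi, vl
        ie = min(is + vl - 1, hi)
        a(is:ie) = a(is+k:ie+k) + b(is:ie)
      end do
    end if
  end subroutine update_versioned

end module runtime_dependence_sketch
```

For `k = -1` the safe vector length is 1 and the strip loop degenerates to the scalar loop, which is one reason a runtime test does not guarantee a speedup.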

<table>
<thead>
<tr>
<th>Computer</th>
<th>s111</th>
<th>s112</th>
<th>s113</th>
<th>s114</th>
<th>s115</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>v v</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>p p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>p p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s116</th>
<th>s118</th>
<th>s119</th>
<th>s121</th>
<th>s122</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n</td>
<td>n n</td>
<td>v a</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v n</td>
<td>v n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>p p</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v p</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v</td>
<td>v n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s123</th>
<th>s124</th>
<th>s125</th>
<th>s126</th>
<th>s127</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>p n (2)</td>
<td>p n (2)</td>
<td>v v (95)</td>
<td>p n (0)</td>
<td>v v (99)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v v (99)</td>
<td>n^a n (D)</td>
<td>v v (56)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>n n (0)</td>
<td>v p (37)</td>
<td>v v (122)</td>
<td>v v (64)</td>
<td>v v (76)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n (D)</td>
<td>n n (D)</td>
<td>v p (45)</td>
<td>n n (0)</td>
<td>v v (119)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>n n (0)</td>
<td>v v (68)</td>
<td>v v (111)</td>
<td>v p (41)</td>
<td>v v (88)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>n n (0)</td>
<td>p^b n (0)</td>
<td>v v (102)</td>
<td>p^b n (D)</td>
<td>p^b n (0)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>n n (D)</td>
<td>n n (D)</td>
<td>v v (100)</td>
<td>p (13)</td>
<td>v v (51)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>n n (0)</td>
<td>v p (46)</td>
<td>v v (58)</td>
<td>v v (55)</td>
<td>v v (74)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v v (103)</td>
<td>v n (5)</td>
<td>n n (0)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s128</th>
<th>s131</th>
<th>s132</th>
<th>s141</th>
<th>s151</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v (100)</td>
<td>v v (100)</td>
<td>v v (91)</td>
<td>n n (0)</td>
<td>n n (D)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v p (35)</td>
<td>v v (98)</td>
<td>v^a v (66)</td>
<td>n n (0)</td>
<td>v^a v (68)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v (74)</td>
<td>v v (76)</td>
<td>v v (110)</td>
<td>n n (0)</td>
<td>v v (78)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v (53)</td>
<td>v v (110)</td>
<td>v v (108)</td>
<td>n n (1)</td>
<td>n n (4)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>n n (0)</td>
<td>v v (86)</td>
<td>v v (86)</td>
<td>n n (0)</td>
<td>v v (86)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>p^b n (0)</td>
<td>v v (105)</td>
<td>v v (106)</td>
<td>p n (0)</td>
<td>v v (106)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>n n (D)</td>
<td>v v (104)</td>
<td>v v (103)</td>
<td>p n (D)</td>
<td>n n (D)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>n n (0)</td>
<td>v v (101)</td>
<td>v v (95)</td>
<td>n n (1)</td>
<td>v v (101)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>p n (D)</td>
<td>v^c v (83)</td>
<td>v^c v (79)</td>
<td>p n (0)</td>
<td>v v (84)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s152</th>
<th>s161</th>
<th>s162</th>
<th>s171</th>
<th>s172</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v (99)</td>
<td>p n (4)</td>
<td>p n (4)</td>
<td>v v (98)</td>
<td>v v (99)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v^a v (73)</td>
<td>v v (78)</td>
<td>v v (79)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v (118)</td>
<td>n n (0)</td>
<td>n^c n (D)</td>
<td>v v (104)</td>
<td>v v (104)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n (D)</td>
<td>v v (127)</td>
<td>v v (100)</td>
<td>n n (0)</td>
<td>v v (110)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v (96)</td>
<td>n n (0)</td>
<td>n^c n (0)</td>
<td>v v (85)</td>
<td>v v (86)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>n n (0)</td>
<td>v v (147)</td>
<td>p^b n (0)</td>
<td>v v (102)</td>
<td>v v (105)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v (114)</td>
<td>n n (D)</td>
<td>p n (D)</td>
<td>n n (D)</td>
<td>n n (D)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v (100)</td>
<td>v v (78)</td>
<td>n^c n (0)</td>
<td>v v (100)</td>
<td>v v (100)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v (79)</td>
<td>n n (0)</td>
<td>v^c v (98)</td>
<td>v^c v (64)</td>
<td>v v (84)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s173</th>
<th>s174</th>
<th>s175</th>
<th>s176</th>
<th>s211</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>n n (8)</td>
<td>v v (200)</td>
<td>v v (98)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v^a v (68)</td>
<td>v^a v (68)</td>
<td>v^a v (68)</td>
<td>v (57)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v^c v (83)</td>
<td>v v (84)</td>
<td>v v (81)</td>
<td>v v (106)</td>
<td>v v (94)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n p (12)</td>
<td>v v (118)</td>
<td>n n (1)</td>
<td>v v (207)</td>
<td>v v (132)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v^c v (85)</td>
<td>v v (86)</td>
<td>v v (86)</td>
<td>v v (121)</td>
<td>v v (93)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>n n (0)</td>
<td>v v (104)</td>
<td>v v (104)</td>
<td>v v (187)</td>
<td>v v (128)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>n n (D)</td>
<td>n n (D)</td>
<td>n n (D)</td>
<td>v v (175)</td>
<td>v v (91)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v^c v (102)</td>
<td>v v (107)</td>
<td>v v (100)</td>
<td>v v (58)</td>
<td>v v (81)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v^c v (96)</td>
<td>n n (0)</td>
<td>v v (82)</td>
<td>v v (106)</td>
<td>v v (64)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s212</th>
<th>s221</th>
<th>s222</th>
<th>s231</th>
<th>s232</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v  (98)</td>
<td>p p (11)</td>
<td>p n (6)</td>
<td>v v (85)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>n^a n (D)</td>
<td>n^a n (D)</td>
<td></td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v (92)</td>
<td>p p (38)</td>
<td>p v (65)</td>
<td>v v (80)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v (132)</td>
<td>p n (5)</td>
<td>p n (D)</td>
<td>v v (65)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v (107)</td>
<td>v v (72)</td>
<td>p p (38)</td>
<td>v v (87)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v v (107)</td>
<td>v v (62)</td>
<td>p n (0)</td>
<td>v v (132)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v (98)</td>
<td>v v (160)</td>
<td>p n (8)</td>
<td>v v (98)</td>
<td>n n (D)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v (100)</td>
<td>p p (16)</td>
<td>p p (47)</td>
<td>v v (65)</td>
<td>n n (2)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v (81)</td>
<td>v p (29)</td>
<td>p n (10)</td>
<td>v v (87)</td>
<td>n n (0)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s233</th>
<th>s234</th>
<th>s235</th>
<th>s241</th>
<th>s242</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v p (39)</td>
<td>n n (0)</td>
<td>v v (114)</td>
<td>v v (99)</td>
<td>p n (5)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>p^a n (D)</td>
<td>n^a n (D)</td>
<td>n^a n (D)</td>
<td>n n (0)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>p n (4)</td>
<td>n n (2)</td>
<td>v v (78)</td>
<td>v v (52)</td>
<td>p p (39)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v n (3)</td>
<td>n n (0)</td>
<td>v v (78)</td>
<td>v v (81)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>p n (2)</td>
<td>n n (0)</td>
<td>v v (128)</td>
<td>v v (65)</td>
<td>v v (96)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>p n (0)</td>
<td>n n (0)</td>
<td>v v (197)</td>
<td>v v (107)</td>
<td>v v (64)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v p (15)</td>
<td>n n (0)</td>
<td>v v (99)</td>
<td>v v (71)</td>
<td>v v (165)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>p n (D)</td>
<td>v v (66)</td>
<td>v v (89)</td>
<td>v v (67)</td>
<td>p p (25)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v p (30)</td>
<td>n n (0)</td>
<td>v v (104)</td>
<td>v p (32)</td>
<td>v p (30)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s243</th>
<th>s244</th>
<th>s251</th>
<th>s252</th>
<th>s253</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v (98)</td>
<td>v v (81)</td>
<td>v v (95)</td>
<td>v p (50)</td>
<td>v v (98)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v v (94)</td>
<td>n n (0)</td>
<td>v p (22)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v (67)</td>
<td>p n (4)</td>
<td>v v (79)</td>
<td>p n (8)</td>
<td>v p (24)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v p (48)</td>
<td>v v (51)</td>
<td>v v (117)</td>
<td>n n (1)</td>
<td>v v (71)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v (66)</td>
<td>p n (D)</td>
<td>v v (82)</td>
<td>p n (0)</td>
<td>v v (62)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v v (104)</td>
<td>v v (105)</td>
<td>v v (153)</td>
<td>p^b n (0)</td>
<td>v v (84)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v (87)</td>
<td>p n (D)</td>
<td>v v (93)</td>
<td>v p (41)</td>
<td>v v (92)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v (52)</td>
<td>v p (29)</td>
<td>v v (74)</td>
<td>v p (12)</td>
<td>v v (80)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v (54)</td>
<td>n n (0)</td>
<td>v v (95)</td>
<td>p n (D)</td>
<td>v p (35)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s254</th>
<th>s255</th>
<th>s256</th>
<th>s257</th>
<th>s258</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v p (42)</td>
<td>v p (24)</td>
<td>n n (D)</td>
<td>v n (1)</td>
<td>p n (8)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v p (13)</td>
<td>n n (0)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v p (37)</td>
<td>p n (3)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n (0)</td>
<td>n n (1)</td>
<td>p n (0)</td>
<td>n n (0)</td>
<td>n n (D)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v p (11)</td>
<td>p n (4)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>p^b n (0)</td>
<td>p^b n (0)</td>
<td>v n (5)</td>
<td>v p (16)</td>
<td>p^b n (0)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v p (36)</td>
<td>v p (25)</td>
<td>v n (5)</td>
<td>v p (14)</td>
<td>n n (D)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v p (27)</td>
<td>v p (12)</td>
<td>n n (0)</td>
<td>v n (D)</td>
<td>p n (9)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v n (3)</td>
<td>n n (0)</td>
<td>n n (0)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s261</th>
<th>s271</th>
<th>s272</th>
<th>s273</th>
<th>s274</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>p n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>p n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s275</th>
<th>s276</th>
<th>s277</th>
<th>s278</th>
<th>s279</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>p n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>n n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>p v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>p p</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>n n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>p n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s2710</th>
<th>s2711</th>
<th>s2712</th>
<th>s281</th>
<th>s291</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>p n</td>
<td>v p</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
<td>v p</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v n</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
<td>n n</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>p n</td>
<td>p n</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v n</td>
<td>v v</td>
<td>v v</td>
<td>p n</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>p n</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s292</th>
<th>s293</th>
<th>s2101</th>
<th>s2102</th>
<th>s2111</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v p</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n</td>
<td>n a</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>n n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>p b</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s311</th>
<th>s312</th>
<th>s313</th>
<th>s314</th>
<th>s315</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v  (100)</td>
<td>v v  (52)</td>
<td>v v  (73)</td>
<td>v v  (91)</td>
<td>v v  (88)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v v  (99)</td>
<td>v v  (98)</td>
<td>v v  (73)</td>
<td>v p  (28)</td>
<td>n n  (0)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v  (100)</td>
<td>v v  (100)</td>
<td>v v  (163)</td>
<td>v p  (38)</td>
<td>v p  (27)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v  (104)</td>
<td>v v  (103)</td>
<td>v v  (98)</td>
<td>v v  (67)</td>
<td>v v  (72)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v  (100)</td>
<td>v p  (44)</td>
<td>v v  (107)</td>
<td>v v  (100)</td>
<td>v p  (45)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v v  (98)</td>
<td>v n  (0)</td>
<td>v v  (179)</td>
<td>v v  (99)</td>
<td>v v  (100)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v  (100)</td>
<td>v n  (0)</td>
<td>v v  (117)</td>
<td>v v  (93)</td>
<td>v v  (92)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v  (100)</td>
<td>n n  (0)</td>
<td>v v  (97)</td>
<td>v n  (5)</td>
<td>v n  (5)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v  (102)</td>
<td>v v  (70)</td>
<td>v v  (124)</td>
<td>v p  (40)</td>
<td>v p  (25)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s316</th>
<th>s317</th>
<th>s318</th>
<th>s319</th>
<th>s3110</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v  (90)</td>
<td>v p  (43)</td>
<td>n n  (0)</td>
<td>v v  (98)</td>
<td>n n  (0)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v p  (28)</td>
<td>v v  (75)</td>
<td>n n  (0)</td>
<td>v v  (95)</td>
<td>n n  (0)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v p  (38)</td>
<td>v v  (84)</td>
<td>v p  (42)</td>
<td>v v  (67)</td>
<td>p n  (D)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v  (71)</td>
<td>v v  (469)</td>
<td>v p  (45)</td>
<td>p p  (21)</td>
<td>n n  (D)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v  (100)</td>
<td>n n  (0)</td>
<td>v p  (47)</td>
<td>v v  (58)</td>
<td>v p  (23)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v v  (99)</td>
<td>v n  (D)</td>
<td>v v  (132)</td>
<td>p n  (0)</td>
<td>v v  (99)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v  (89)</td>
<td>v n  (D)</td>
<td>v v  (88)</td>
<td>v v  (78)</td>
<td>p n  (D)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v n  (0)</td>
<td>n n  (3)</td>
<td>n n  (0)</td>
<td>n n  (0)</td>
<td>v n  (6)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v p  (40)</td>
<td>v v  (123)</td>
<td>v p  (26)</td>
<td>v v  (54)</td>
<td>v p  (25)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s3111</th>
<th>s3112</th>
<th>s3113</th>
<th>s321</th>
<th>s322</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v  (85)</td>
<td>n n  (0)</td>
<td>n n  (0)</td>
<td>n n  (D)</td>
<td>n n  (6)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v p  (21)</td>
<td>n n  (0)</td>
<td>v p  (35)</td>
<td>n n  (0)</td>
<td>n n  (0)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v p  (41)</td>
<td>n n  (0)</td>
<td>v v  (68)</td>
<td>v p  (17)</td>
<td>v p  (45)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v  (80)</td>
<td>n n  (4)</td>
<td>v v  (62)</td>
<td>n n  (3)</td>
<td>n n  (0)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v  (78)</td>
<td>n n  (0)</td>
<td>v v  (110)</td>
<td>v v  (104)</td>
<td>v p  (31)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v v  (125)</td>
<td>v n  (1)</td>
<td>v v  (130)</td>
<td>v v  (183)</td>
<td>n n  (4)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v  (75)</td>
<td>v n  (6)</td>
<td>v v  (90)</td>
<td>v v  (184)</td>
<td>n n  (D)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v p  (47)</td>
<td>n n  (4)</td>
<td>v n  (8)</td>
<td>n n  (8)</td>
<td>n p  (10)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v p  (36)</td>
<td>n n  (0)</td>
<td>v p  (35)</td>
<td>v v  (113)</td>
<td>n n  (0)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s323</th>
<th>s331</th>
<th>s332</th>
<th>s341</th>
<th>s342</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>p n  (2)</td>
<td>v p  (49)</td>
<td>n n  (0)</td>
<td>v p  (20)</td>
<td>v p  (21)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n  (0)</td>
<td>v p  (40)</td>
<td>v p  (20)</td>
<td>v p  (17)</td>
<td>v p  (29)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>n n  (0)</td>
<td>v p  (46)</td>
<td>v p  (29)</td>
<td>v p  (27)</td>
<td>v p  (19)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n  (5)</td>
<td>n n  (D)</td>
<td>n n  (D)</td>
<td>n n  (D)</td>
<td>n n  (D)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>n n  (0)</td>
<td>v n  (6)</td>
<td>n n  (0)</td>
<td>n n  (0)</td>
<td>n n  (0)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>p b  (10)</td>
<td>v p  (20)</td>
<td>v p  (44)</td>
<td>v p  (44)</td>
<td>v p  (30)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v  (326)</td>
<td>v p  (32)</td>
<td>v p  (43)</td>
<td>v v  (81)</td>
<td>v v  (65)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>n n  (7)</td>
<td>n n  (0)</td>
<td>v v  (70)</td>
<td>v v  (53)</td>
<td>v p  (12)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v  (56)</td>
<td>v p  (11)</td>
<td>n n  (0)</td>
<td>v p  (16)</td>
<td>v p  (16)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s343</th>
<th>s351</th>
<th>s352</th>
<th>s353</th>
<th>s411</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v n</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>n n</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>v p</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>v p</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>n n</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>n n</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>n n</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>n n</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>n n</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v p</td>
<td>n n</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s412</th>
<th>s413</th>
<th>s414</th>
<th>s415</th>
<th>s421</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>n n</td>
<td>v v</td>
<td>p n</td>
<td>n n</td>
<td>v v</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v p</td>
<td>v v</td>
<td>n n</td>
<td>n n</td>
<td>v v</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
<td>v v</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n</td>
<td>n n</td>
<td>n n</td>
<td>n n</td>
<td>v v</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
<td>v v</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>n n</td>
<td>n n</td>
<td>v p</td>
<td>n n</td>
<td>v v</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>n n</td>
<td>n n</td>
<td>v p</td>
<td>n n</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>n n</td>
<td>n n</td>
<td>v n</td>
<td>n n</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s422</th>
<th>s423</th>
<th>s424</th>
<th>s431</th>
<th>s432</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>n n</td>
<td>n n</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>n n</td>
<td>n n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v</td>
<td>v v</td>
<td>n n</td>
<td>v v</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s441</th>
<th>s442</th>
<th>s443</th>
<th>s451</th>
<th>s452</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>n n</td>
<td>n n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v n</td>
<td>v n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n</td>
<td>v n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v n</td>
<td>v n</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v n</td>
<td>v n</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v p</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
<td>v v</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s453</th>
<th>s471</th>
<th>s481</th>
<th>s482</th>
<th>s491</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v (61)</td>
<td>n n (0)</td>
<td>n n (1)</td>
<td>n n (1)</td>
<td>v v (118)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v p (45)</td>
<td>n n (0)</td>
<td>v v (56)</td>
<td>v v (56)</td>
<td>v v (123)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v p (41)</td>
<td>n n (0)</td>
<td>v v (59)</td>
<td>v v (70)</td>
<td>v v (106)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>n n (0)</td>
<td>v n (D)</td>
<td>n n (D)</td>
<td>n n (D)</td>
<td>v v (149)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v p (36)</td>
<td>n n (2)</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v v (131)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v n (2)</td>
<td>v v (58)</td>
<td>v v (97)</td>
<td>v n (0)</td>
<td>v v (66)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v n (9)</td>
<td>n n (0)</td>
<td>v v (82)</td>
<td>n n (0)</td>
<td>v v (122)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v p (18)</td>
<td>v v (115)</td>
<td>v v (90)</td>
<td>v v (68)</td>
<td>v v (130)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>n n (0)</td>
<td>v v (71)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s4112</th>
<th>s4113</th>
<th>s4114</th>
<th>s4115</th>
<th>s4116</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v v (99)</td>
<td>v v (125)</td>
<td>v v (119)</td>
<td>v v (79)</td>
<td>v v (112)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v v (97)</td>
<td>v v (107)</td>
<td>v v (115)</td>
<td>v v (86)</td>
<td>v v (185)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v v (74)</td>
<td>v v (85)</td>
<td>v v (124)</td>
<td>v v (61)</td>
<td>v p (41)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v (112)</td>
<td>v v (138)</td>
<td>v v (109)</td>
<td>v v (128)</td>
<td>v v (150)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v v (109)</td>
<td>v v (68)</td>
<td>v v (151)</td>
<td>v v (131)</td>
<td>v v (175)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v v (105)</td>
<td>v v (117)</td>
<td>v v (89)</td>
<td>v v (65)</td>
<td>v v (103)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v (77)</td>
<td>v v (140)</td>
<td>v v (123)</td>
<td>v v (73)</td>
<td>v v (132)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v v (108)</td>
<td>v v (63)</td>
<td>v v (101)</td>
<td>v v (101)</td>
<td>v v (181)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v v (99)</td>
<td>v v (100)</td>
<td>v v (100)</td>
<td>v v (100)</td>
<td>v v (100)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Computer</th>
<th>s4117</th>
<th>s4121</th>
</tr>
</thead>
<tbody>
<tr>
<td>CONVEX C-210</td>
<td>v p (39)</td>
<td>v v (100)</td>
</tr>
<tr>
<td>CCC CRAY-2</td>
<td>v p (41)</td>
<td>v v (100)</td>
</tr>
<tr>
<td>CRI CRAY Y-MP</td>
<td>v p (36)</td>
<td>v v (100)</td>
</tr>
<tr>
<td>DEC VAX 9000-210</td>
<td>v v (67)</td>
<td>v v (138)</td>
</tr>
<tr>
<td>FPS M511EA-2</td>
<td>v p (37)</td>
<td>v v (101)</td>
</tr>
<tr>
<td>Fujitsu VP2600/10</td>
<td>v p (38)</td>
<td>v v (99)</td>
</tr>
<tr>
<td>Hitachi S-820/80</td>
<td>v v (63)</td>
<td>v v (100)</td>
</tr>
<tr>
<td>IBM 3090-600J</td>
<td>v p (30)</td>
<td>v v (100)</td>
</tr>
<tr>
<td>NEC SX-X/14</td>
<td>v n (9)</td>
<td>v v (100)</td>
</tr>
</tbody>
</table>
Appendix C

The tables below contain the operation counts for each loop in the test suite.
<table>
<thead>
<tr>
<th>Loop</th>
<th>Load</th>
<th>Store</th>
<th>Gather</th>
<th>Scatter</th>
<th>Arithmetic</th>
<th>Reductions</th>
</tr>
</thead>
<tbody>
<tr>
<td>s111</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s112</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s113</td>
<td>0.1</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s114</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s115</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.2</td>
<td>0.0</td>
</tr>
<tr>
<td>s116</td>
<td>0.5</td>
<td>0.5</td>
<td>0.0</td>
<td>0.0</td>
<td>0.5</td>
<td>0.0</td>
</tr>
<tr>
<td>s118</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.2</td>
<td>0.0</td>
</tr>
<tr>
<td>s119</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s121</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s122</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s123</td>
<td>0.4</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.5</td>
<td>0.0</td>
</tr>
<tr>
<td>s124</td>
<td>0.4</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.5</td>
<td>0.0</td>
</tr>
<tr>
<td>s125</td>
<td>0.3</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.2</td>
<td>0.0</td>
</tr>
<tr>
<td>s126</td>
<td>0.3</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.2</td>
<td>0.0</td>
</tr>
<tr>
<td>s127</td>
<td>0.4</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.4</td>
<td>0.0</td>
</tr>
<tr>
<td>s128</td>
<td>0.3</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.2</td>
<td>0.0</td>
</tr>
<tr>
<td>s131</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s132</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.2</td>
<td>0.0</td>
</tr>
<tr>
<td>s141</td>
<td>0.1</td>
<td>0.1</td>
<td>0.1</td>
<td>0.1</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s151</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s152</td>
<td>0.4</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.3</td>
<td>0.0</td>
</tr>
<tr>
<td>s161</td>
<td>0.6</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.7</td>
<td>0.0</td>
</tr>
<tr>
<td>s162</td>
<td>0.3</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.2</td>
<td>0.0</td>
</tr>
<tr>
<td>s171</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s172</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s173</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s174</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s175</td>
<td>0.2</td>
<td>0.1</td>
<td>0.0</td>
<td>0.0</td>
<td>0.1</td>
<td>0.0</td>
</tr>
<tr>
<td>s176</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.0</td>
<td>0.2</td>
<td>0.0</td>
</tr>
<tr>
<td>s211</td>
<td>0.5</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.4</td>
<td>0.0</td>
</tr>
<tr>
<td>s212</td>
<td>0.5</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>0.3</td>
<td>0.0</td>
</tr>
<tr>
<td>s221</td>
<td>0.4</td>
<td>0.2</td>
<td>0.0</td>
<td>0.0</td>
<td>1.3</td>
<td>0.0</td>
</tr>
<tr>
<td>s222</td>
<td>2</td>
<td>6</td>
<td>1</td>
<td>2</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>s231</td>
<td>0.2</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.1</td>
<td>0</td>
</tr>
<tr>
<td>s232</td>
<td>0.2</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s233</td>
<td>0.4</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s234</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s235</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s241</td>
<td>0.4</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.4</td>
<td>0</td>
</tr>
<tr>
<td>s242</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>1.3</td>
<td>0</td>
</tr>
<tr>
<td>s243</td>
<td>0.5</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.6</td>
<td>0</td>
</tr>
<tr>
<td>s244</td>
<td>0.4</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.5</td>
<td>0</td>
</tr>
<tr>
<td>s251</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.3</td>
<td>0</td>
</tr>
<tr>
<td>s252</td>
<td>0.2</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s253</td>
<td>0.4</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.6</td>
<td>0</td>
</tr>
<tr>
<td>s254</td>
<td>0.1</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s255</td>
<td>0.1</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.3</td>
<td>0</td>
</tr>
<tr>
<td>s256</td>
<td>0.2</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s257</td>
<td>0.2</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s258</td>
<td>0.4</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.7</td>
<td>0</td>
</tr>
<tr>
<td>s261</td>
<td>0.5</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.3</td>
<td>0</td>
</tr>
<tr>
<td>s271</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.4</td>
<td>0</td>
</tr>
<tr>
<td>s272</td>
<td>0.5</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.7</td>
<td>0</td>
</tr>
<tr>
<td>s273</td>
<td>0.5</td>
<td>0.3</td>
<td>0</td>
<td>0</td>
<td>0.8</td>
<td>0</td>
</tr>
<tr>
<td>s274</td>
<td>0.5</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.4</td>
<td>0</td>
</tr>
<tr>
<td>s275</td>
<td>0.4</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.3</td>
<td>0</td>
</tr>
<tr>
<td>s276</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s277</td>
<td>0.5</td>
<td>0.2</td>
<td>0</td>
<td>0</td>
<td>0.8</td>
<td>0</td>
</tr>
<tr>
<td>s278</td>
<td>0.5</td>
<td>0.3</td>
<td>0</td>
<td>0</td>
<td>0.1</td>
<td>0</td>
</tr>
<tr>
<td>s279</td>
<td>0.5</td>
<td>0.3</td>
<td>0</td>
<td>0</td>
<td>0.1</td>
<td>0</td>
</tr>
<tr>
<td>s2710</td>
<td>0.5</td>
<td>0.4</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s2711</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s2712</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.4</td>
<td>0</td>
</tr>
<tr>
<td>s281</td>
<td>0.2</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s291</td>
<td>0.2</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s292</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s293</td>
<td>0.0</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.0</td>
<td>0</td>
</tr>
<tr>
<td>s2101</td>
<td>0.3</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.2</td>
<td>0</td>
</tr>
<tr>
<td>s2102</td>
<td>0.0</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.0</td>
<td>0</td>
</tr>
<tr>
<td>s2111</td>
<td>0.2</td>
<td>0.1</td>
<td>0</td>
<td>0</td>
<td>0.1</td>
<td>0</td>
</tr>
<tr>
<td>s311</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s312</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s313</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s314</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s315</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s316</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s317</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s318</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s319</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s3110</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s3111</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s3112</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s3113</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s321</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s322</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s323</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s331</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s332</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s341</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s342</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s343</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s351</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s352</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s353</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s411</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s412</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s413</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s414</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s415</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s421</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s422</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s423</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s424</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s431</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s432</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s441</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s442</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s443</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s451</td>
<td>28</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>s452</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>s453</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>s471</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>s481</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>3</td>
<td>0</td>
</tr>
<tr>
<td>s482</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>3</td>
<td>0</td>
</tr>
<tr>
<td>s491</td>
<td>4</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>s4112</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>s4113</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>s4114</td>
<td>3</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>3</td>
<td>0</td>
</tr>
<tr>
<td>s4115</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s4116</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>s4117</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>3</td>
<td>0</td>
</tr>
<tr>
<td>s4121</td>
<td>3</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>va</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>vpv</td>
<td>2</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>vtv</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>vptvs</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
<td>0</td>
</tr>
<tr>
<td>vptv</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>vppv</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>vtvtv</td>
<td>3</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>2</td>
</tr>
<tr>
<td>vbor</td>
<td>6</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>vif</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>vag</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>vas</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>vsumr</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>vdotr</td>
<td>2</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
</tbody>
</table>
Appendix D