Comments

You are invited to share your thoughts on any of the following topics:

The material on this site. See site map.
The book Reasoning About Parallel Architectures. Order information.
Shared memory architectures in general. However, material specifically relating to explaining what hardware constructs lead to machines passing/failing specific tests in ARCHTEST should be sent to Dr. Naraig Manjikian (nmanjiki@eleceng.ee.queensu.ca) at Queens University in Canada, for inclusion in the technical forum maintained at the Computer Architecture Page .

There are four conditions. I may:

Exclude comments that I think do not fit.
Comment on the comments.
Edit comments. (But no editted submission will be posted without the consent of the commentator.)
Modify these conditions in the light of experience.

Please email comments (in ascii or html) to collier@acm.org. I will prepend a message number and the author's name and email address to each submission and will enter the submissions in the reverse of the order in which they are received (that is, most recent first).

98031701 Stale Data and a Failure to be CC1 W. W. Collier

98011001 Multiprocessors Should Support Simple Memory Consistency Models Mark Hill

98010104 Occam's Razor for Shared Memory Architectures W. W. Collier

98010103 Stale Data and a Failure to be CC1 W. W. Collier

98010102 A Rule with No Justification? W. W. Collier

98010101 Rules vs. Tests W. W. Collier

Stale Data and a Failure to be CC1

98010103 W. W. Collier, collier@acm.org

Stale data occurs when a machine changes the value of an operand, and then at a demonstrably subsequent time the value of the operand is fetched and the fetched value is found to be the old value, not the new value.

Test T5 has the potential to detect stale data on a machine which obeys RR. A circuit demonstrating a failure of Test T5 has the form:

     (P1,L1,W,1,A,S2) <cwr (P2,L1,R,1,A,S2)
                      <srw (P2,L1,W,1,B,S3)
                      <cwr (P3,L1,R,1,B,S3)
                      <rr  (P3,L2,R,0,A,S3)
                      <crw (P1,L1,W,1,A,S3)
                      =cc1 (P1,L1,W,1,A,S2)

Since the machine obeys RR, it must relax CC1. Consequently, the path

     (P1,L1,W,1,A,S2) <cwr (P2,L1,R,1,A,S2)
                      <srw (P2,L1,W,1,B,S3)
                      <cwr (P3,L1,R,1,B,S3)
                      <rr  (P3,L2,R,0,A,S3)

shows that stale data was visible (P1 wrote a 1 in A; later P3 read a 0 from A).

Relaxations detected by tests T6 and T7 do not conclusively show the existence of stale data. The circuits showing a relaxation of CC1 contain <cc1 twice. Breaking the circuit by removing the links connected by <cc1 leaves two separate paths, neither of which is sufficient to establish the visibility of stale data.

It is not clear whether or not other tests exist which demonstrate that a machine makes stale data visible.

Multiprocessors Should Support Simple Memory Consistency Models

98011001 Mark Hill, markhill@cs.wisc.edu

Many future computers will be shared-memory multiprocessors. These hardware systems must define for software the allowable behavior of memory. A reasonable model is sequential consistency (SC), which makes a shared memory multiprocessor behave like a multiprogrammed uniprocessor. Since SC appears to limit some of the optimizations useful for aggressive hardware implementations, researchers and practitioners have defined several relaxed consistency models. Some of these models just relax the ordering from writes to reads (processor consistency, IBM 370, Intel Pentium Pro, and Sun TSO), while others aggressively relax the order among all normal reads and writes (weak ordering, release consistency, DEC Alpha, IBM PowerPC, and Sun RMO).

This paper argues that multiprocessors should implement SC or, in some cases, a model that just relaxes the ordering from writes to reads. I argue against using aggressively relaxed models because, with the advent of speculative execution, these models do not give a sufficient performance boost to justify exposing their complexity to the authors of low-level software.

For further information, see:

%T Multiprocessors Should Support Simple Memory Consistency Models
%A Mark D. Hill
%D October 1997
%I Univ. of Wisconsin
%R Computer Sciences Technical Report #1353
%Y URL ftp://ftp.cs.wisc.edu/wwt/tr97_sc_case.ps
%X

Occam's Razor for Shared Memory Architectures

98010104 W. W. Collier, collier@acm.org

Let R be some rule of architectural behavior, such as cache coherence.

Imagine two machines, A and B, which are identical in instruction set, performance, color, and price, except that A obeys R and B violates R. How can the two machines be distinguished?

An engineer can always examine the circuitry of the two machines to determine which machine is which.

Can a programmer write a program to distinguish the two machines?

Either there is a program which behaves differently on the two machines, or there is no such program. If there is a program which distinguishes the two machines, then it must behave in the following way. The obedience to rule R by machine A, and the lack of obedience by machine B means that the set of results calculable by machine B is larger than the set of results calculable by machine A. Therefore, the distinguishing program can calculate at least one answer under machine B which it demonstrably cannot calculate under machine A.

If there is no such program, then adherence to rule R may be a matter of interest to the engineers who build machines, but it is not a matter in the slightest degree of any importance to any programmer or other user of the machines.

The computer architecture literature would be greatly clarified if any claim that one machine was different in its behavior on shared operands from another machine was supported by:

1. A statement that the difference was, or was not, visible to anyone other than engineers.

2. If visible, then either (1) a statement of what rules were obeyed by one machine and violated by the other, or (2) a description of what programs compute results on the one machine that cannot be computed on the other machine.

Stale Data and a Failure to be CC1

98010103 W. W. Collier, collier@acm.org

Revised 980317. Some time ago I argued that all failures to obey CC1 involved seeing stale data. I was wrong. Only Test T5 does. A more measured argument is at 98031701.

A Rule with No Justification?

98010102 W. W. Collier, collier@acm.org

Many consider CC3 to be the appropriate form of cache coherence for machines to obey. The argument is that:

CC1 is useful only in very few situations, and it causes too large a performance drop (compared to CC3) to justify making it available.
Programmers can achieve atomicity of arbitrarily large granularity through the use of semaphores and critical sections.
In those few cases where semaphores cannot be used (in the semaphore routines and other low level operating system routines), programmers can use strong hardware instructions (compare and swap, fetch and op, exchange, etc.) to achieve write atomicity and to interact successfully with other routines simultaneously operating on the same data.

Note this: exactly the same discipline that can be used to successfully program a CC3 machine can also be used to successfully program a CC4 machine. Thus the argument that justifies CC3 over CC1 can be used to justify CC4 over CC3 and CC1.

Thus, it appears that CC3 is a rule with no justification. CC3 makes a machine slightly faster, but slightly harder to program than a CC1 machine. CC4 makes a machine slightly faster than a CC3 machine, but no harder to program.

Rules vs. Tests

98010101 W. W. Collier, collier@acm.org

As a programmer I want the simplest possible description of a machine's behavior. A list of the rules that a machine obeys on shared data would be perfect. The rules are finite in number, well known, and simple. Further, the descriptions of the rules have not changed fundamentally since being formulated in the late 1970's.

(In contrast to the rules, the tests are without limit. There can be any number of tests, thrashing operands around in ever more intricate patterns.)

In an ideal world engineers would design a machine with the goal of implementing certain rules, would prove that the design achieved the goal, and would then announce the rules to programmers.

In the real world the closest approach to certainty possible is achieved through multiple, imperfect efforts at verification. Testing is one such effort; you can fish in a lake all day and not catch any fish, but you still will not have proven that the lake contains no fish.

Testing machines to determine their behavior on shared operands begins with identifying distinguishing architectures, and the quest for distinguishing architectures begins by asking two questions:

What are the rules of machine behavior which programs can observe?
What programs can observe those rules of behavior?

These questions are too simple, since programs cannot detect compliance, but only violations, of rules. More accurately, the questions would ask:

What are the rules of machine behavior whose violation programs can observe?
What programs can observe those violations?

This is still misleading. Programs do not detect a violation of an individual rule; they detect a violation of a set of rules. The questions should be:

What are the sets of rules of machine behavior whose violation programs can observe?
What programs can observe a violation of those sets of rules?

In each of the above three pairs of questions the first question focused on rules and the second on programs. But the two questions in each pair are very closely related. Thus it may be valid to focus on program results as a way of describing machine behavior.

This has already started. I found it natural to summarize the results of running the tests in terms of which tests were passed or failed, not in terms of rules obeyed or disobeyed. Licensees of ARCHTEST have asked for help in explaining why machines pass/fail one or another of the tests in ARCHTEST, not why they obey/violate particular rules.

Perhaps, in the future, we will describe a machine's behavior, not in terms of the simple sequences of events described in today's rules, but in terms of the more complex patterns of events contained in the tests. And, instead of writing shared data routines assuming just the simple rules of order and coherence, we will write shared data routines which will run successfully on machines which are known not to permit certain patterns of events.

Here is a scenario in which this might be useful. A large database program runs on a relaxed SMMP. The occasions when two processes actually compete for the same item of data are extremely infrequent. Calls to the system semaphore routines are time consuming; even worse, they are mostly a waste, given the infrequency of real contention. Therefore, programmers write their own critical section routine to guard access to items of data. The routine works only if it contains none of the sequences of code which reveal relaxed behavior of the machine.

Send email to: William W. Collier.

Comments

Stale Data and a Failure to be CC1

Multiprocessors Should Support Simple Memory Consistency Models

Occam's Razor for Shared Memory Architectures

Stale Data and a Failure to be CC1

A Rule with No Justification?

Rules vs. Tests

Site Map

What's New?

References