Implementation of a Wait-Free Synchronization Primitive That Solves n-Process Consensus

Dror G. Feitelson, Larry Rudolph
Department of Computer Science
The Hebrew University of Jerusalem
91904 Jerusalem, ISRAEL

Abstract

Several synchronization primitives were compared in a recent paper by Herlihy [7th PODC, 1988, pp. 276-290], using their ability to implement wait-free shared data objects as a measure of relative strength. Wait-free n-process consensus was shown to be a useful generic problem, whose solution for different values of n creates a hierarchy of primitives. Well known primitives such as test-and-set, swap, and fetch-and-add can only solve 2-process consensus, and thus cannot be used to implement arbitrary wait-free objects. Stronger primitives that solve the n-process version were also suggested in the paper; these include memory-to-memory swap, compare-and-swap, and fetch-and-cons. However, it is not clear that these primitives can be implemented in a satisfactory manner. In this note a new primitive that solves the wait-free n-process consensus problem is presented: the fetch-and-conditional-swap. This primitive is easily implementable by a combining network.

Keywords: consensus, read-modify-write, wait-free synchronization

1 Introduction

Shared memory allows multiple processes to interact using few data objects; for example, a broadcast can be implemented by having the transmitter write to a shared cell and then having the receivers read from it. The problem with such a scheme is that if the transmitter is delayed for some internal reason, the receivers must wait for it. The solution is to try not to use operations such as broadcast, where many processes are dependent on the activities of one other process; rather, symmetric operations are used in which all processes play a similar non-critical role. In addition, the interactions should be wait-free, meaning that any process is guaranteed to complete its part in the interaction regardless of the progress of the others.

A comparison of some well-known primitives, based on the wait-free objects that they can implement, was reported recently by Herlihy [2]. At the heart of his work lies the problem of wait-free n-process consensus: n processes each start out with individual preferred values, and they must reach an agreement on one of these values. A naive solution would be to decompose the problem into two parts: first choose a leader, and then broadcast the leader's preferred value. However, the chosen leader might go to a party to celebrate its victory and forget to broadcast the value, thus leaving his followers in suspense. Therefore the actions of choosing the value and broadcasting it must be combined in some way into one atomic operation.

Herlihy uses n-process consensus in two ways. First, he uses it to form a hierarchy of operations:

1. Read/write memory cells cannot solve wait-free 2-process consensus.
2. Any non-trivial read-modify-write operation can solve wait-free 2-process consensus, but most cannot solve 3-process consensus. This includes test-and-set, swap, fetch-and-add, and FIFO queues.
3. Several primitives do solve the general n-process consensus problem. These include memory-to-memory move and swap, compare-and-swap, n-register assignment, fetch-and-cons, and an augmented FIFO queue.

Second, he shows that primitives that solve the wait-free n-process consensus problem are universal, in the sense that they can implement any wait-free object.

The main problem with Herlihy's results is that the primitives he suggests are not so primitive; in fact, it seems doubtful whether they can be implemented efficiently at all. The purpose of this note is to present a new primitive that is simple to implement, and solves the wait-free n-process consensus problem; this is done in the next section. The final section evaluates the significance of this result.

2 Fetch-and-Conditional-Swap

The only way to implement a shared object that supports highly concurrent access by multiple processes is by use of a combining multi-stage interconnection network [3]. Such networks are indeed utilized by research prototypes like the NYU Ultracomputer [1] and IBM RP3 [4]. Other methods use arbitration mechanisms to perform only one of a set of concurrent requests addressed at the same object, and thus force serial execution when concurrent access is attempted.

Combining proceeds as follows. The interconnection network is a multi-stage packet-switched network that connects processors to memory modules. Access requests are taken to be read-modify-write operations of the form <addr, f>, where addr is an address in the shared memory and f is a function that specifies how to update the contents of that address. Such requests return the old contents of the address. When two requests <addr, f> and <addr, g> directed at the same address arrive at the same network node, they are combined and the unified request <addr, g ∘ f> is forwarded to the relevant memory module (where g ∘ f is the composition of g and f, i.e. (g ∘ f)(x) = g(f(x))). When the return value v is received by the node, it is propagated back as a response to the first request. Then f(v) is calculated, and sent as a response to the second request. The final effect is equivalent to a scenario in which the first request was served before the second one [3].

Not all types of functions can be combined, however. Using the symbols of the above example, we can identify the following two requirements that must be fulfilled for combining to be practical:

1. The composition of functions g ∘ f must be easy to compute.
2. The representation of g ∘ f should not require more space than the representations of f or of g.

The synchronization primitives suggested by Herlihy are not suitable for combining:

Memory-to-memory swap: this instruction addresses two memory locations, and thus undermines the whole basis of a combining network's operation.
These networks can handle multiple requests that propagate from the processors to the memory modules, but cannot handle direct interaction among the processors or among the memory modules.

Compare-and-swap: each compare-and-swap requires the address contents to be compared with another specified value; when such requests are combined all the values must propagate on, and actually all the work must be done serially at the memory module.

Fetch-and-cons: each fetch-and-cons requires another element to be added to the head of a list. Therefore when many are combined, many new elements must be transferred with the request.

The proposed primitive, fetch-and-conditional-swap, is based on the observation that a restricted version of compare-and-swap is sufficient for solving the n-process consensus problem. The initial value of the shared memory location is ⊥. All the processes check for this same value, to see if they are first; if so, they substitute it with their preferred value. If the value they see is not ⊥, they infer that another process arrived before them; in this case they just read its preferred value.

Formally, fetch-and-conditional-swap is defined as a read-modify-write operation such that the function f associated with a memory access request is

    λx y. if x = ⊥ then y else x,

where x is the contents of memory and y is the preferred value of the process. Two requests with preferred values y1 and y2 are combined by arbitrarily choosing one of them and sending it on; assume it is y1. When the return value v arrives, the two return values are generated as follows: if v = ⊥, return ⊥ to the process that had sent y1 and return y1 to the process that had sent y2. If it is not, return v to both. Obviously, the hardware needed to implement the combining is small, and there is no overhead in the size of the combined request.

It should be noted that fetch-and-conditional-swap is a restricted version of the data-level synchronization primitives suggested by Kruskal, Rudolph, and Snir [3].
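
The definition and the combining rule given above can be sketched in software. The following is a minimal Python simulation under stated assumptions, not an implementation from the paper: the names Cell, decide, combine, and BOTTOM are hypothetical, None stands in for ⊥ (so None may not be used as a preferred value), and the lock only models the atomicity that the memory module would provide in hardware. A lock-based simulation is of course not itself wait-free.

```python
import threading

BOTTOM = None  # assumption: None plays the role of the initial value ⊥


class Cell:
    """One shared memory location supporting fetch-and-conditional-swap."""

    def __init__(self):
        self._value = BOTTOM
        # The lock only simulates the atomic read-modify-write of the
        # memory module; it is not part of the primitive itself.
        self._lock = threading.Lock()

    def fetch_and_conditional_swap(self, y):
        """Atomically apply x -> (y if x is ⊥ else x) and return the old x."""
        with self._lock:
            old = self._value
            if old is BOTTOM:
                self._value = y
            return old


def decide(cell, preferred):
    """n-process consensus: whoever sees ⊥ was first and wins."""
    old = cell.fetch_and_conditional_swap(preferred)
    return preferred if old is BOTTOM else old


def combine(y1, y2):
    """Combining at a network node: forward y1 only.

    Returns the forwarded value and a function that turns the memory's
    reply v into the pair (reply to y1's sender, reply to y2's sender).
    """
    def split(v):
        if v is BOTTOM:
            return BOTTOM, y1  # y1's sender was first; y2's sender sees y1
        return v, v            # someone else was first; both see that value

    return y1, split


if __name__ == "__main__":
    cell = Cell()
    decisions = [None] * 8

    def worker(i):
        decisions[i] = decide(cell, f"value-{i}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("all processes agree:", len(set(decisions)) == 1)
```

Each process calls decide exactly once and never waits on another process's progress beyond the single atomic access. The combine function forwards a request that is no larger than either original request, which is the space requirement stated for practical combining.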
Data-level synchronization primitives perform an operation on a shared variable, conditioned on the state that the variable is in. If the number of possible states is small, as it is in our case, efficient combining is possible. This also explains why the full compare-and-swap cannot be combined: it effectively uses the whole memory-word contents as a state variable, and thus has an exponential number of states.

3 Evaluation

The fetch-and-conditional-swap primitive introduced in this note allows for a wait-free solution to the n-process consensus problem; this is interesting in its own right, as consensus is a basic problem in concurrent systems. In addition, it is a simple and practical universal wait-free primitive, using Herlihy's construction for implementing fetch-and-cons based on consensus and then implementing any wait-free object by use of fetch-and-cons. As Herlihy notes, however, this construction is too inefficient to be considered useful for practical purposes. Therefore it seems that the universality of the new primitive is of academic interest only.

It is natural to ask whether this deficiency is specific to fetch-and-conditional-swap, in which case other primitives ought to be sought, or maybe it is inherent in the concept of universal wait-free primitives. The answer probably lies in between, specifically in the use of fetch-and-cons as an intermediate step in the reduction of an arbitrary object to a wait-free implementation using n-process consensus. The problem is that fetch-and-cons implies access to two shared memory locations at once: one is the pointer to the head of the list, which is made to point at the new element, and the other is the "next" pointer of the new element, which is made to point at the element which was previously pointed at by the head. Thus fetch-and-cons is similar to a memory-to-memory swap at a very basic level. Realization of a practical universal wait-free primitive therefore requires a solution to one of the following problems:

1. Find an efficient and practical implementation of memory-to-memory swap.
2. Find a direct method to generate a wait-free implementation of a concurrent object, using n-process consensus but avoiding fetch-and-cons.

References

[1] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir, "The NYU Ultracomputer - designing an MIMD shared memory parallel computer". IEEE Trans. Computers C-32(2), pp. 175-189, Feb 1983.
[2] M. P. Herlihy, "Impossibility and universality results for wait-free synchronization". In Proc. 7th Symp. Principles of Distributed Computing, pp. 276-290, 1988.
[3] C. P. Kruskal, L. Rudolph, and M. Snir, "Efficient synchronization on multiprocessors with shared memory". ACM Trans. Programming Languages and Systems 10(4), pp. 579-601, Oct 1988.
[4] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss, "The IBM research parallel processor prototype (RP3): introduction and architecture". In Intl. Conf. Parallel Processing, pp. 764-771, 1985.