About site: Programming/Languages/Synchronous - Unification of Synchronous and Asynchronous Models for Parallel Programming Languages
Return to Computers also Computers
  About site: http://cobweb.ecn.purdue.edu/~hankd/CARP/XPC/paper.html

Title: Programming/Languages/Synchronous - Unification of Synchronous and Asynchronous Models for Parallel Programming Languages Thesis proposing parallel language, based on C, that lets programmers explicitly specify and manage parallelism on a broad class of architectures. [Purdue University]
State_Threads_Library A small application library for writing fast and highly scalable Internet applications on UNIX-like platforms. [Open source, MPL or GPL]

Vision_Numeric Type3 software for engraving and sculpting of fonts and graphic logos with CNC or laser cut machines [requires flash].

Freedesktop_org A free software project to work on interoperability and shared technology for desktop environments for the X Window System.

i3d_Themes FrontPage 2000/2002, Dreamweaver, and Flash templates. Free custom graphics services for company Logos, page banners and Swish intros Frontpage tips, tutorials and live on-line support.

JournalToGo_com Delivers customized clinical abstracts and healthcare news. Includes tour, list of channels, and company information.

RFC_2592 Definitions of Managed Objects for the Delegation of Management Script. D. Levi, J. Schoenwaelder. May 1999.


  Alexa statistic for http://cobweb.ecn.purdue.edu/~hankd/CARP/XPC/paper.html





Get your Google PageRank






Please visit: http://cobweb.ecn.purdue.edu/~hankd/CARP/XPC/paper.html


  Related sites for http://cobweb.ecn.purdue.edu/~hankd/CARP/XPC/paper.html
    drcauto Developers of the Smart Architect range of architectural extensions to AutoCAD and AutoCAD LT, as well as a 3D modeller, and a LISP enabler for AutoCAD LT.
    Uversa_Consulting Linux consulting for businesses in the Phoenix, Arizona.
    Draft_-_Multiple_Signatures_using_Security_Multiparts Describes how the Security Multiparts defined in RFC 1847 can be used to transport multiple digital signatures.
    K-Computer Reseller of Sage and Pegasus accounting software and Mainstay and IGS hotel software. Features company profile.
    Schwichtenberg,_Holger Blog, news about .NET, ASP.NET and Visual Studio .NET.
    Trying_Sun_Solaris_for_Intel_x86 Introduction and installation notes for the Intel platform; many related links.
    A_Short_History_of_Cryptography A brief historical analysis of the management of cryptography.
    Ted_Nelson,_Hypertext_Pioneer ZDTV article describing his work on Xanadu. (August 12, 1998)
    Mailcrypt__An_Emacs/PGP_Interface Encrypt/decrypt mail with PGP 5.0 or GnuPG.
    DocuLogic The DocuLogic Virtual Records Center online document management system provides imaging, indexing and secure storage and retrieval over a customized, firewall protected, SSL-encrypted web-based interf
    Stealth_Message Stealth Message offers a free online messaging system designed for communicating sensitive and confidential information. It ensures your privacy, allowing you to communicate in complete confidence wi
    MNG Is the animation extension to the popular PNG image-format. It brings the ease of GIF animations, WITHOUT the patent licensing issues. Libmng is the MNG library (free software by Gerard Juyn).
    Java_is_Afraid_of_Unions By Rajesh Patkar. Unions give different views of looking at the same memory location.
    Bee_Ware Provide a web application filtering, authentication and intrustion prevention systems.
    Game_Developers_Java_Users_Group_(GameJUG) Mailing lists signup, membership information, resources, and links.
    RFC_0364 Serving Remote Users on the ARPANET. M.D. Abrams. July 1972.
    Music_Study [Win-Mac] Ear training and music theory instruction software, by Dr. Gilbert Trythall.
    A_to_Z_of_C The non-profit free book on C/DOS/Turbo C programming with 79 chapters. It's available online to view/download.
    Studio_Deluxe_International Offers site design, graphics, and hosting.
    Extracting_Patterns_and_Relations_from_the_World_Wide_Web Paper by Sergey Brin presenting a technique which exploits the duality between sets of patterns and relations to grow the target relation, starting from a small sample. [PDF] (November 11, 1999
This is websites2007.org cache of m/ as retrieved on 2008.09.07 websites2007.org's cache is the snapshot that we took of the page as we crawled the web. The page may have changed since that time.
UNIFICATION OF SYNCHRONOUS AND ASYNCHRONOUS MODELS FOR PARALLEL PROGRAMMING LANGUAGES

UNIFICATION OF SYNCHRONOUS AND ASYNCHRONOUS MODELS FORPARALLEL PROGRAMMING LANGUAGES

A ThesisSubmitted to the FacultyPurdue UniversitybyMichael J. PhillipIn Partial Fulfillment of theRequirements for the DegreeMaster of Science in Electrical EngineeringMay 1989This thesis is dedicated to Mitzi.

ACKNOWLEDGMENTS

I would like to thank my thesis adviser, Dr. Henry Dietz,for his encouragement, expertise, and always open door.I would also like to thank Dr. Tom Casavant and Dr. DaveMeyer for serving on my committee and providing invaluablefeedback.I thank Dr. H. J. Siegel and Wayne Nation for their inputregarding the past, present, and future status of PASM.A special thanks to Dr. James Kuehn for the many hours ofguidance and constructive conversations regarding compilersand Parallel-C.Finally, I thank Dr. Paul Schneck, Dr. John Riganati, andthe research staff at the Supercomputing Research Center forproviding the opportunity and environment to develop asuitable thesis topic.

ABSTRACT

Phillip, Michael J., M.S.E.E.,Purdue University. May 1989. Unification Of Synchronous AndAsynchronous Models For Parallel Programming Languages. MajorProfessor: Dr. Henry G. Dietz.Most parallel languages can be associated with either asynchronous or asynchronous programming model. Synchronouslanguages tend to infer semantics from a particularexecution model, while asynchronous languages are oftendifficult to implement efficiently on existing hardware.Each model encompasses a different form of parallelism, butneither is capable of expressing both synchronous andasynchronous parallel constructs.In this thesis, a number of existing parallel languages areexamined with respect to the precise semantics of theparallel constructs and their impact on the type ofparallelism that can be expressed in the programming model.From this examination, a unified programming model isderived that allows parallel programs to be mappedefficiently onto both SIMD and MIMD architectures.Fundamental to this model is the separation ofsynchronization from control flow and communication issues,as well as the capability to allow nondeterminism inparallel algorithms.As a concrete illustration of the unified model, anexplicitly parallel C language, XPC, is proposed. A BNFgrammar for the language is given, and the semantics ofparallel constructs are informally discussed. XPC extendsall communication and control constructs of the Cprogramming language to allow concise specification andmanagement of both synchronous and asynchronous parallelism.

Notice for HTML Users

The HTML version of this thesisdiffers from the official copy in severalways, including a variety of minor formatting changes.The official copy is available through the usualchannels for a Purdue MS Thesis.However, this HTML version is complete and very convenientfor browsing.Most WWW browsers will even allow you to print this versionby automatically converting it into postscript.This HTMLdocument was generated using a script that replaces figureswith in-line gif images that are hypertextlinked to full-resolution postscript versions; the(nK .ps.Z) at the upperright of each in-line image provides both the size of thecompressed postscript file and the hypertext link. Thereare also a variety of hypertext links inserted fornavigation within the document, including a Hypertext Indexat the end of this document (click here to jump to the HypertextIndex).

1. Introduction

As parallel processing systems become more widely availableover the next several years, the need for adequate softwaredevelopment environments will become increasingly visible.Programming languages proposed or implemented thus far tendto support a particular type of parallelism and are oftenclosely associated with a particular architecture. Issuessuch as scalability and portability -- fundamental insequential languages -- are often ignored in parallelprogramming models. There is a need to provide theprogrammer with a model that encompasses many forms ofparallelism, while still retaining the "feel" of a morefamiliar sequential language.To this end, this thesis proposes a parallel language, basedon the C programming language, that will allow programmersto explicitly specify and manage parallelism on a broadclass of architectures. A foundation for such a language isbuilt by first examining a number of explicitly parallellanguages with respect to the semantics of parallelconstructs and the interaction of such constructs duringprogram execution. From this examination, aspects of aunified programming model are derived, and inconsistenciesin existing language semantics are exposed. This unifiedmodel provides a unique perspective of parallel programming,and identifies language features that are desirable in ageneral-purpose parallel programming language. 1.1. Parallel Programming Models The sequential model of programming has spawned a greatnumber of languages and algorithms, all tied intricately tovon Neumann style architectures. With the evolution ofparallel hardware has come the introduction of languagesthat support the specification and management ofparallelism. These languages have found moderateacceptance, but remain difficult to use due to a number ofshortcomings including a lack of machine independence, alack of scalability, and a lack of programming support, suchas source-level debugging. Hence, the primary challenge ofhardware and software designers alike is to provide anenvironment where parallel programs can be designed,implemented, and debugged by programmers who have bothnatural and learned tendencies to decompose problems in asequential manner. Thus far, three approaches have beenpursued in an attempt to meet this challenge:Synchronous parallel modelAllow parallelism only within a given "step" of a solution.A program specifies a sequence of such steps, withparallelism facilitated by simultaneous execution of eachstep in many processors. This approach is often associatedwith a SIMD execution model.Asynchronous parallel modelRepresent a program as a set of sequential processes thatoccasionally interact, but otherwise proceed independently.A program's processes are written based on the assumptionthat all processes execute asynchronously. If one processneeds information from another, it may be forced to wait forthat information to become available, but no other processesare affected by this interaction. This model is generallyused in conjunction with MIMD execution models.Sequential modelWrite an "ordinary" sequential program and have an automatedsystem convert it into parallel code. A program is firstanalyzed to determine a partial ordering of its operationson data that will preserve the correctness (meaning) of thesequential program. This analysis can be performed oneither "dusty decks" of existing code [KuS84] [Ell85] or on asequential language that has been designed or modified tofacilitate this analysis, such as Sisal [McS85] , Blaze [MeV85] , or Refined C [DiK84] . The operations from thepartial ordering are then "packaged" into parallel-executable processes appropriate for the target machine.Although this approach may provide the eventual solution,the complex and abstract nature of the required analysis isbeyond the scope of this discussion.It is interesting to note that all three approaches map theproblem of programming for parallel execution into asequential domain. Synchronous models allow the problem tobe decomposed into a series of small steps that may beexecuted in parallel, while asynchronous models treat theproblem as a set of nearly independent sequential tasks thatcan proceed concurrently. Although the third technique canpotentially be applied in conjunction with the first twoapproaches, the synchronous and asynchronous models havetypically been considered to be mutually exclusive.Clearly, this exclusion is not inherent in the problemitself, but is instead a result of applying a particular setof constraints to the formation of algorithms that map theproblem onto a particular class of machines. This view hasbrought about an important dilemma:if a solution contains both synchronous (SIMD) andasynchronous (MIMD) parallelism, and the underlying hardwarecan support both models, how can the programmer specify bothtypes in a single programming language?Implicit in the statement of this dilemma is an assertionthat many current parallel architecturescan, in fact, support both types ofprogramming models. It is ironic, then, that the emphasisin discussions of parallel programming models is most oftenplaced on whether a synchronous or asynchronous model shouldbe chosen. If parallelism is the only way to achieve highereffective speed, it is no longer acceptable to discard oneflavor of parallelism simply because the programminglanguage uses a different model of computation. 1.2. Parallel Execution Models In order to better understand how the synchronous andasynchronous constructs differ, a brief description of themachine models most closely identified with each languagetype is presented below. A number of other hardwareconfigurations exist in parallel architectures, but the SIMDand MIMD models provide the most natural division for thediscussion of programming models to be presented in thisdocument. 1.2.1. SIMD As originally defined [Fly66] SIMD refers to a Single Instruction stream, Multiple Datastream model of execution. In such a model, only a singlesequence of instructions is present, but each instructionmay be simultaneously executed in an arbitrary set ofprocessors, with each processor operating upon its own data.A typical block diagram for a SIMD machine is shown inFigure 1.1.Image1 not displayable as text. ( 7K .ps.Z ) Figure 1.1. SIMD block diagramTypically, program instructions are fetched and decoded bythe control processor (CP). Operations to be performed inparallel are then broadcast to those processing elements(PEs) that have been enabled for the current instruction.Processors that are disabled must remain idle while theinstruction is executed in enabled PEs. The parallelism ina SIMD model is at the lowest level of the programhierarchy, or the instruction level. Thus, SIMD machinescan efficiently implement fine-grain parallelism.This efficiency is due to the "perfect"scheduling of operations that is possible in a SIMD model;no interference occurs between processing elements as theprogram is executed. 1.2.2. MIMD A Multiple Instruction stream, Multiple Data stream(MIMD) machine is characterized byseveral sequences of instructions, or processes, that aredistributed in some manner on a set of processors. Theseinstruction streams may differ in both content andstructure, and can effectively be viewed as being executedby a set of uniprocessors that are able to communicate witheach other through use of special software and hardwaresupport. A generic block diagram for a MIMD model isillustrated in Figure 1.2.Image2 not displayable as  text. ( 9K .ps.Z ) Figure 1.2. MIMD block diagramThe means of communication between processors is generallyused to conceptually divide MIMD architectures into a numberof subclasses. For example, some memory models have ashared address space to facilitate communication, whileothers must send messages through a network. Thesecommunication techniques are also used in SIMD executionmodels, but are typically "hidden" from the programmer.Thus, since processes need not be executing the sameinstruction at the same time, and since communication usesup valuable execution time, large-grain parallelism isgenerally preferred in the programming model. The delaysdue to communication are also present in SIMD executionmodels, but the relative timing between the participatingprocesses can be determined statically. 1.3. A Unified Model of Parallelism In this document, the emphasis is not placed on acompetitive evaluation of synchronous and asynchronousprogramming models; clearly, both encompass desirable formsof parallelism. Instead, the goal is to examine the impactof several synchronous and asynchronous language features onthe ability to combine desirable aspects of SIMDand MIMD machine models. Thedivergence of SIMD (synchronous) and MIMD (asynchronous)programming models has by no means been accidental, butrather the result of differing design goals andimplementation constraints.A SIMD model of execution leads to hardware that isrelatively straightforward to build and reasonably easy tounderstand. Most SIMD machines simply extend von Neumannarchitectures to allow simultaneous operation of multiplefunctional units. As a result, most synchronous languageshave been built around a particular machine, providing anear optimal match between language constructs and theunderlying hardware. Examples of such language/machinepairs include GLYPNIR [LaL75] / Illiac IV [BaB68] , Vector-C [Li84] / Cyber 205 [Lin82] , Parallel-C [KuS85b] / PASM [SiS86] , and C* [RoS87] / Connection Machine [Hil85] . Ineach of these cases, at least some aspect of the languagesemantics have been inferred from the execution model.A MIMD model, by contrast, is theoretically much morepowerful, but is often more difficult to implement inhardware. This is especially true if communication issupported with shared memory hardware. Asynchronouslanguages tend to be designed as tools for examiningabstract theories of computation; hence, they often do notmap well onto parallel hardware. In many respects, the SIMDmodel is actually a proper subset of the more general MIMDmodel. 1.4. Overview of This Document As indicated above, explicitly parallel languages tend tofall into one of two classes based on the execution model ofthe target architecture. These classes have generally beenconsidered incompatible, and tend to favor certain types ofapplications and specific modes of parallelism. Thisdocument seeks to identify common features of both languageclasses, with the goal of specifying a unified programmingmodel that can be applied to a broad range of parallelalgorithms and machines. The approach taken is to firstexamine existing languages from the perspective of theprogramming model, rather than a particular implementation.Although descriptions of these languages have appearedelsewhere, discussion typically revolves around thelanguage/machine pair, thus obscuring issues related to theunderlying programming model. When taken out of the contextof a particular machine environment, inconsistencies in thelanguage semantics often appear, making it difficult for thelanguage to be implemented and used on other systems. Thediscussion of existing languages is not intended to becomprehensive, but does attempt to describe in some detailthose constructs that affect the expression and use ofparallelism. The manner in which synchronous andasynchronous language classes are compared is believed to beunique, as is the attempt to unify the two programmingmodels.Chapter 2 surveys four synchronous programming languageswith respect to the semantics of parallel constructs thatsupport computation on SIMD architectures. The survey notonly describes the salient features of each language, butalso points to inconsistencies in the model of parallelismused within individual constructs. An attempt has been madeto divide constructs into several categories, based on theimpact each category has on parallel execution. Thus eachlanguage discussion contains subsections covering theexpression of parallelism, the use of conditional anditerative constructs, and communication between processorswithin the system, as well as a brief summary of thestrengths and weaknesses of the model.Chapter 3 addresses asynchronous languages, ranging frommessage passing models to shared memory paradigms thatsuppress explicit naming of processes. As in Chapter 2, thesurvey attempts to focus on semantics of constructs that aredesirable in a unified programming model. Specifically,subsections cover the specification and management ofprocesses and the means by which processes communicate withone another.The synchronous and asynchronous models examined in theprevious chapters are merged into a unified parallelprogramming model in Chapter 4. This unification takesplace by extracting language features from previouslydescribed languages and providing a possible set of moregeneral semantics for parallel computation. The key to thespecification of a unified programming model is theseparation of interdependent characteristics that are ofteninherently linked in synchronous or asynchronous models.Thus, issues such as synchronization and communication aretreated separately, leading to a discussion of determinism,a fundamental aspect of parallel programming models.Chapter 5 presents a concrete illustration of the conceptsproposed in Chapter 4 by describing a new parallel language,XPC. The language is similar in style to many of thelanguages previously discussed, but the semantics ofconstructs are independent of the execution model of thetarget architecture. The language is also unique in that itis intended to be compatible with current compilertechnology, including automatic parallelization.Finally, Chapter 6 summarizes the work presented in thisthesis, and outlines future research directions in the areaof parallel language development.

2. SYNCHRONOUS LANGUAGES

2.1. Synchronous Constructs As described in the previous chapter, most explicitlyparallel languages are associated exclusively with either asynchronous (SIMD) or asynchronous (MIMD) model ofexecution. Constructs added to synchronous languages tosupport parallel execution can be divided into two classes:constructs that deal with the expression of parallelism, andthose that handle assignments and communication betweenprocessors. This chapter will examine both classes ofconstructs for a series of four synchronous languages.Synchronous languages allow parallelism to be expressed inthe following ways: 1. variable declarations (data structures and data layout) 2. enabling/disabling of PEs (algorithm topology)Variables are usually declared as either being distributedamong all processors or local to the control processor; thekeywords differ from language to language, but the meaningis the the same. For example, Glypnir uses the descriptorCU to define data and variables that reside in the controlunit, and PE to designate parallel variables for theprocessing elements. In Parallel-C, the keywords for suchdeclarations are scalar and parallel, respectively.Algorithm topology refers to the manner in which a programmaps onto a parallel execution model, including the controlflow and data flow of an algorithm across many processors.Ideally, algorithm topology is independent of machinetopology, at least within a given class of parallelarchitectures.Communication between processors is deceptivelystraightforward in synchronous languages. Given that datain SIMD machines is classified as being either scalar orparallel, most synchronous models specify semanticsgoverning each combination of data classifications. Somelanguages only allow certain types of assignments; othersattempt to support arbitrary communication between allprocessing elements, including the control processor.Inherent in all such semantics, however, is the provisionfor synchronization within communication constructs. It isthis linkage of potentially independent issues that preventsmost synchronous languages from being implemented on otherclasses of architectures, and often introducesinconsistencies into the programming model. 2.2. Glypnir Glypnir [LaL75] was developed as a programming language forthe Illiac IV parallel processing system [BaB68] . At thetime of its development, compiler technology could notprovide the optimization necessary to allow both efficientexecution and topology independence. Thus, a Glypnirprogram specification is closely associated with the IlliacIV hardware. The Illiac IV consists of 64 processingelements (PEs) and a single control unit (CU). Eachprocessing element may directly access 2048 words of memory,and may indirectly access any address within the system.Interprocessor communication can take place through aninterconnection network that allows PEs to directlycommunicate with other processors that have logicaldistances of +-1 or +-8. 2.2.1. Expression of Parallelism Glypnir allows parallelism to be expressed through variabledeclarations and the use of conditional and iterativeconstructs. Most standard variable types are supported inGlypnir, including real (floating point), integer (fixedpoint), and alpha (character). Aggregate variables ofthese basic types can be defined with the vector keyword,and the pointer directive allows memory indirection to beexpressed in conjunction with any of the above types.In order to support parallelism at the level of variabledeclarations, Glypnir uses the descriptor cu to define dataand variables that reside in the control unit, and pe todesignate parallel variables for the processing elements.The format of these declarators is as follows:/* scalar variable in control unit */cu integer i/* parallel array in PEs */pe real vector v[10]Parallel variables in Glypnir are of a fixed size andextent, resulting in parallel data layout that closelyresembles the Illiac IV topology. These data structures arereferred to as swords (superwords) and slices,representing one word of data from each of the 64 processingelements of the Illiac IV. A group of 64 words, each from adifferent PE, but with the same address within that PE, iscalled a sword. A slice is similar to a sword, but theaddresses within a PE need not be identical. Thus, thevariable v above actually declares a vector of 10 swords,or, equivalently, a two-dimensional "array" (64 x 10) ofintegers. If the vector is indexed with a cu variable, theresulting value is a sword. If the index is instead a pevariable, the result is a slice; each PE uses its own valueof the index to access an element of the vector.An additional data type, boolean, is provided for use inconditional expressions, which are described in thefollowing section. 2.2.2. Conditional Constructs In Glypnir, conditional constructs have a style similar tothose found in Algol, with the semantics extended to supportcontrol of identical parallel instruction streams. Theformat of these statements is as follows:if <BE> then <S1> else <S2>for all <BE> then <S1>go to <LABEL>The notation <BE> represents aBoolean expression consisting of one Boolean valuefor each of the 64 PEs in the machine model of the IlliacIV. Thus, all conditional constructs in Glypnir are treatedas parallel statements, with each PE evaluating its ownelement of the Boolean expression. The semantics of theseconstructs specify that statement <S1> is to beexecuted in those PEs where <BE> is evaluated as beingtrue, followed by execution of statement <S2> (whenpresent) in PEs where <BE> is determined to be false.The semantics of the for all statement are identical tothose of if-then; the statement simply provides analternative set of mnemonics. Nested conditional statementsare supported by means of a run-time execution stack thatcontains a properly nested sequence of the Booleanexpressions encountered at each level of the conditionalhierarchy. The Boolean expressions can therefore be viewedas enable patterns, and each level of nesting results in alogical AND operation of the current Boolean expression withthe inherited enable pattern. While this semanticinterpretation is not completely general for parallelcomputation, it is analogous tothe semantics of conditional constructs in a sequentialprogramming model. Glypnir also addresses the issue of goto statements inside of conditional constructs, presenting asolution that is consistent, but incomplete. Whenever a goto statement appears in the body of <S1> or <S2>of a conditional statement such as those above, a potentialproblem arises.Consider the following code sequence (where x, y and max aresword variables):if x > y then begin max <- x go to LABEL1end else begin max <- y go to LABEL2endThis would appear to imply that some PEs branch to LABEL1while others branch to LABEL2, thus violating the imposedSIMD model of execution. To resolve this apparent conflict,the Glypnir semantics require thatall PEs execute the go to statement in <S1> ifany PE does. Therefore, if theexpression x > y is evaluated to be true in any of the 64PEs, then all PEs will branch to LABEL1 after assigning thevariable max equal to the value of x. It is not clearwhether this assignment takes place in all PEs or only inthose where x > y, but it is clear that statement<S2> in the else clause will not be reached in any PE.As a result, these semantics can lead to confusing (butpredictable) code segments that have no analog in asequential language. As will be shown later, thesesemantics are overly restrictive, and can be relaxed toallow a more general set of conditional constructs to bespecified.Another problem arises when the target of a branch islocated inside of a conditional or iterative construct. Forinstance, in the above example LABEL1 may actually refer tocode in the body of another if statement. It is difficultto infer the semantics of such an occurrence from theGlypnir specification. 2.2.3. 3 Iterative Constructs Glypnir provides a number of iterative control constructs,each extending the semantics of conditional statements tosupport repeated execution of a block of code. Thefollowing are valid iterative statements in Glypnir:loop i <- i1, i2, i3 do <S>thru i do <S>for i <- i1 step i2 until i3 do <S>do <S> until <BE>while <BE> do <S>Like the Fortran do loop, the Glypnir loop statementperforms a comparison of the control variable (i) with aconditional limit (i3), with the statement <S> beingexecuted repetitively until the condition i > i3 is met.Unlike Fortran, however, the conditional test is performedbefore the loop statement isexecuted. The control variable (i) must be a cu integervariable, while the initial value, increment, and limit (i1,i2, and i3, respectively) can be arithmetic expressions thatyield a single value - swords and slices are disallowed.The Glypnir for statement is similar to the loop statement,except that the control variable, initial value, incrementand limit may be swords. As a result, the statement<S> may be executed a different number of times indifferent PEs, with each PE using its own values of thecontrol variable, initial value, increment, and limit.The thru statement repeats execution of the statement<S> for the number of iterations specified by the cuvariable (or constant) i.The do and while statements in Glypnir allow explicitparallelism in loops by requiring that a Boolean expressionbe specified as the conditional test. A do loop in Glypnirwill repetitively execute the statement <S> until theBoolean expression <BE> is false, and the while loopwill execute <S> as long the Boolean expression istrue. Thus, the statement <S> in a do loop will beexecuted one or more times, whereas the body of a while loopis executed zero or more times. 2.2.4. Assignments and Communication Assignment statements in Glypnir are relativelystraightforward, since all parallel variables must be of thesame extent. Arbitrary communication between PEs issupported through the use of pointers, which may either belocal to a specific memory module or global to the systemmemory. Pointer variables may be either a single cu pointeror a sword of pe pointers. A number of allocation anddeallocation statements are provided in the language tosupport the use of pointers. 2.2.5. Summary Glypnir was an early attempt to provide high-levelprogramming support for a particular parallel architecture.Clearly, an engineering approach was used in the design ofthe language, since most tradeoffs in the language favor apractical implementation over conceptual elegance in theprogramming model. Perhaps foremost of these tradeoffs wasthe choice to generate the most efficient code that compilertechnology was capable of at the time, at the cost of losingtopology independence. No attempt is made to differentiatebetween the programming model of Glypnir and the Illiac IVmachine model. The lack of parallel data abstraction is awell-documented problem with the language. There are alsoproblems with the control flow model, as described above,but the language does attempt to provide a complete,abstract control structure.Communication between processors is highly machine-dependent, but the inclusion of pointers as a means ofinterprocessor communication provides a bit of foreshadowingfor subsequent languages based on the C programming model.Of course, Glypnir pointer variables must be identified atthe point of declaration, thus avoiding some of theambiguous aliasing issues present in C, where the address ofany object may used in expressions and assignmentstatements.Glypnir provides many insights into the fundamental issuesof language specification for parallel architectures.Although there are flaws in the language model, it remains avaluable starting point in the study of synchronousprogramming languages. 2.3. Actus A few years after the introduction of Glypnir, Actus [Per79] was proposed as a general purpose synchronous programminglanguage for a number of SIMD machines. The primary goal inthe development of Actus was to provide an explicitlyparallel programming language that was independent of boththe underlying machine architectureand existing parallel algorithms. Instead, Actusallows the programmer to concisely specify the parallelismof a problem, provided that the problem can be specifiedwithin the bounds of a synchronous programming model. Thelanguage is based on Pascal, with strong typing andstructured programming constructs extended to supportparallelism. 2.3.1. Expression of Parallelism In Actus, only array data structures can be declared to haveany extent of parallelism, although the parconst keyword isincluded to declare parallel constants. An Actus array maycontain both serial and parallel dimensions, as illustratedbelow:const P = 100; S = 10;var (* parallel *) pvar : array [1:P] of integer; (* scalar *) svar : array [1..S] of integer; (* mixed *) grid : array [1..S, 1:P] of real;In this example, P denotes the maximum extent of parallelismthat may be assumed by the variables pvar and grid.Parallel arrays may be defined for any data type in thelanguage, including structured types such as Pascal recordsand sets. The use of the ":" character signifies that theassociated index is to be interpreted as the paralleldimension of the array. Actus allows only a singledimension of parallelism to be expressed for an array,although no such limitation is placed on the number ofscalar dimensions permitted. Thus, Actus provides a meansof defining all Pascal variables and types for a one-dimensional array of processing elements. The conceptualview of parallel data structures is quite similar to thatpresented in the Glypnir programming model. Unlike Glypnir,however, the extent of parallelism in Actus is not tied toany particular machine.Actus introduces the notion ofindex sets as a mechanism to specify the extent ofparallelism in statements and expressions. Through the useof index sets, the extent of parallelism can be changed eachtime a data structure (array) is referenced. Index sets aredeclared much like ordinary variables, but must specifyconstants as lower and upper bounds. Processors may beselectively omitted from an index set by taking the union(+), intersection (*), or difference (-) of two or moreindex sets. Regular patterns may be expressed through theuse of square brackets and a constant increment. Thefollowing example illustrates the declaration of index setsin Actus:index (* odd PEs from 1 to 99 *) odd = 1:[2]99; (* PEs 6 to 8 and 25 to 32 *) subset = 6:8 + 25:32; 2.3.2. Conditional Constructs In many respects, Actus is similar to Glypnir in itshandling of conditional constructs; both languages derivetheir semantics from sequential counterparts of similarappearance. The following are valid conditional constructsin Actus:if <E> then <S1>if <E> then <S1> else <S2>case <CS> of <L1>: <S1>; <L2>: <S2>; ... <LN>: <SN>end;The key difference between the Actus semantics and those ofGlypnir involves the specification of the conditionalexpression. In Actus it is not necessary that theexpression <E> be evaluated in all PEs. In fact, theexpression need not exhibit any parallelism at all. If theconditional expression is scalar, however, the extent ofparallelism for any variables in the body of statements<S1> and <S2> must be explicitly specified orinherited from an enclosing construct of known extent. Thisinheritance can take place either through a nestingmechanism similar to that of Glypnir, or by use of a withinconstruct. The within construct has the formwithin <EOP> do <S>where <EOP> represents the extent of parallelism forthe statement <S>. When the conditional expression isparallel, it is evaluated for each element present in thecurrent extent.Actus does not directly address the issue of goto statementsappearing inside the body of conditional constructs, optinginstead to disallow their use. This solution is certainlyself-consistent, but results in some loss of generality inthe programming model. 2.3.3. Iterative Constructs Actus provides three iterative constructs, corresponding tothe Pascal while, repeat, and for statements.while <E> do <S>repeat <S> until <E>for <INDEX> := <LOWER> by <INC> to <UPPER> do <S>As with the conditional constructs, the extent ofparallelism appearing within the body of iterativestatements must be explicitly specified or inheritedwhenever the conditional expression <E> is scalar.When the conditional expression contains parallel variables,the body of the iterative statement takes on the extent ofthe conditional expression. Thus, the while loop becomesmore flexible than the repeat-until loop in Actus due to thefact that the conditional expression in the while loop isevaluated before execution of the statement body, allowingthe extent of parallelism in the loop to vary overiterations. As in Glypnir, the extent of parallelism maynever increase - once a processing element drops out of aloop it must remain idle until all other active processorshave completed execution of the loop.The semantics of the for statement differ somewhat fromthose of the previously described iterative constructs. Inan Actus for loop, each of the parameters (index, lowerbound, increment, and upper bound) may be parallel, providedthat the extent of parallelism is specified or inherited.By allowing the index to be parallel, the for loop may betransformed into a number of potentially independent loopsexecuting in parallel, each performing a different number ofiterations. This interpretation is similar to the Fortrandoall construct, which is included in a number of parallelimplementations of Fortran. 2.3.4. Assignments and Communication Assignment statements in Actus must involve a single extentof parallelism that is less than or equal to the maximumextent of the variables used within the statement. The setof selected processing elements may vary on the left andright sides of the assignment, but the extent of thevariables and expressions must be the same.Thus, data alignment may take place in assignment statementsthrough use of the shift and rotate operators. The usage ofthese operators is as follows:<EOP> shift <INT EXPR><EOP> rotate <INT EXPR>where EOP> is the extent of parallelism of the variableor expression, and <INT EXPR> is an integerexpression. The following example illustrates the use ofdata alignment in Actus assignment statements.var x: array [1:10] of integer;index part = 3:5 + 7:8; ...x[part] := x[part shift -2] + x[part rotate 1]In this example, element 3 receives the sum of elements 1and element 4; element 8 receives the sum of elements 6 and3, and so forth. 2.3.5. Summary Of all the languages considered in this chapter, Actusprovides the cleanest and most self-consistentspecification. Unfortunately, it does so by imposingconstraints that limit the effectiveness of the language forthe broad class of problems that cannot be described interms of parallel data structures with a high degree ofregularity. The control flow model is fairly restrictive inActus, since goto statements are disallowed and thesemantics of conditional constructs impose a particularorder of evaluation for expressions.One of the important contributions of Actus is the use ofindex sets as a means of aligning data. In Actus, theextent of parallelism isalways specified as part of the construct. By providing aconcise method of specifying interprocessor communicationwithout the arbitrary use of pointers, Actus is able toretain much of its expressiveness without compromising itsgoal of strong data typing. The notion of index sets isextremely useful in a general parallel processing model, aswill be seen in later chapters. 2.4. Parallel-C Parallel-C [KuS85b] , as originally proposed, was a supersetof the C programming language that included extensions tosupport both synchronous and asynchronous languageconstructs. In particular, Parallel-C was intended toprovide an explicitly parallel high-level language for thePASM parallel processing system. The original languagespecification includes syntax for synchronous constructs,and a discussion of suggested approaches for asynchronousconstructs.PASM [SiS86] is a Partitionable SIMD/MIMD computerconsisting of a number of control processors and processingelements. The system may be dynamically reconfigured toallow either synchronous or asynchronous execution ofparallel programs. 2.4.1. Expression of Parallelism The style of earlier languages such as Glypnir and Actusbecame the basis for a number of explicitly parallelsynchronous languages. Languages such as Lucas [Fer82] andParallel-C take the notion of index sets one step further,allowing arbitrary selection of parallel shape and extentthrough the use of selectors. In Parallel-C, the keywordselector acts as both a type specifier and a storage classby defining a "vector of bits" that corresponds to the setof processors that may participate in parallel statementsand expressions. A selector variable can be defined for anyshape or extent of parallelism, and becomes the primarymethod of selection when used in conjunction with parallelvariables. As with index sets in Actus, selectors may becombined with set operators such as union, intersection, anddifference to form more complex shape descriptors. UnlikeActus, Parallel-C allows anyvariable to be parallel, requiring only that the shape andextent of the variable be specified at declaration. 2.4.2. Conditional Constructs Parallel-C extends all conditional constructs of C to aSIMD programming model. As in Actus, conditionalexpressions may be either scalar or parallel, with run-timeevaluation determining the extent of parallelism. Theformat of the Parallel-C conditional constructs is asfollows:if (<E>) <S1>;else <S2>;switch (<E> { case <C1> : <S1>; case <C2> : <S2>; ... default: <SN>;}goto <LABEL>;In a strict SIMD environment, the semantics of the Parallel-C if-then-else construct closely resemble those of Actus.If the expression <E> is parallel, the statement<S1> is executed in those PEs where <E> isevaluated to be non-zero. All other PEs are disabled, andmust wait for <S> to be completed in all enabled PEsbefore proceeding with execution. ( In some cases it may bepreferable to execute the body of an if statement and thecorresponding else statement concurrently in an asynchronousmanner, provided that there are no unacceptable side effectsof one statement body on the other. Parallel-C does notprohibit such an occurrence, but nor does it specify theexact semantics of asynchronous (MIMD) operation. )Nested if-then-else statements are handled by maintaining arun-time stack of PE address masks generated throughevaluation of the conditional expression. As in Actus, theset of enabled processors for nested conditional constructsmust always be a subset of those enabled in the enclosingconditional statement. Unlike Actus, however, Parallel-Cdoes not require that the extent of parallelism within thebody of a conditional statement match that of theconditional expression.The switch statement in Parallel-C is significantly morecomplicated than its counterpart in Actus, due to thepossible (but not necessary) inclusion of break statementswithin a case statement. In essence, break statements actas a special instance of a goto statement, where thedestination of the branch is the textual point immediatelyfollowing the closing brace of the switch statement. In astrictly synchronous model of execution, each statementblock must be executed separately; PE masks must begenerated for each case statement before any of the casestatements are actually executed in any of the processingelements. Since some case statements may not contain breakstatements, a logical OR operation of two or more PE masksmay need to be performed to attain the proper set of enabledprocessors for a particular case statement. Likewise, thePE mask for the default statement, if present, is thecomplement of the result obtained by performing an "or"operation of all prior case statement masks.At first glance, these semantics appear to be consistentwith a synchronous model of execution. A closer look,however, reveals some subtle inconsistencies between thesemantics of if-then-else constructs and those of a switchstatement. The fact that processor masks are generatedbefore execution of any case substatements means that eachsubstatement is executed no more than once. Execution canconceptually pass through multiple case statements, eachconsisting of a different shape of parallelism. The flowgraph associated with a particular switch statement couldalso be generated with a series of if-then and gotostatements. However, the semantics of the latter codesequence would prevent the execution of multiple casestatements, since no provision for changing the shape ofparallelism within the if-then statement is supported. 2.4.3. Iterative Constructs The following three iterative constructs have been extendedto support explicit parallelism in Parallel-C:while (<E>) <S>;do <S>while (<E>);for (<E1>; <E2>; <E3>) <S>;The semantics of these constructs are basically the same asthose of the conditional constructs, with the statement<S> being executed in those PEs where the conditionalexpression is evaluated to be true. As in previouslanguages, the set of active processors may not increaseover the course of execution. The body of the do-while loopis always executed at least once in all PEs that encounterthe loop, whereas the statements in the while and for loopsare executed zero or more times, depending on the initialevaluation of the conditional expression. 2.4.4. Assignments and Communication Synchronous languages have taken quite different approachesto the specification of semantics for assignment statements.Languages such as Glypnir and Actus require that a singleextent be associated with any assignment statement.Parallel-C relaxes this constraint somewhat, definingsemantics for the following cases:Left side scalar, right side scalar. (S=S) This isequivalent to a sequential C assignment statement.Left side scalar, right side parallel. (S=P) The right sideof the assignment should select only one element to beassigned to the scalar variable on the left side. If morethan one element is selected, the result of the assignmentis undefined.Left side parallel, right side scalar. (P=S) The scalarvalue on the right side of the assignment is promoted to aparallel value of the shape and extent of the variable onthe left side of the statement before the assignment ismade.Left side parallel, right side parallel. (P=P) Theassignment takes place in those PEs selected on both sidesof the expression, provided that the range of selectedelements on the left side is a subset of those selected onthe right side. If this condition is not met, the resultsare undefined.These semantics have several implications on the model ofexecution. First, the architecture must have some means ofpromoting a scalar value to a parallel value of arbitraryshape. This process usually involves some form of globalbroadcast from the control unit to the selected processingelements. Second, the machine must have a path from each PEto the control unit in order to facilitate parallel toscalar assignments. The Parallel-C semantics for this caseare fairly restrictive, yet they do not adequately definethis type of assignment. Consider the case where a parallelvalue returned from a function call is used as the selectorin an assignment expression that has a scalar variable onthe left side. In such a case it may not be possible todetermine at compile time how many elements will be selectedin the parallel value, thus the scalar result must beassumed to be undefined. Of course, an explicit check canbe made by the programmer, or a warning message can begenerated during the flow analysis phase of compilation, butneither of these options appears to be a desirable solution.Finally, the semantics governing assignments betweenparallel variables disallow implicit communication betweenPEs by requiring that the assignment take place only inthose PEs selected on both sides of the expression. Anexchange of data between PEs must take place by means ofsystem function calls that can perform the desiredpermutations. This is actually quite analogous to thesequential C programming model, where no standard I/Ofunctions are included in the language definition. Thesemantics also imply that a pointer may not be assigned tovalues outside of the memory space associated with theprocessing element in which it is defined. This is asubtle, but important, point.Casting the type and shape of parallel variables issupported in the language model of Parallel-C, but thesemantics associated with such an operation appear to beinconsistent with the rules governing assignments. Withoutspecifying exactly how the reshaping takes place, however,it is impossible to determine whether data is actually movedbetween processors. 2.4.5. Summary Parallel-C attempts to provide explicitly parallelconstructs that can be used in both a synchronous andasynchronous execution environment. Most of the actualdiscussion of the language, however, involves only thesynchronous constructs. The use of selectors in parallelexpressions is an elegant means of specifying parallelprocesses, and can likely be incorporated into a unifiedprogramming model. Unfortunately, the control flowsemantics in the language are inconsistent at times, and donot appear to be extendible to an asynchronous executionmodel.The communication model used in Parallel-C requires thatassignments between parallel variables of differing shapestake place only in elements selected on both sides of thestatement. Other assignments can still take place, but mustuse library or system calls to perform the permutation.This conservative approach may not yield the most generalprogramming model, but it does simplify the implementationof the language considerably. 2.5. C* C* [RoS87] was proposed and implemented as a parallelextension of C for the Connection Machine [Hil85] . Itsprogramming model is intended to allow easy manipulation of"data-level parallelism" through use of special paralleldata structures and operators. The Connection Machineconsists of a large number of simple processing elementsthat are effectively connected in a "hypercube"configuration. Access to the machine is provided through anattached front-end host that acts as the control processorfor the system. Functionally, the Connection Machinebehaves as a SIMD model, although C* is intended to supportboth large and fine-grained parallelism. 2.5.1. Expression of Parallelism As in other synchronous languages, C* distinguishes betweenvariables and data that reside in the front end, or controlunit, and those which are allocated in the processingelements. C* uses the terms mono and poly, respectively,but also introduces the concept of domains. A domain is astructure type that describes the memory layout of a dataprocessor; its syntax closely resembles that of a class inthe C++ language [Str86] . By definition, all code thatbelongs to a domain is parallel, while all other code isserial. Poly variables are only parallel if accessed inparallel code. In C*, only code that is activated within adomain may be executed in parallel. This activation isperformed with a selection statement, as defined below:[domain tag] . statementThe selection statement activates each instance of thespecified domain, allowing parallel execution of thesubsequent statement. When the statement is completed, allinstances of the domain are deactivated. 2.5.2. Conditional Constructs The control flow model in C* is perhaps the most complex(and general) set of semantics yet proposed for control flowin an explicitly parallel language. All standard Cconditional constructs are supported in C*, although thegoto statement has not yet been implemented.The C* programming model provides a single locus of control,referred to as the "master PC." The "master PC" is analogousto the program counter in a sequential architecture, and isconsistent with the synchronous (SIMD) model of executionpresented by the Connection Machine. In parallel code (e.g.code invoked within a domain), each data processor selectedby the current domain is also assigned a "latent PC" that isdesignated as being either "active" or "waiting." An activelatent PC is always at the same point in the code as themaster PC, while a waiting latent PC is elsewhere in theprogram. The desired effect of this model is that parallelcode should be executed as if each processor is executing aseparate copy of the code, but with all processors remainingsynchronized. To this end, C* proposes a number of rules togovern the semantics of conditional and iterativeconstructs, including the Rule of Merging Control and theBlock Synchronization Rule.The Rule of Merging Control asserts that whenever the masterPC passes a point during execution where one or more latentPCs are waiting, all such latent PCs become active andcontinue execution with the master PC.The Block Synchronization Rule is the heart of the C*control flow model. This rule states that whenever themaster PC reaches the end of a statement or substatement,every active latent PC becomes inactive and must "wait" atthe point immediately following the statement. At thispoint, execution is halted and the master PC is transferredto a new point in the code. The new point is selected byexamining the just completed substatement and then searchingoutward through the lexical hierarchy of nested statementsuntil a waiting latent PC is found. Control is thentransferred to this point via the master PC and executioncontinues. 2.5.3. Iterative Constructs In C*, iterative constructs are handled exactly asconditional constructs, by invoking the BlockSynchronization Rule. As in Parallel C, all standard Cconstructs supporting repetitive evaluation of a conditionalconstruct are supported. The set of active processorsparticipating in a loop may only decrease monotonically. 2.5.4. Assignments and Communication C* attempts to provide a more general set of assignmentsemantics, but still tends to infer too much from theexecution model of the target architecture. Assignmentsemantics in C* can essentially be summed up in two rules:the Replication Rule and theAs-If-Serial Rule.The Replication Rule statesthat a scalar value is "automatically" replicated wherenecessary to form a parallel value. The word"automatically" again implies the presence of a broadcastmechanism in the hardware, a rather reasonable assumptionshared by Parallel-C (and many other parallel languages).The As-If-Serial rule assertsthat some sequential ordering will be applied on anoperator-by-operator basis for every active processor in astatement. In other words, operations on poly (parallel)variables take place as ifthe active processors performed the operations in someunspecified (and unpredictable) order. These semantics arecritical, for example, when performing reduction operationswhere a number of processors are updating a single mono(scalar) variable. The As-If-Serial rule is said tofacilitate parallelism by imposing sequential semantics.While this may seem intuitively true for cases where theleft-hand side of an assignment is a mono variable, it lessclear in the case of parallel assignments. Application ofthe As-If-Serial rule also brings up an interesting case ofbroken symmetry, where statements that would be equivalentin a sequential language lead to different results when theoperands are parallel values.The C* semantics for assignment statements differ from thosepresented in Parallel-C in two important ways. First,assignments from parallel to scalar variables are supportedthrough invocation of the As-If-Serial Rule. Thus, C*allows multiple reduction and multiple broadcast operationsto take place among values of arbitrary shape. The seconddifference involves the use of pointer indirection as ameans of interprocessor communication. In C*, the statement*p = *q;sends a "message" containing the value stored at location qto location p. Although this appears to be a desirablefeature of the language, it infers a great deal from theunderlying machine model. 2.5.5. Summary C* makes an ambitious attempt to specify a completelygeneral synchronous model of computation. The languagesemantics are fairly self-consistent, with the possibleexception of some of the control flow constructs, asdescribed above. The primary difficulty with C* involvesthe inherent SIMD synchronization present in the semanticsof the communication model. By specifying that operands toassignments are evaluated on each side of the statement forall instances before any assignments are made, C* preventsasynchronous execution from ever taking place. If anordering of evaluations is desired, the programmer couldsimply specify appropriate code to accomplish what the As-If-Serial Rule requires.For instance, the assignment:x = y;where x and y are defined as poly variables, will firstevaluate x in each active processor, then evaluate y in asimilar manner, and finally make the appropriateassignments.Instead of requiring such an ordering, the same effect couldbe attained by replacing the above assignment with:tmp = y;x = tmp;The less restrictive semantics needed for this second caseare preferable in a general parallel programming model. 2.6. Conclusions Based on the above discussion of synchronous languages, acouple of observations can be made concerning the expressionof parallelism through variable declarations. First,variable declarations should provide a means of "hiding" themachine topology by directly expressing the datarelationships of the algorithm to be executed. Thischaracteristic is especially important if the program is tobe scalable and portable. The second observation involvesthe use of domains, as in C*. Although domains are areasonably elegant way to describe the memory layout ofprocessors, they are not (as currently specified) easilyextended to an asynchronous programming model.One of the problems inherent in all of the control flowmodels discussed thus far is the failure to differentiatebetween three basic attributes of parallel computation. Ina sequential language, semantics need only specifyhow a construct is to beinterpreted in the context of a given expression orstatement. In a parallel language (explicit or otherwise),the semantics must specify not only how, but alsowhen andwhere a construct can (or cannot) be applied. It ishardly a coincidence that a compiler needsexactly this information in order toperform automatic parallelization.In most synchronous programming languages, the semantics forconditional and iterative constructs provide a reasonablyconsistent definition of how and when a particular constructis to be executed, but they do not always address the issueof where the construct may be applied. Consider, forexample, the goto statement. Most languages either forbidthe use of such a statement in parallel code (e.g. Actus),or restrict the control flow model in some asymmetric way(e.g. Glypnir).Languages such as Parallel-C and C* attempt to define a moregeneral control flow model, but fail to specify self-consistent semantics for arbitrary control flow patterns.For instance, the C* language specification does notrestrict the use of a goto statement within parallel code,yet it is unclear what would result of a branch from thebody of one conditional statement to that of another,especially if the set of active processors differs in thetwo code sequences. The C* code sequence of Listing 2.1illustrates this point.extern int random();domain grad_student { short salary; double hours; int knowledge; int passed_qualifier; mono char *poly adviser;} Jim, Mike, Scott, Carla, Todd, Greg; [domain grad_student].{TEST: passed_qualifier = (random() & 1); if (!passed_qualifier) { knowledge = 0; goto STUDY; } else { knowledge += 100; } while (knowledge < 1000) { ++knowledge; if (knowledge > 500 && !passed_qualifier) { goto TEST; }STUDY: hours += 2.0; } } Listing 2.1 Arbitrary control flow in C*Admittedly, this is a rather contrived example, but itcould be executed on a sequentialmachine. In a parallel model of execution, however, the setof processors that execute the if substatement will likelydiffer from those enabled in the body of the whilestatement. In this case, the C* control flow model appearsto support contradictory semantics. The semantics foriterative and conditional constructs state that the set ofactive processors for each iteration can only decreasemonotonically, yet the Rule of Merging Control asserts thatthe latent PCs "waiting" at the STUDY label will becomeactive as the master PC passes that point in the code.Clearly, an ambiguity results. A similar situation couldresult when a goto statement branches from one domain toanother, although the latent PCs that exited the domain viathe goto statement would presumably be "destroyed" uponcompletion of the code block containing the goto statement(as a result of the Block Synchronization Rule).Although virtually all synchronous languages share theproblem illustrated above, the means of specifying paralleldata in such models can be abstracted to a more generalmodel where no distinction is made between a process and theprocessor on which it executes. This concept will beexplored in greater depth in Chapter 4.

3. ASYNCHRONOUS LANGUAGES

3.1. Asynchronous Constructs While synchronous languages tend to express parallelismthrough data layout and algorithm topology, asynchronouslanguages focus on processes and the means of interprocesscommunication.Two basic methodologies have been developed to deal withasynchronous parallelism, one based on the model presentedin CSP [Hoa78] , and another that is often referred to as asingle program, multiple data(SPMD) model. Examples of each language type will beconsidered in this chapter, including Occam, Concurrent C,Linda, and the Force.Although the two models differ somewhat in style andimplementation, both share a common means of expressingparallelism through asynchronous constructs:specification, creation, and termination of sequentialprocessesconnections to, and communication with, other sequentialprocessesUnlike synchronous constructs, most of the problems withasynchronous constructs lie in the difficulty of efficientlyimplementing the language model on existing parallelmachines and in the fact that static analysis of programbehavior is both difficult and of limited use. 3.2. Occam Occam [Inm86] was developed as a language for programmingsystems constructed using multiple Inmos Transputer chips.Although the transputer chip only provides for local memory(memory directly accessible to just one processor), thetransputer chip also incorporates four high-speed seriallinks that can be used to send information to othertransputers.Occam is based on the Communicating Sequential Processes [Hoa78] model for parallel execution, in which input andoutput primitives are treated as basic building blocks ofstructured communication between otherwise sequentialprocesses. Nondeterminism in the CSP model is handledthrough the use of guarded commands that permit input andoutput specifications to be integrated with control flowconstructs. 3.2.1. Specification of Processes The primary means of specifying executable code sequences inOccam is a process. Occamcontains three primitive processes, each of which can becombined with other primitives to form more complexsequential or parallel processes. The primitive processesdefined in Occam are assignment, input and output.Assignment simply involves changing the value of a variable,much like assignment statements in sequential languages.Occam variables are declared without type, and arerestricted in scope to the process in which they aredefined. For example, the Occam statementa := b + csimply sums the values of variables b and c and assigns theresult to the variable a.The input and output primitives, signified by the symbols ?and !, respectively, are used to pass values betweenprocesses. The Occam statementsin ? x;yout ! x+ywill input values for the variables x and y from the channelnamed in, add the two values, and output the sum to thechannel named out. In general, any valid Occam expressionmay be specified as the value to be passed to anotherprocess via an output statement. The use of the input andoutput primitives are discussed further in the followingsection.Primitive processes can be combined through the use ofconstructors. The three basic Occam constructors are seq,par, and alt, corresponding to sequential, parallel, andalternative processes. In a sequential process, primitivecomponent processes are executed in a textually specifiedsequence; the process terminates upon completion of thefinal component process. Parallel processes allow componentprocesses to execute concurrently, with the parallel processterminating only when execution has completed in allcomponent processes. Parallel processes are differentiatedon the basis of the indentation of the code. The textualorder of the processes is insignificant, since the processesproceed concurrently.A slightly different process specifier is the altconstructor. The alt constructor specifies that one of thecomponent processes is to be arbitrarily selected andexecuted. Each component of an alternative process has a"guard" associated with it as a means of selecting one ofthe component processes for execution. The guard consistsof an input primitive along with an optional Booleanexpression. An alternative process remains idle until atleast one guard condition has been met. If the condition issatisfied in more than one process, one of the processes isselected at random. Execution of the selected process thentakes place.A simple Occam program is illustrated in the Occam codesegment of Listing 3.1.chan ina, inb:chan outa, outb:chan intesta, intestb:seq prod1 := 0 prod2 := 0 par seq var testa: intesta ? testa while testa var a: seq ina ? a prod1 := prod1 * a outa ! prod1 seq var testb: intestb ? testb while testb var b: seq inb ? b prod2 := prod2 * b outb ! prod2 Listing 3.1. Occam process specificationThe two while loops in Listing 3.1 are executed in parallel,but the code within each loop is executed sequentially. TheBoolean expression true is evaluated as part of therepetitive while process, and, in this case, indicates thatthe loop processes will never terminate. In additional tothe repetitive while process, Occam provides a conditionalprocess, signified by the if keyword. Since code within anOccam process is local to that process, the semantics ofconditional and repetitive processes are quite similar tothose found in sequential languages; these constructs arenot discussed further here.Occam processes can be named in a manner analogous tosubroutines in a sequential language, although theimplementation more closely resembles that of "macro"substitutions. For example, the two parallel processes inListing 3.1 could be specified as follows:proc multiplier (chan in,out,intest var prod) = seq var test: intest ? test while test var x: seq in ? x prod := prod * x out ! prodwhere in, out, intest, and prod are formal parameters repre-senting the channels and variable unique to each process.The parallel invocation of the multipliers could then bespecified as:par multiplier (ina,outa,intesta,prod1) multiplier (inb,outb,intestb,prod2)It is also possible to specify a group of parallel processesusing a replicator to create "an array of concurrentprocesses" all executing the same code:chan in[4], out[4], intest[4], prod[4]:par i = [0 for 4] multiplier (in[i],out[i],intest[i],prod[i])would create four processes, one each for values of i equalto 0, 1, 2, and 3. This notation is concise, but the Occamscheme for specifying processes can lead to a great deal ofinefficiency.Consider a parallel algorithm that is to perform anoperation on each element of a 64 element array, on acomputer that has only 16 processors to devote to theproblem. Given the Occam model, 64 processes must becreated, despite the fact that only 16 will be executing atany given time. No provision is made in Occam forgenerating code that would divide the workload evenly amongthe 16 processors as just 16 processes. This lack ofdynamic load balancing capability becomes extremelyinefficient when the time needed to perform operations onindividual elements varies widely. In fact, the effectiveexecution rate in such cases becomes bounded in a mannerthat is similar to execution on a SIMD architecture. 3.2.2. Communication Between Processes In Occam, a process is generally associated with aprocessor, and connections between processes are referred toas channels. A channel connects two parallel processes suchthat one process can "send" a value onto a channel and asecond process can "receive" that value from the channel. Aprocess can communicate with any number of other processes,but communication is only one-way; a second channel must beestablished between two processes if two-way communicationis to take place. Consider, for instance, the followingOccam code sequence:chan in, out:var x:seq in ? x out ! xThis sequence defines two I/O channels, IN and OUT. Theprogram first reads a value from the channel called IN(using the input operator, "?") and stores this value intothe variable called X. The value is then sent out on thechannel called OUT (using the output operator, "!").By treating communication primitives as programming objectsthat can be manipulated as data and variables, Occam hasprovided a clean, if not overly elegant, method ofexpressing communication between processes. The problemwith communication in Occam, however, is that transactionsbetween processes can only be expressed through a channel.The destination of a message between processes can not bedirectly specified, but instead must be explicitly routedthrough intermediate processes. This scheme not only lowersthe efficiency of processes that must pass messages alongfrom neighboring senders and receivers, but also requiresthat an algorithm be carefully mapped onto the machinetopology. Thus, Occam programs are typically tied to aparticular interconnection pattern and cannot be easilymoved between machines. 3.2.3. Summary Although Occam provides a concise representation of the CSPprogramming model, it is very closely linked with the Inmostransputer for which it was implemented. The necessity ofexplicitly specifying communication channels leads to codethat becomes difficult to move between systems or evenmachine configurations. The synchronous message passingmodel also presents difficulties in that synchronization andcommunication issues are once again inseparable.Occam process control is flawed in that the programmer musteither bear the inconvenience of writing scheduler code orsuffer inefficient execution due to ineffective processmanagement. 3.3. Concurrent C Concurrent C [GeR86] extends the CSP model used in Occam toprovide a more general (and complex) concurrent programmingmodel. Like Occam, Concurrent C chooses a synchronousmessage passing scheme to facilitate parallelism, withprocesses proceeding independently except when communicationtakes place. Concurrent C rejects a shared memory model ofasynchronous computation on the basis of intendedapplications, such as a collection of workstations connectedby a local area network.The first distributed implementation of Concurrent C targetsseveral identical computers connected by an Ethernet. Inthis implementation, however, only processes on the samecomputer can share global data. 3.3.1. Specification of Processes Concurrent C separates the definition of a process from itsinstantiation. A process definition consists of a type(specification) and a body (implementation). Eachinstantiation of a process definition is referred to as aprocess. Concurrent C processes are sequential in nature,and follow directly the semantics of a sequential C program.Each process effectively has its own set of resources, suchas a stack, registers, and program counter. The schedulingof processes on the underlying architecture is not specifiedby Concurrent C, as this is considered to be animplementation issue. In the definition of a process, onlythe type is visible to other processes; the body of adefinition cannot be externally accessed.A process type is defined as:process spec process-type-name (parameter declarations) {transaction declarations};For instance, the declarationprocess spec semaphore (int count){ trans int inc(); trans int dec();}defines a process that has one parameter and two associatedtransactions.A process body specifies the statements that are to beexecuted whenever the process is invoked. In many respects,a process body is similar to the body of a function in C,except that a process cannot return a value. The format ofa process body is:process body process-type-name (process-parameter-names) statementA process is invoked through use of the create operator asfollows:create process-type-name (initial-values) [priority(p)]The integer expression p is an optional parameter that givesthe process a priority relative to some implementation-dependent standard. A number of built-in functions are alsoincluded in Concurrent C to assist in the handlingpriorities, but these fall beyond the scope of the currentdiscussion. Each created process returns a unique id thatcan be stored in a process variable of the same type.Arrays of process variables and pointers to processes arealso valid in Concurrent C. For instance, given the processdefinition of semaphore above, the following variables couldbe defined:process semaphore s, sa[8], *sp;A Concurrent C program begins with a single active processthat calls the function main(); execution of the programproceeds from the body of the main function.A process can exist in one of three states at any giventime. A process is activeas soon as it is created, and remains so throughout theperiod during which the statements of the process body arebeing executed. A completed process is one that hasexecuted a return statement in the process body or hasreached the end of the body statements. Finally, a processis said to be terminated when the process and all processescreated by it have completed. A process can also becometerminated by executing a terminate alternative as part of aselect statement. (The select statement will be discussedin the following section). 3.3.2. Communication Between Processes As in Occam and CSP, communication in Concurrent C isperformed by a synchronous message passing protocol. Themessage passing primitives collectively form arendezvous between two processes, where oneprocess acts as the client and the other acts as the serverof the message. Whereas communication in Occam isunidirectional (a simple rendezvous), Concurrent C supportsbidirectional communication referred to as anextended rendezvous, or a transaction. Atransaction in Concurrent C has the format:process-value.transaction-name (actual-parameters)where process-value specifies a specific process, andtransaction-name designates a transaction that is containedin the process type of the chosen process. Arguments arepassed by value as in C function calls.A transaction is initiated, or called, by the clientprocess, which is then "blocked" from further executionuntil the requested server process accepts the transactioncall. When a rendezvous is established, the message ispassed from the client to the server. The client remainsidle while the server performs the requested task. Uponcompletion of the task, the server returns the results, ifany, to the client process, which may then proceed withexecution.For each transaction within a process, the process typespecifies a name for the transaction, the types of thearguments passed to the server, and type of value returnedto the client. Thus, in the semaphore process definedabove, both the inc and dec transactions return an integervalue but have no arguments. The following exampleillustrates the creation of a process and a call to themember transactions of the server process s.int i;process semaphore s;s = create semaphore(8);s.inc();...s.dec();This example demonstrates a pair of transactions from theperspective of the client. Since Concurrent C transactionsare bidirectional, it is also necessary to examine thelanguage mechanisms provided to service a transactionrequest. The accept statement is a transaction as seen fromthe side of the server, or called process. The acceptstatement has the form:accept transaction-name (parameter-names) [suchthat (e)][by (ae)]where suchthat specifies an optional boolean expression thatinvolves one or more of the accept statement parameters, andby specifies an arithmetic expression that denotes anordering of the transactions to be accepted (the default isFIFO order). In a sense, the suchthat clause serves the samepurpose as processor address masks in synchronous languages,since only those transaction calls that have an expressionthat evaluates to be true are considered for acceptance.Concurrent C provides a select statement to handlealternative, nondeterministic events. The format of theselect statement is:select { [(guard1):] alternative1or [(guard2):] alternative2...or [(guardn):] alternativen}where a guard is a booleanexpression and analternative is block of Concurrent C statements. The orderof evaluation of guard expressions is unspecified, and allguards are not guaranteed to be executed. Exactly onealternative is chosen and executed by a select statement.If guard expressions for multiple alternatives are evaluatedas true, a series of precedence conditions are followedbased on the type of the alternatives. The type of analternative is determined by the textually first statementappearing within each block. Highest priority is given toan alternative that begins with an accept statement forwhich a call is pending. Next in line is an "immediate"alternative (an alternative that contains no accept, delayor terminate statements).Thus, communication between processes is intricatelyinterleaved with control flow in the Concurrent C model.Nondeterminism is inherent in the control flow structure,and the programmer must take care to ensure that desiredprogram behavior results when using Concurrent C. 3.3.3. Summary Concurrent C introduces a fairly complex set of newoperators and program structures to support asynchronousparallel programming. As in Occam and CSP, the semantics ofprocess specification are fairly consistent, and code withina process body usually executes as if it were on auniprocessor. A synchronous message passing model is usedfor interprocess communication, but unlike Occam, thiscommunication can be bidirectional. Concurrent C manages toavoid some of the pitfalls of Occam in regards to processconnection patterns, but still falls short of the ideal ofindependence from machine topology. Through the use oftransaction pointers and process variables, Concurrent Cprocesses are able to communicate directly with otherprocesses, regardless of the physical position of theprocesses in the system. Transactions need not traverseneighboring processes that otherwise would not be involvedin the exchange of information in order to reach adestination. Where Occam tends to map only to a particularmachine (transputer), Concurrent C maps to a class ofmachines - namely, those that do not use a shared memorymodel of execution. ( Actually, Concurrent C can beimplemented on a shared memory system, but constructs formanaging shared memory were deliberately left out of thelanguage to ensure portability. )The primary problem with the Concurrent C model is theinability to express synchronization and communicationindependently. As currently specified, the synchronousmessage passing model requires synchronization in all cases,including those in which synchronization may not benecessary. 3.4. Linda A dramatic departure from the synchronous message passingmodel of Occam and Concurrent C is presented in Linda [GeC85] .Linda was developed as a programming tool to simplify thespecification of explicitly parallel programs. The primarygoal of Linda is to uncouple the spatial and temporalaspects of process management in a parallel processingenvironment. The Linda system consists of a preprocessorand compiler that translate language constructs into hostlanguage code, and a host language-independent runtimekernel that handles interprocess communication. Lindaconstructs have been added to both C and Fortran hostlanguages, and the sytem kernel has been implemented onAT&T Bell Labs' S/Net multicomputer, a MicroVAX network,and the Intel iPSC hypercube. 3.4.1. Specification of Processes Linda, despite its sophisticated management ofcommunication, does not stress the specification of parallelprocesses. The most unique feature of the Linda system isthe use of a shared tuplespace (TS) as the memory model. Rather than accessingconventional memory by addresses, Linda uses a logical tuplename to refer to data objects. In some respects, Lindadeliberately avoids the specification of processes, optinginstead to use distributed data structures as the means ofexpressing parallelism. Unlike many asynchronous languages,Linda allows shared data objects to be accessed by anyprocessor. There is no notion of a "manager process" or"monitor" that accesses shared data on behalf of arequesting process. A Linda program is not partitioned intoprocesses to be assigned to processors. Instead, Lindaemploys a "replicated worker" model that replicates aprogram sequence several times to form processes thatexecute simultaneously on distributed data structures.Processes in Linda are therefore uncoupled, with no aspectsof a machine or algorithm topology present in the model.Linda's process management scheme relies heavily on theability of the underlying system to perform dynamic taskscheduling, especially when the time required to completeindependent tasks varies significantly. The Linda modelattempts to support dynamic scheduling by providing adistributed data structure referred to as a "task bag." Allprocesses receive work assignments from the task bag, andadd to the bag any additional tasks that are created duringthe course of executing the current assignment. This schemealso leads to scalable programs, since increasing the numberof available processors will likely increase, and neverdecrease, system performance. 3.4.2. Communication Between Processes The model of communication in Linda is far more abstractthan that of Occam or Concurrent C. Through use of theshared tuple space, all processes may directly communicatewith one another. Hence, rather than sending data to adestination (Concurrent C) or through a channel (Occam), aLinda program places tuples of data into the tuple space.Once a tuple has been placed in the TS, any process mayexamine or remove the tuple from the TS. A tuple in Lindaconsists of character string that acts as a logical name, aswell any other specified variable fields. Access to aparticular tuple is granted by matching the name tag and allother tuple fields. Hence, the tuple space is essentiallyan arbitrarily large, fully associative, tuple memory sharedby all processes.Parameters specified without type are assumed to beactual parameters; the keyword formalmay also be used to treat a previously declared parameter asan actual value during an operation on the tuple space.The tuple space may be accessed through the use of threeprimitive operations: out(), in(), and read(). The out(t)and in(t) operations cause a tuple t to be added to orwithdrawn from the tuple space. The read() operation makesa copy of the requested tuple parameters, if found, withoutactually removing the tuple from TS. Thus, successfulcompletion of either the read() or in() operations isdependent upon an exact match of the requested tupleparameters and those of the named tuple currently residingin the tuple space. A fourth, non-primitive operation,eval(), will allow unevaluated tuples to be added to thetuple space.An example of communication via the tuple space isillustrated below:/* place ("abc", 5, 6) into TS */out("abc", 5, 6);/* get tuple called "abc" from TS *//* here, i is set to 6 */in("abc", 5, int i);There are three basic difficulties in using this model. Thefirst is that sequential segments of a Linda program aredifficult to describe statically. For instance, it isdifficult to determine which process will in the tuple thatanother process put out to TS. Although the awkwardconstraint of Occam that communication take placeexclusively between neighboring processes is removed, thefact that Linda provides only a single shared name space fortags makes it difficult to ensure that multiple processeswill not unexpectedly generate the same tag.The second problem is that the fully associative nature ofthe processing of TS tuples makes it difficult to constructan efficient implementation. While shared memory models donot necessarily imply the presence of physically sharedmemory in hardware, an associative tuple space requiresspecial hardware that is not available in most systems.This limits the portability of the Linda system.Finally, the use of a single shared tuple space requires allvalues to be globally visible. As a result, optimizationsbased on locality of reference and lexical scoping cannot beused for programs written in Linda. 3.4.3. Summary Linda represents a fundamental change of direction inasynchronous programming models. Rather than support asynchronous message passing model, Linda allows data to beglobally shared throughout the system. Instead of sharingmemory, however, Linda systems share a tuple space thatcorresponds to names of data objects.Although Linda may present some difficulties as a generalparallel programming model, the language semantics are verystraightforward. As a result, it may be possible to useLinda as the target of a higher level language that presentsthe user with a more general shared memory model.Transformations between shared variables and entries in thetuple space are made possible through use of the in() andout() operations described above. The efficiency of suchtranslations is likely to remain extremely machinedependent, however. 3.5. The Force Although the programming models used by "languages" such asthe Force [Jor87] , BBN Uniform System [Bab88] and IBM EPEX [Bab88] are quite different from those used in CSP, Occam,Concurrent C or Linda, they still tend to support anasynchronous style of parallel programming. In particular,the programming model used in the Force, Uniform System andEPEX is often described as a SPMD model for multiprocessorsthat utilize a shared memory space. The Uniform System wasdeveloped primarily for the BBN Butterfly, although it hasessentially been replaced with the Mach programmingenvironment of Carnegie Mellon University. EPEX wasinitially designed for use on the IBM RP3 system, but hasalso been implemented on the IBM 3090 series. The Force hasbeen implemented on a number of machines, including theDenelcor HEP, the Flex/32, and the Sequent Balance Series.These machines all share the characteristics that mainmemory is directly addressable by each processor and thatmemory access time is essentially independent of the memorylocation being referenced. As a result, communicationbetween processes is accomplished primarily through the useof shared variables, rather than the message passing schemeused by languages discussed previously. For the purpose ofthis discussion, a shared variable is any variable that maybe directly accessed by any process in a system, regardlessof the physical attributes of the underlying hardware.Thus, shared variables could conceivably be present even insystems that do not support shared memory in the systemhardware.It is interesting to note that none of the SPMD systemsdescribed above is actually a language; each consists of aset of language constructs that are used as macros withinconventional sequential programs. The macros are resolvedby a preprocessor, and code is generated by a standardsequential compiler. Rather than describe the exact syntaxof all three systems, this discussion will focus on one -the Force. The discussion is relevant to EPEX and theUniform System as well, since all three systems share asimilar style and structure. 3.5.1. Specification of Processes The most fundamental characteristic of the Forcecomputational model is the lack of explicit programmercontrol over process management. Processes are nevercreated or destroyed over the course of execution in a Forceprogram; processes are allocated and created at programinitialization by low-level system dependent code. Unlessrestricted by an explicit parallel primitive, all statementsin a Force program are executed by all processes inparallel. The number of processes assigned to a program isalways independent of the number of actual processesavailable on the system. The management of both actual andvirtual processes is therefore hidden from the Forceprogramming model. Despite the lack of process control, TheForce model does support the notion of separate instructionstreams, thus supporting asynchronous parallelism.A Force program begins with the declaration of a force ofprocesses as follows:Force <name> of <no. of procs> ident <proc id> <variable declarations> [Externf <module name>]End declarations <Force program>Joinwhere parallel environment variables are visible to themacros within the program, and runtime information regardingthe number of processes and the current process identifieris available to the programmer. The Join macro signifiesthe end of the main program in the Force. A parallelsubroutine can be invoked in a similar manner, with themacro Forcesub used in place of the Force macro above.Parameters are passed to a Force subroutine just as theywould to a conventional Fortran subroutine. A FortranRETURN statement is used to terminate a Force subroutine;all processes must execute the statement at some time,however.Macros corresponding to control flow operations in asequential language actually deal with distributing the workload in a Force program. Loop structures are based on theDOALL construct of parallel Fortran implementations. In aDOALL loop, each process executes the body of the loopstatement for a different value of the index variable,provided that their are no restrictive or ambiguous datadependencies between iterations of the loop. The loop bodyis therefore treated as a sequential code block by eachprocess, with no particular ordering specified for executionof the different iterations. The Force provides twovariations of the DOALL construct, as illustrated below:Presched Do <n> <var> = <i1>,<i2>[,<i3>] <loop body><n> End Presched DoSelfsched Do <n> <var> = <i1>,<i2>[,<i3>] <loop body><n> End Selfsched DoThe Presched Do macro statically assigns index values toprocesses on the basis of the number of index values to beiterated and the number of processes available in the force.The Selfsched Do macro, by contrast, allows processes todynamically schedule themselves over the iterations of theloop by acquiring the next available iteration from a sharedindex variable. Doubly nested loops may also beparallelized in certain situations through use of the Pre2doand Self2do macros that closely resemble the aboveprescheduling and self-scheduling macros in both format andusage.A parallel case statement is also provided to distributeindependent single stream code blocks across the force ofprocesses. The Pcase macro is declared as:Pcase on <variable>[Pcond(<Fortran IF condition)] <code block>[Usect][Pcond(<Fortran IF condition)] <code block> ...End PcaseThe conditions associated with each block of code areevaluated in an arbitrary order, each by a single process.Since each "case" is associated with only a single(arbitrarily chosen) process, conditions are generally mostuseful when involving shared variables. 3.5.2. Communication Between Processes Unlike languages based on the CSP message passing model, theForce uses shared variables to implement interprocesscommunication. In the Force environment, a variable musteither be private to a single process, or globallyaccessible by all processes. The distinction betweenprivate and shared variables is deliberately keptindependent of textual scoping constructs, such as theFortran common directive, to avoid confusion of parallelattributes with scoping issues. The format of variabledeclarations in the Force is:Private <type> <variable list>Shared <type> <variable list>Async <type> <variable list>where Async variables are a special class of sharedvariables that are to be used in certain synchronizationconstructs to be described later in this section. In theFortran implementation of the Force, the type of a variablelist in the above declaration formats may be preceded by thedirective Common /<label>/ as in conventional Fortranprograms. Shared variables may be directly referenced byname, as can Private and Async variables. Thus, addition ofparallel attributes in the Force do not affect the namingconventions of the underlying language.Since explicit process control is not visible to theprogrammer in the Force, there is never a need to "send"values to, or "receive" values from, other processes.Values are instead passed between processes throughvariables declared as being shared. One side effect of thiscommunication model is the need for explicit synchronizationprimitives. Languages such as Occam and Concurrent Cprovide implicit synchronization in the message passingmodel, since processes requesting service must wait for therequest to be accepted by another process. The Forcecomputational model separates the issues of communicationand synchronization to some degree, and must thereforeprovide a set of appropriate constructs to managesynchronization. The synchronization macros in the Forcecan be divided into two classes, depending on whether theydeal with control flow or data synchronization.The primary control-oriented synchronization macro is usedto implement barriers across processes. The format of thismacro is:Barrier <code block>End BarrierThe semantics of a barrier in the Force require that allprocesses execute the Barrier macro before one (arbitrarilychosen) process proceeds to execute the block of code withinthe barrier definition. When this execution is complete,all processes resume execution with the statement thatimmediately follows the End Barrier macro. These semanticsunfortunately limit the ability of a programmer to breakprocesses into subtasks, since only a single process mayexecute the code within a barrier block. This requirement,of course, was necessary in order to avoid explicitspecification of processes - a clear objective of the Forcemethodology.A second control flow synchronization object allows criticalsections to be specified for any shared variable. Theformat of a Force critical section is:Critical <lock-variable> <code block>End Criticalwhere the specified code block may only be executed by asingle process at a time in all critical sections thatspecify the same lock-variable.Data synchronization is based on the notion ofproducer-consumer pairs, where sharedvariables contain both a value and a binary state thatindicates whether the variable isfull or empty. Datasynchronization objects must be declared as Async variablesto convey the fact that they may be accessed in anasynchronous manner by various processes. The datasynchronization mechanisms in the Force are described asfollows:Produce <async variable> = <expr>Consume <async variable> into <private variable>Copy <async variable> into <private variable>Void <async variable>Isfull(<async variable>)where an invocation of any of the above processes isconsidered to be an atomic (indivisible) operation withrespect to other processes. The Produce macro waits for theasynchronous shared variable to be set empty, sets its valueto the result of the specified expression, and marks thevariable as being full. The consume macro waits for theasynchronous variable to be set full, at which time itsvalue is assigned to the specified private variable and theasynchronous variable is once again marked as empty. Copy issimilar to Consume, except that the asynchronous variable isnot set the empty state after the assignment is made. TheVoid macro sets the state of an asynchronous variable toempty, regardless of the original state of the variable.Finally, the Isfull() macro acts as a logical function callthat returns the state of an asynchronous variable withoutblocking on either full or empty states. Care must be takento insure that deadlock ( For the purpose of thisdiscussion, deadlock refers to any situation in which aproducer and consumer of a shared object reach a state whereneither can proceed without the other process changingstate. ) does not occur in code that uses producer/consumersynchronization, since the Force does specify any deadlockprevention mechanisms. 3.5.3. Summary The problem with the SPMD model is primarily that sharedvariables impose very little structure on communication andrelatively small penalties for having shared structures(especially on the machines for which these systems weredeveloped), hence there is a tendency to write SPMD programsin which very little concern is invested in minimizingvisibility of data. To communicate between two processes,one must create a shared variable that is visible toall processes -- essentially thesame problem encountered with Linda's mechanism, except thatreferences are made using a name given at compile-timerather than a runtime associative lookup. The problem isnot unlike that of developing large sequential programs asindependent modules using a programming language that onlyprovides for global variables.Of course, not all communication is for the purpose oftransmitting values; there is also the problem ofcommunicating synchronization objects. There are a widevariety of synchronization mechanisms supported, but the keyissue is really the determinism of the program's execution,not the primitives used to achieve that goal. 3.6. 2 Conclusions As opposed to synchronous languages, where parallelism isexpressed through data layout and algorithm topology,asynchronous languages focus on communication betweenprocesses as the means of managing parallel execution. Themodel of communication varies from the message passing ofCSP to the shared memory SPMD model of the Force.Regardless of the model, however, it is apparent thatsynchronization has traditionally been closely associatedwith communication in asynchronous languages. Thus, the keyaspect of asynchronous languages, as opposed to synchronouslanguages, appears to be the inclusion of explicitsynchronization.Synchronization in a programming model is not really theissue, however; it is simply a means to an end, where the"end" is actually determinism. Although the set ofsynchronization mechanisms in most asynchronous parallellanguages is quite rich, the provision for specification ofdeterminism is absent. Nondeterminism is inherent inconstructs such as the select statement of Concurrent C andthe Pcase macro of the Force, but the programmer has nogeneral means of expressing desired or acceptablenondeterminism in an algorithm. The nondeterminism presentin all asynchronous models examined thus far is based onlyon the selection of one of several events that occur in anunknown order. Determinism is enforced throughsynchronization constructs in all other cases.For parallel execution of a program, determinism refers tothe ability of a code segment to produce the same output fora given set of inputs over a large number of runs.Deterministic execution will occur if and only if the codeis free of both races and deadlocks.A race occurs when two or moreprocesses, either executing concurrently or randomlyinterleaved, attempt to access the same variable and atleast one of the processes is storing into (writing) thatvariable. A binary race can therefore be either a "read-write" race or a "write-write" race.A deadlock occurs when two ormore processes are waiting for events (e.g., changingvariable values) such that none of the events will occurbecause each event depends on the prior completion ofanother such event.Although races cannot always be detected by static flowanalysis, the potential for a race can be detectedstatically and a warning generated.As for detection of deadlock, consider a binary deadlockbetween processes Pi and Pj that occurs when Pi waits forevent Ej before causing event Ei and Pj waits for event Eibefore causing event Ej. Let the place where process Piwaits be called Wi, and the place where process Pj waits becalled Wj. In terms of deadlock detection, there are twoways in which this deadlock can occur:The code is written such that for every possible "parallelflow path" the execution of both Wi and Wj occurs beforeeither Ei or Ej has executed. In general, this type of racecan be detected using static (compile-time) flow analysis.The code is written such that some, but not every, "parallelflow path" places the execution of both Wi and Wj beforeeither Ei or Ej has executed. Although this is not alwaysstatically detectable, this by definition is a problemsynthesized from two read-write races -- one for Ej-Wi andone for Ei-Wj.Hence, although it is not possible for a compiler to detectall deadlocks, detecting the first type combined withdetecting potential races willinsure that each deadlock is at least flagged with a warningmessage. The problem then becomes one of limiting thewarnings to those potential races thatactually are races and that were notintended.An important realization concerning parallel algorithms isthat some programs do not require deterministic execution.For example, a weather prediction algorithm might sacrificedeterministic execution to gain speed, since the increasedspeed would allow use of more up-to-date input data.Nondeterministic execution would be preferred over adeterministic model in such a case, since more accurateforecasts could be generated.Since none of the asynchronous languages provide a constructfor specifying whether or not non-deterministic execution isacceptable, it is not possible to use the above analysis tofull advantage. This makes programs much more difficult todebug and maintain -- a problem notoriously associated withthe asynchronous languages.Programming models such as those provided by the Force andEPEX, however, make specification of parallel structure veryeasy. Using a model based on the idea of multiple control-flow threads running through the same code, these modelsinvoke a type of structured parallel execution that isremarkably similar to that found in synchronous executionmodels.Thus, the most important aspect of the SPMD models is thatthey allow relatively complex asynchronous parallelstructure to be embodiedwithin language constructs, exactly as is done insynchronous languages. This makes it possible to merge thefundamental characteristics of both synchronous andasynchronous execution in a single programming model thatcan be efficiently implemented on a number of architectures.

4. UNIFICATION OF SYNCHRONOUS AND ASYNCHRONOUS MODELS

4.1. Characteristics of a Unified Model The wide variety of synchronous and asynchronous languagesdeveloped over the past several years has led to theevolution of two very different classes of parallelarchitectures. These classes -- SIMD and MIMD -- can bebroken into several subclasses based on specific machinecharacteristics, but the fundamental difference in the twoclasses involves the number of possible control flow pathsduring execution of a program. A SIMD execution modelcontains only a single conceptual "program counter,"regardless of the number of physical processors present inthe system. A MIMD model, conversely, may have any numberof "program counters" available during execution. Asdiscussed in previous chapters, language constructs havebeen developed for each class of architecture, generallyresulting in programming environments that are not portable,and often not scalable. The apparent incompatibility ofsynchronous and asynchronous constructs is manifested in twoissues related to the number of instruction streams possiblein the execution model of a particular architecture --control structure and data management.Proper specification of control structure and datamanagement semantics leads to a model where determinism canbe specified as an independent characteristic of parallelprogram structure. 4.2. Control Structure Control flow in synchronous languages is usually expressedwithin a construct, much as ina sequential language. Asynchronous parallel control,however, is typically expressed through the interactions ofmultiple sequential "threads" of execution. Since SIMDarchitectures conceptually use only a single programcounter, synchronous par