PiSMA: A Parallel VSM ArchitecturePiSMA: A Parallel VSM Architectureby Dimitris Lioupis, AndreasPipis, Maria Smirli, and MichaelStefanidakisIntroductionThere is an increased interest in large scale multiprocessorscomposed of a large number of processing elements. The two main modelsof multiprocessor architectures, the shared memory model [10] and the distributed memory model [1, 2, 3,4, 11], suffer from seriouslimitations because of their nature. Distributed memory architecturespromise better scalability and increased performance but are moredifficult to program efficiently. Shared memory architectures, on theother hand, have a better programming model. Their disadvantage,however, is the limited bandwidth of the shared bus, which restrictsscalability. This paper presents a parallel computer architecture, called PiSMA(Parallel vIrtually Shared Memory Architecture), introduced in [5, 6]. PiSMA architecturecombines the benefits of both multiprocessor models mentioned above,resulting in an efficient parallel architecture. The simplicity of theunderlying hardware permits the scaling of PiSMA, utilizing newgenerations of microprocessors, without the need to redesign the wholesystem. PiSMA can be used with or without a global communicationnetwork depending on the type of applications it is used for, asdescribed in [9]. In the following sections adetailed description of the PiSMA architecture, its programming model,the Virtual Memory management and message passingsupport mechanisms are presented. Description of the PiSMA ArchitectureThe PiSMA architecture forms an expandable toroidal grid withalternating processors and memories: each processor is connected tofour memories and each memory to four processors as shown in Figure 1. This structure enables each processor to communicate directly(through a common memory) with up to eight neighboring processors asshown in Figure 2. Communication with any other processor beyond these eight processors onthe grid is performed by message passing which is transparentto the user. The four ports to memory are implemented as four cachesconnected on the same bus with the memory module as depicted in Figure 3. The processor-memory cluster consists of two basic boards, namelythe processor board and the memory board. The processor board has twoversions implementing the desired connectivity and allowing for easyexpansion. The memory board contains four memory modules. Byconnecting the top and the bottom connector and the leftand the right connector, the smallest configuration possible (fourprocessors and four memories) is formed. The physical implementationof the architecture is described in detail in [5]. PiSMA is easily scalable, and biggerconfigurations can be easily constructed by repeatedly connectingtogether the basic building blocks shown in Figure3. A very important aspect of the basic PiSMA architecture is that nonetwork for intra-processor communication exists. Memory modules areused as a communication media instead, with messages being transferredto their destinations via consecutive memory-to-memory copying. Thiscopy operation is performed by processors common to each pair ofmemories. It was shown in [9] that this approachworks well for a wide range of parallel applications and that theaddition of a global communication network does not improveperformance significantly. Moreover, the absence of a network has theadvantage of easier scalability. In the case of applications withheavy global communication, however, the presence of anintra-processor communication network does improve the system'sperformance. In order to satisfy application demands, a networkcan be easily incorporated into the basic PiSMA hardware, as shown in[7], where PiSMA was used to execute specialpurpose applications such as Video on Demand.Fault tolerance is another important feature of PiSMA architecture; this capability comes from its structure shown in Figure 3, that provides alternative paths that can be used when a component(processor or memory) fails, as described in [9].Programming modelPrograms are executed on PiSMA by mapping the execution tree of aparallel program onto the grid of processors. The code is loaded in amemory module and the root of the tree (which is the first call to themain procedure) is randomly assigned to one of the adjacent processors. This processor starts executing the code and spreads work outdynamically to its neighbors, through its four adjacent memories. Thework distribution is performed by the dynamic load balancer,described in detail in [8]. Distributing workconsists of copying a work granule into a processor's adjacent memoryand inserting an entry into its task queue. A work granule is a self-contained piece of work (typically 50-500 lines of code). Each work granule consists of: a header, indicating the resources required for execution of this particular work granule the body, which contains the code to be executed. The dynamic load balancing algorithm is a key element in ourarchitecture. This algorithm attempts to evenly distribute the workgranules on the processing surface, avoiding, as much as possible,data transfers and creation of long communication paths in shared dataaccess. In order to achieve these conflicting goals, the loadbalancing algorithm draws on information included in the header ofeach work granule (such as code size and shared variable dependencies)to decide which neighboring processor is best suited to be assigned apiece of work. This decision is also based on data dependencies andeach processor's load.Virtual Memory ManagementThe virtual memory management mechanism of the PiSMA architecturesupports the mixed addressing mode (direct access and messagepassing) of the architecture. The selection of the appropriateaddressing mode is completely dynamic and transparent to the applicationprogrammer. The dynamic nature of this mechanism imposes the fact thatresources' physical locations on PiSMA are unknown until runtime.The PiSMA programming model distinguishes between shared andprivate data. Private data are expected to reside in one of the fourmemories adjacent to the executing processor and are referencedrelatively through an offset. Referencing private variables isstraightforward; the processor presents the real address on its busand reads from the appropriate memory. There is no need for specialhardware to perform this transaction; the processor sees its fourneighboring memories as a uniform and contiguous memory space. Anycompiler that references data locally (as the majority of compilersdo) into the data segment can be used to handle this kind of data.Shared data may or may not be in an adjacent memory and arealways referenced through indirection pointers. When theapplication program executes, shared data may be directly accessible(if they reside in an adjacent memory to the executing processor memory), orthey may be remotely accessed (via message passing). Therefore it isnot possible to directly reference this kind of data in an absolutemanner, since its location in the system before runtime isunknown. The application compiler generates the appropriate code to supportthis accessing scheme. For each shared resource in the applicationprogram the respective indirection pointers are allocated. After that,shared resources are referenced through these pointers.The contents of indirection pointers are undefined at compile time. At runtime (specifically, during the mapping of thework onto the PiSMA grid), the location of shared variables isestablished when they are used for the first time and the associatedindirection pointers are initialized. When a shared variable is directly accessible by a processor, thecorresponding indirection pointer contains its physical address inone of the four adjacent memories. The executing processor simplyexecutes an indirect access to its physically connected memoryspace. In this way there is no special hardware or softwareintervention; shared data can be accessed as local data.This is not the case when the shared data resources are remote tothe executing processor. The value contained in the indirectionpointer in this case is a special memory address, designated togenerate an OS trap. The indirection pointer contains additionalinformation about the shared data location in the system, such as thephysical memory module, the address and usage status of thisparticular shared resource. The generated OS trap service routinetranslates the request into a remote-data requesting message,utilizing the resource's location information from the indirectionpointer.It should be clear that the PiSMA architecture does not utilize apure global virtual memory address space scheme. That is, theaddresses generated describe uniquely a physical location in thewhole virtual memory space and a translation hardware mechanismexists to map these addresses into physical space. Instead, PiSMA usesphysical addresses when accessing data directly, and relies on asoftware translation mechanism to generate messages for remote datarequests.PiSMA does require that its data have a fixedphysical location in the system's memory when it is used, as it is anon-Cache Only Memory architecture [11]. The fixed location of data resources in useavoids data coherency problems, and requires no global datainvalidation messages. Message Passing SupportMessages generated from the operating system are forwarded totheir destination through system memory. Each processor has amessage queue defined in each of its four adjacent memories. In thesequeues the eight neighboring processors store the messages, whoserecipient is this particular processor. During the processor's normaloperation, an interrupt driven software message handler queries thequeues and services the pending messages. Messages are forwarded to their destination by packetswitching from processor to processor. Each message is equippedwith a header describing its type, its destination processor'slocation on the grid, and its originator (if this is necessary). Themessage route through the processor grid is not pre-established whenthe message is generated. Each processor, uponreceiving a message, decides to forward it to one of its eightneighboring processors, or to consume it itself. The forwardingdirection is found by comparing the destination processor's location,from the message header, with the forwarding processor's own location(it is assumed that each processor has the knowledge of itsown location).Message passing always imposes high latency to the application'sexecution. Messages kept in main memory are uncachable because oftheir nature, and message reading or writing operations take manyexecution cycles. The time spent for message handling on eachprocessor is another source of overhead, delaying the applicationexecution. The PiSMA architecture tries to minimize the message passinglatency by avoiding it or hiding it. In the first case, theadvantage of the physical communication through common adjacentmemories is used, avoiding the generation of messages. Processorsworking with shared resources on a common memory can execute veryquickly, without message intervention. It is the responsibility of thediffusion algorithm to assign work granules with common resources toneighboring processors, but efficient distribution of processes isnot always possible. Sometimes the diffusion algorithm hasinsufficient information to perform an optimal job distribution, andat other times the parallel application does not allow the distributionof its data set in different system memories. The result is anincrease in message traffic (which could be excessive). In order to minimize theimpact of such a situation, the operating system of PiSMA is designedto swap out processes waiting for remote data. In this way theprocessor can execute something useful, thus hiding wait time. ConclusionPiSMA is a Virtually Shared Memory Architecture that combines boththe benefits of the shared and distributed models. A detaileddescription of the PiSMA Architecture, its Virtual Memory Managementand Message Passing support was presented and it was shown thatparallel programming in PiSMA is an easy task, without the need forknowledge of the underlying hardware. All these details aretransparent to the programmer since the operating system kernel ofPiSMA takes care of them. The encapsulation of intelligent dynamicalgorithms in the operating system kernel, such as the diffusionalgorithm and the dynamic memory management, makes PiSMA a powerfulparallel distributed memory computer with an easy programminginterface compared to that of the pure shared memoryarchitectures.References1 A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J.Kubiatowicz,B. Lim, K. Mackenzie, D. Yeung, The MIT Alewife Machine: ArchitectureandPerformance. In Proceedings of the 22nd Int'l Symp. on ComputerArchitectures,Los Alamitos, California 1995 2S. Borkar et al. iWARP: An Integrated Solution to High SpeedParallelComputing. In Proceedings of Supercomputing'88, Nov. 1988.3J.T. Kuehn, B.J. Smith, The HORIZON Supercomputing System:Architectureand Software. In Proceedings of Supercomputing'88, Nov. 1988.4D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J.Hennessy,M. Horowit, and M.S. Lam. The Stanford DASH multiprocessor. IEEEComputerJ., 25(3), pp 63-79, March 1992.5D. Lioupis, N. Kanellopoulos, CHESS Multiprocessor: AProcessor-MemoryGrid for Parallel Programming. In Cache and Interconnect Architecturesin Multiprocessors, edited by M. Dubois & T. Thakard, KluwerAcademicPublishers, pp245-257, June 1989. 6D. Lioupis, N. Kanellopoulos, M. Stefanidakis, The Memory Hierarchyof the CHESS Computer. Microprocessing and Microprogramming (JSA)(vol 38), pp 99-107, Sept 1993. 7D. Lioupis, A. Pipis, M. Smirli, N. Kanellopoulos, Architecture ofa VSM Parallel Video-on-demand server. Appeared in the specialsessionon Communication and Computing for Distributed Multimedia Systems of the10th International Conference on Parallel and Distributed Computing andSystems, Las Vegas, Nevada, October 28-31, 1998 8D. Lioupis, M. Stefanidakis, Dynamic Load Balancing on aVirtually-SharedMemory Parallel Computer System. In Proceedings of the 6thInternationalPARLE Conference, pp813-819, Athens, July 1994. 9D. Lioupis, A. Pipis, M. Stefanidakis, PiSMA: An Upgradeable FaultTolerant Approach to Parallel Processing. In Proceedings of the 4thIEEE International Conference on High Performance Computing,pp277-283,Bangalore India, Dec 1997.10C. Thaker, L. Stewart, Firefly: A Multiprocessor Workstation. InProceedingsof the 2nd International Conference on Architectural Support forProgrammingLanguages and Operating Systems, ACM, October 198711D. H. D. Warren and S. Haridi, The Data Diffusion Machine - AScalableShared Virtual Memory Multiprocessor, In Proceedings of the 1988 Int.Conf. on Fifth Generation Computer Systems, pp 943-952, Tokyo, Japan, Dec. 1988 Dimitris Lioupis (lioupis@cti.gr)is an Assistant Professorat the Department of Computer Engineering and Informatics, Universityof Patras, Greece. Andreas Pipis (pipis@cti.gr), Maria Smirli (smirli@cti.gr) and Michael Stefanidakis (mistral@cti.gr) are PhD students at the Department of Computer Engineering and Informatics,University of Patras, Greece.Want more Crossroads articles about Computer Architecture?Go to the |
|