The CAP (Computer Aided Parallelization) language is a general-purpose parallel extension to C++, developed in the context of PS2 architectures (Parallel Storage and Processing Servers). CAP enables application programmers to hierarchically specify the macro dataflow between operations performed on tokens (application data subsets). Operations are segments of sequential code performed by a single execution thread and characterized by a single input and a single output. The input and output values of an operation are called tokens. A macro-dataflow graph specifies how tokens are routed between the operations of the parallel program, i.e. the scheduling of operations. The job of the CAP programmer is to specify the tokens (i.e. C++ data structures) exchanged between operations, and the scheduling of the sequential operations required to perform a parallel operation.
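As a rough illustration of these concepts in plain C++ (the types, fields and function below are invented for illustration and are not part of the CAP runtime), a token is an ordinary data structure and a sequential operation is a single-input, single-output transform executed by one thread:

```cpp
#include <vector>

// Illustrative sketch only: a token is a passive data structure,
// and an operation maps one input token to one output token.
struct ExtentToken { std::vector<int> samples; };  // hypothetical input token
struct PlanePartToken { int pixelSum; };           // hypothetical output token

// A sequential operation: one thread, one input token, one output token.
PlanePartToken planePartExtraction(const ExtentToken& in) {
    PlanePartToken out{0};
    for (int s : in.samples)
        out.pixelSum += s;                         // stand-in for the real extraction code
    return out;
}
```

A macro-dataflow graph then only decides how such tokens flow from one operation to the next.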
Figure 3 is a graphical representation of the parallel 2-D plane extraction. At the left, the user supplies the plane extraction parameters. The plane extraction parameters are divided into several extent requests, which are sent to the appropriate extent servers. The selection of the extent server is dynamic : it depends on the plane specified by the user and on the allocation of extents to S/P nodes. After an extent has been read by the extent server, it is transferred to the compute server for plane-part extraction. Extents handled by the same S/P node are processed in pipelined fashion, i.e. the S/P node extracts plane parts from the current extent while fetching the next extent.
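The fetch-while-compute behaviour of a single S/P node can be sketched in standard C++, with std::async standing in for the separate disk-access thread (the functions below are placeholders, not CAP code):

```cpp
#include <future>

int fetchExtent(int index)      { return index * 10; } // stand-in for a disk read
int extractPlanePart(int extent){ return extent + 1; } // stand-in computation

// Pipelined loop: the fetch of extent i+1 is launched before the
// computation on extent i starts, so disk access and computation overlap.
int pipelinedExtraction(int extentCount) {
    int merged = 0;
    std::future<int> next = std::async(std::launch::async, fetchExtent, 0);
    for (int i = 0; i < extentCount; ++i) {
        int current = next.get();                      // wait for extent i
        if (i + 1 < extentCount)                       // start fetching extent i+1 ...
            next = std::async(std::launch::async, fetchExtent, i + 1);
        merged += extractPlanePart(current);           // ... while extracting from extent i
    }
    return merged;
}
```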
 1 // tokens
 2 token PlaneExtractionParametersT {
 3   // C++ class fields
 4 } ;
 5 token ReadExtentRequestT {
 6   // C++ class fields
 7 } ;
 8 token PlanePartT {
 9   // C++ class fields
10 } ;
11 token PlaneT {
12   // C++ class fields
13 } ;
14
15 // thread hierarchy
16 process ComputeServerT {} ;
17 process ExtentServerT {
18   operations :
19     ReadExtent
20       in ReadExtentRequestT* Input
21       out ExtentT* Output ;
22 } ;
23 process ClientProcessT {} ;
24 process Ps2ServerT {
25   subprocesses :
26     ExtentServerT ExtentServer[4] ;
27     ComputeServerT ComputeServer[4] ;
28     ClientProcessT Client ;
29   operations :
30 } ;
31 int SplitPlaneParameters
32   ( PlaneExtractionParametersT* FromP
33   , ReadExtentRequestT*& ThisP
34   )
35 { // C++ code }
36
37 void MergePlaneParts
38   ( PlaneT* IntoP, PlanePartT* FromP )
39 { // C++ code }
40
41
42 leaf operation ComputeServerT::PlanePartExtraction
43   in ExtentT* Input
44   out PlanePartT* Output
45 { // C++ code }
46
47
48 operation Ps2ServerT::PlaneExtraction
49   in PlaneExtractionParametersT* Input
50   out PlaneT* Output
51 {
52   parallel while
53     ( SplitPlaneParameters, MergePlaneParts
54     , Client, PlaneT Result )
55   (
56     ExtentServer[thisTokenP->ESIndex].ReadExtent
57     >->
58     ComputeServer[thisTokenP->CSIndex].PlanePartExtraction
59   );
60 }
Figure 4 is the CAP specification of the 2-D plane extraction. The tokens (lines 2 to 13) exchanged between operations in the CAP program are PlaneExtractionParametersT, ReadExtentRequestT, PlanePartT and PlaneT. The thread hierarchy describes the threads running on the parallel architecture. There is a disk-access thread (ExtentServer[*], line 26) and a computation thread (ComputeServer[*], line 27) for each S/P node, and a thread representing the user (Client, line 28). The ExtentServerT features one predefined operation, ReadExtent. The parallel plane extraction (lines 48 to 60) consists of dividing, using the SplitPlaneParameters function, the input PlaneExtractionParametersT token (line 49) into a number of extent requests ; sending the extent requests to the appropriate extent servers and reading the extents (line 56) ; extracting a plane tile from each extent (line 58) ; and sending the plane tiles to the Client (line 54, first parameter) to be merged into the whole extracted plane (line 54, second parameter). Lines 52 to 58 specify all the synchronization and communication requirements of the pipeline/parallel 2-D plane extraction. The specification of Figure 4 (lines 52 to 58) closely matches the diagram of Figure 3.
The CAP programming model is a distributed-memory model in which the application programmer controls data transfers at a high level of abstraction. Thanks to the CAP formalism, application programmers need not program data-transfer and synchronization protocols. CAP specifications are translated into code relying on a parallel library supporting three primitives : spawning a process, sending a message and receiving a message. This simple library interface makes it easy to port the CAP runtime from one architecture to another. The CAP preprocessor produces a single executable that can be run without recompilation in many different configurations : a single thread performing all operations, to facilitate the use of the debugger ; all threads in the same process ; or an arbitrary allocation of threads to multiple processes.
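The three-primitive interface can be pictured as follows ; this is a hypothetical sketch, not the actual CAP runtime API, using a trivial in-process loopback implementation of the kind that would let a whole program run inside a single process for debugging:

```cpp
#include <queue>
#include <string>

// Hypothetical message and porting interface: porting the runtime to a new
// architecture would mean reimplementing spawn/send/receive on that platform.
struct Message { std::string payload; int tag; };

class LoopbackPort {
    std::queue<Message> mailbox;   // single-process configuration: one shared mailbox
public:
    int spawnProcess(const std::string& /*executable*/) { return 0; } // no-op here
    void send(int /*processId*/, const Message& m) { mailbox.push(m); }
    Message receive() { Message m = mailbox.front(); mailbox.pop(); return m; }
};
```

A real port would back the same three entry points with, e.g., OS processes and TCP/IP sockets instead of an in-memory queue.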
The CAP extension supports the specification of the pipeline/parallel applications required for PS2 architectures (section 1), as well as the specification of more traditional parallel applications such as linear algebra algorithms (LAPACK). Its semantics is based on Directed Acyclic Graphs (DAGs). DAGs are general, i.e. they make it possible to specify arbitrary schedulings of sequential operations. DAGs consist of operations, split-points and merge-points. For example, the DAG of Figure 3 consists of:
The acyclicity of DAGs guarantees that CAP programs are deadlock-free. CAP specifications of symmetric DAGs, i.e. DAGs where all operations starting at a given split-point finish at the same merge-point, are acyclic by construction, and therefore produce parallel programs that are deadlock-free ; in the case of asymmetric DAGs, deadlocks are reported clearly by the runtime, which names the parallel operation causing the deadlock and shows the set of sequential operations leading to the deadlock situation. The DAG of Figure 3 is a symmetric DAG. DAGs are also used in prototyping systems such as Rapide [3] and in parallel programming languages such as Mentat [2].
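The shape of a symmetric split/merge DAG can be mimicked in plain C++ (an illustrative analogue of the SplitPlaneParameters/MergePlaneParts pair, not CAP-generated code): every token produced at the split-point is consumed by the single matching merge-point, so no cycle can arise.

```cpp
#include <numeric>
#include <vector>

// Illustrative analogue of a symmetric DAG: one split-point, one merge-point.
std::vector<int> splitPoint(int tokenCount) {           // emits one token per branch
    std::vector<int> tokens(tokenCount);
    std::iota(tokens.begin(), tokens.end(), 0);         // tokens 0, 1, ..., n-1
    return tokens;
}

int operationBody(int token) { return token * token; }  // stand-in sequential operation

int mergePoint(const std::vector<int>& results) {       // all branches end here
    return std::accumulate(results.begin(), results.end(), 0);
}
```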
The performance of CAP programs results from three characteristics of the CAP language :
The 20-byte overhead is very low compared to the latency of even the fastest networks. The language is hierarchical and compositional, i.e. it is possible to reuse parallel operations. For example, the parallel LU factorization of a matrix requires a parallel matrix multiplication ; in CAP, reusing the pipeline/parallel matrix multiplication simply involves calling the parallel multiplication operation. The CAP language supports overlapped communication and computation, making it possible to overcome the high network latencies typical of the TCP/IP protocol stack. The development of a CAP program requires three steps:
Debugging a CAP program is done in three steps, none of which requires recompilation :
The CAP language extension improves the productivity of the parallel-application developer for four reasons :