Back to the main page of LSP/EPFL Peripheral Systems Laboratory (EPFL-DI/LSP)

The CAP parallel extension to C++

The CAP (Computer Aided Parallelization) language is a general-purpose parallel extension to C++, developed in the context of PS2 architectures (Parallel Storage and Processing Servers). CAP enables application programmers to hierarchically specify the macro dataflow between operations performed on tokens (application data subsets). Operations are segments of sequential code executed by a single thread and characterized by a single input and a single output ; the input and output values of an operation are called tokens. A macro-dataflow graph specifies how tokens are routed between the operations of the parallel program, i.e. the scheduling of operations. The job of the CAP programmer is to specify the tokens (i.e. C++ data structures) exchanged between operations, and the scheduling of the sequential operations required to perform a parallel operation.

FIGURE 3. Graphical representation of the parallel 2-D plane extraction algorithm.

Figure 3 is a graphical representation of the parallel 2-D plane extraction. At the left, the user supplies plane extraction parameters. The plane extraction parameters are divided into several extent requests, which are sent to the appropriate extent servers. The selection of the extent server is dynamic and depends on the plane specified by the user and on the allocation of extents to S/P nodes. After an extent has been read by the extent server, it is transferred to the compute server for plane part extraction. Extents handled by the same S/P node are processed in pipelined fashion, i.e. the S/P node extracts plane parts from the current extent while fetching the next extent.

  1 // tokens
  2  token PlaneExtractionParametersT {
  3    // C++ class fields
  4  } ;
  5  token ReadExtentRequestT {
  6    // C++ class fields
  7  } ;
  8  token PlanePartT {
  9    // C++ class fields
 10  } ;
 11  token PlaneT {
 12    // C++ class fields
 13  } ;
 15  // thread hierarchy
 16  process ComputeServerT {} ;
 17  process ExtentServerT {
 18  operations :
 19    ReadExtent
 20      in ReadExtentRequestT* Input
 21      out ExtentT* Output ;
 22  } ;
 23  process ClientProcessT {} ;
 24  process Ps2ServerT {
 25  subprocesses :
 26    ExtentServerT ExtentServer[4] ;
 27    ComputeServerT ComputeServer[4] ;
 28    ClientProcessT Client ;
 29  operations :
 30  } ;

 31  int SplitPlaneParameters
 32    (  PlaneExtractionParametersT* FromP
 33    ,  ReadExtentRequestT*& ThisP
 34    )
 35  { /* C++ code */ }
 37  void MergePlaneParts
 38    (  PlaneT* IntoP, PlanePartT* FromP )
 39  { /* C++ code */ }
 42  leaf operation ComputeServerT::PlanePartExtraction
 43    in  ExtentT* Input
 44    out PlanePartT* Output
 45  { /* C++ code */ }
 48  operation Ps2ServerT::PlaneExtraction
 49    in  PlaneExtractionParametersT* Input
 50    out PlaneT* Output
 51  {
 52    parallel while
 53      ( SplitPlaneParameters, MergePlaneParts
 54      , Client, PlaneT Result)
 55      (
 56      ExtentServer[thisTokenP->ESIndex].ReadExtent
 57      >->
 58      ComputeServer[thisTokenP->CSIndex].PlanePartExtraction
 59    );
 60  }
FIGURE 4. CAP specification of the 2-D plane extraction

Figure 4 is the CAP specification of the 2-D plane extraction. The tokens (lines 2 to 13) exchanged between operations in the CAP program are PlaneExtractionParametersT, ReadExtentRequestT, PlanePartT and PlaneT. The thread hierarchy describes the threads running on the parallel architecture. There is a disk-access thread (ExtentServer[*], line 26) and a computation thread (ComputeServer[*], line 27) for each S/P node, and a thread representing the user (Client, line 28). The ExtentServer features one predefined operation, ReadExtent. The parallel plane extraction (lines 48 to 60) consists of dividing, using the SplitPlaneParameters function, the input PlaneExtractionParametersT token (line 49) into a number of extent requests, sending the extent requests to the appropriate extent servers and reading the extents (line 56), extracting a plane tile from each extent (line 58), and sending the plane tiles to the Client (line 54, first parameter) to be merged into the whole extracted plane (line 54, second parameter). Lines 52 to 58 specify all the synchronization and communication requirements of the pipeline/parallel 2-D plane extraction. The specification of Figure 4 (lines 52 to 58) closely matches the diagram of Figure 3.

The CAP programming model is a distributed-memory model, where the application programmer controls data transfers at a high level of abstraction. Thanks to the CAP formalism, application programmers need not program data-transfer and synchronization protocols. CAP specifications are translated into calls to a parallel library supporting three primitives : spawning a process, sending a message and receiving a message. This simple library interface makes it easy to port the CAP runtime from one architecture to another. The CAP preprocessor produces a single executable that can be run without recompilation in many different configurations : a single thread performing all operations, to facilitate the use of the debugger ; all threads in the same process ; or an arbitrary allocation of threads to multiple processes.

The CAP extension supports the specification of pipeline/parallel applications required for PS2 architectures (section 1), as well as the specification of more traditional parallel applications such as linear algebra algorithms (LaPack). Its semantics is based on Directed Acyclic Graphs (DAGs). DAGs are general, i.e. they make it possible to specify arbitrary schedulings of sequential operations. DAGs consist of operations, split-points and merge-points. For example, the DAG of Figure 3 consists of:

  1. one split-point called SplitPlaneParameters ;
  2. n operations called ReadExtent performed by the ExtentServer threads ;
  3. n operations called PlanePartExtraction performed by the ComputeServer threads ; and
  4. one merge-point called MergePlaneParts.

The acyclicity of DAGs guarantees that CAP programs are deadlock-free. CAP specifications of symmetric DAGs, i.e. DAGs where all operations starting at a given split-point finish at the same merge-point, are acyclic by construction, and therefore produce parallel programs that are deadlock-free ; in the case of asymmetric DAGs, deadlocks are reported clearly by the runtime, which names the parallel operation causing the deadlock and shows the set of sequential operations leading to the deadlock situation. The DAG of Figure 3 is a symmetric DAG. DAGs are also used in prototyping systems such as Rapide [3] and in parallel programming languages such as Mentat [2].

The performance of CAP programs results from 3 characteristics of the CAP language :

  1. it is always possible to minimize the number of data transfers in a CAP specification, because CAP supports references to global variables in each address space ;
  2. each data transfer in a CAP program entails only a 20-byte overhead ;
  3. it is always possible to minimize the number of data copies.

The 20-byte overhead is very low compared to the latency of even the fastest networks. The language is hierarchical and compositional, i.e. it is possible to reuse parallel operations. For example, the parallel LU factorization of a matrix requires a parallel matrix multiplication. In CAP, reusing the pipeline/parallel matrix multiplication simply involves calling the parallel multiplication operation. The CAP language supports overlapped communication and computation, making it possible to overcome the high network latencies typical of the TCP/IP protocol stack. The development of a CAP program requires three steps:

  1. divided-domain sequential program (i.e. a sequential program where the data set is divided into data subsets, and where the algorithm is decomposed into procedures operating on those subsets) ;
  2. CAP program specification (a token for each data-subset type, the thread hierarchy, the scheduling of operations in the form of DAGs) ;
  3. debugging.

The debugging of a CAP program itself is done in three steps, none of which requires recompilation :

  1. running the CAP program sequentially to remove memory management errors ;
  2. running the CAP program with multiple threads in a single process ;
  3. running the CAP program with multiple threads in multiple processes.

The CAP language extension improves the productivity of the parallel-application developer for four reasons :

  1. CAP is compositional, i.e. parallel operations can be used in other parallel programs ;
  2. CAP-specified symmetric DAGs (the most common form of CAP programs) are deadlock-free by construction ;
  3. the same CAP program can be run single-threaded, multi-threaded, and multi-process without recompilation, allowing the validation of the program at every step of the parallel-program development process ;
  4. the programmer does not need to program the protocols required to exchange procedure operation parameters and data between address spaces.


<basile.schaeli@epfl(add: .ch)>
Last modified: 2007/09/26 21:22:10