The CAP (Computer Aided Parallelization) language is a general-purpose parallel extension to C++, developed in the context of PS2 architectures (Parallel Storage and Processing Servers). CAP enables application programmers to hierarchically specify the macro dataflow between operations performed on tokens (application data subsets). Operations are segments of sequential code performed by a single execution thread and characterized by a single input and a single output. The input and output values of an operation are called tokens. A macro-dataflow graph specifies how tokens are routed between the operations of the parallel program, i.e. the scheduling of operations. The job of the CAP programmer is to specify the tokens (i.e. C++ data structures) exchanged between operations, and the scheduling of the sequential operations required to perform a parallel operation.
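As a rough illustration of these concepts in plain C++ (the types, fields and function below are invented for illustration and are not part of the CAP runtime), a token is an ordinary data structure and a sequential operation is a single-input, single-output transform executed by one thread:

```cpp
#include <vector>

// Illustrative sketch only: a token is a passive data structure,
// and an operation maps one input token to one output token.
struct ExtentToken { std::vector<int> samples; };  // hypothetical input token
struct PlanePartToken { int pixelSum; };           // hypothetical output token

// A sequential operation: one thread, one input token, one output token.
PlanePartToken planePartExtraction(const ExtentToken& in) {
    PlanePartToken out{0};
    for (int s : in.samples)
        out.pixelSum += s;                         // stand-in for the real extraction code
    return out;
}
```

A macro-dataflow graph then only decides how such tokens flow from one operation to the next.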
Figure 3 is a graphical representation of the parallel 2-D plane extraction. At the left, the user supplies the plane extraction parameters. The plane extraction parameters are divided into several extent requests, which are sent to the appropriate extent servers. The selection of the extent server is dynamic : it depends on the plane specified by the user and on the allocation of extents to S/P nodes. After an extent has been read by the extent server, it is transferred to the compute server for plane-part extraction. Extents handled by the same S/P node are processed in pipelined fashion, i.e. the S/P node extracts plane parts from the current extent while fetching the next extent.
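The fetch-while-compute behaviour of a single S/P node can be sketched in standard C++, with std::async standing in for the separate disk-access thread (the functions below are placeholders, not CAP code):

```cpp
#include <future>

int fetchExtent(int index)      { return index * 10; } // stand-in for a disk read
int extractPlanePart(int extent){ return extent + 1; } // stand-in computation

// Pipelined loop: the fetch of extent i+1 is launched before the
// computation on extent i starts, so disk access and computation overlap.
int pipelinedExtraction(int extentCount) {
    int merged = 0;
    std::future<int> next = std::async(std::launch::async, fetchExtent, 0);
    for (int i = 0; i < extentCount; ++i) {
        int current = next.get();                      // wait for extent i
        if (i + 1 < extentCount)                       // start fetching extent i+1 ...
            next = std::async(std::launch::async, fetchExtent, i + 1);
        merged += extractPlanePart(current);           // ... while extracting from extent i
    }
    return merged;
}
```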
 1 // tokens
 2 token PlaneExtractionParametersT {
 3   // C++ class fields
 4 } ;
 5 token ReadExtentRequestT {
 6   // C++ class fields
 7 } ;
 8 token PlanePartT {
 9   // C++ class fields
10 } ;
11 token PlaneT {
12   // C++ class fields
13 } ;
14
15 // thread hierarchy
16 process ComputeServerT {} ;
17 process ExtentServerT {
18   operations :
19     ReadExtent
20       in ReadExtentRequestT* Input
21       out ExtentT* Output ;
22 } ;
23 process ClientProcessT {} ;
24 process Ps2ServerT {
25   subprocesses :
26     ExtentServerT ExtentServer[4] ;
27     ComputeServerT ComputeServer[4] ;
28     ClientProcessT Client ;
29   operations :
30 } ;
31 int SplitPlaneParameters
32   ( PlaneExtractionParametersT* FromP
33   , ReadExtentRequestT*& ThisP
34   )
35 { // C++ code }
36
37 void MergePlaneParts
38   ( PlaneT* IntoP, PlanePartT* FromP )
39 { // C++ code }
40
41
42 leaf operation ComputeServerT::PlanePartExtraction
43   in ExtentT* Input
44   out PlanePartT* Output
45 { // C++ code }
46
47
48 operation Ps2ServerT::PlaneExtraction
49   in PlaneExtractionParametersT* Input
50   out PlaneT* Output
51 {
52   parallel while
53     ( SplitPlaneParameters, MergePlaneParts
54     , Client, PlaneT Result )
55   (
56     ExtentServer[thisTokenP->ESIndex].ReadExtent
57     >->
58     ComputeServer[thisTokenP->CSIndex].PlanePartExtraction
59   );
60 }
Figure 4 is the CAP specification of the 2-D plane extraction. The tokens (lines 2 to 13) exchanged between operations in the CAP program are PlaneExtractionParametersT, ReadExtentRequestT, PlanePartT and PlaneT. The thread hierarchy describes the threads running on the parallel architecture. There is a disk-access thread (ExtentServer[*], line 26) and a computation thread (ComputeServer[*], line 27) for each S/P node, and a thread representing the user (Client, line 28). The ExtentServerT features one predefined operation, ReadExtent. The parallel plane extraction (lines 48 to 60) consists of dividing, using the SplitPlaneParameters function, the input PlaneExtractionParametersT token (line 49) into a number of extent requests ; sending the extent requests to the appropriate extent servers and reading the extents (line 56) ; extracting a plane tile from each extent (line 58) ; and sending the plane tiles to the Client (line 54, first parameter) to be merged into the whole extracted plane (line 54, second parameter). Lines 52 to 58 specify all the synchronization and communication requirements of the pipeline/parallel 2-D plane extraction. The specification of Figure 4 (lines 52 to 58) closely matches the diagram of Figure 3.
The CAP programming model is a distributed-memory model in which the application programmer controls data transfers at a high level of abstraction. Thanks to the CAP formalism, application programmers need not program data-transfer and synchronization protocols. CAP specifications are translated into code relying on a parallel library supporting three primitives : spawning a process, sending a message and receiving a message. This simple library interface makes it easy to port the CAP runtime from one architecture to another. The CAP preprocessor produces a single executable that can be run without recompilation in many different configurations : a single thread performing all operations, to facilitate the use of the debugger ; all threads in the same process ; or an arbitrary allocation of threads to multiple processes.
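The three-primitive interface can be pictured as follows ; this is a hypothetical sketch, not the actual CAP runtime API, using a trivial in-process loopback implementation of the kind that would let a whole program run inside a single process for debugging:

```cpp
#include <queue>
#include <string>

// Hypothetical message and porting interface: porting the runtime to a new
// architecture would mean reimplementing spawn/send/receive on that platform.
struct Message { std::string payload; int tag; };

class LoopbackPort {
    std::queue<Message> mailbox;   // single-process configuration: one shared mailbox
public:
    int spawnProcess(const std::string& /*executable*/) { return 0; } // no-op here
    void send(int /*processId*/, const Message& m) { mailbox.push(m); }
    Message receive() { Message m = mailbox.front(); mailbox.pop(); return m; }
};
```

A real port would back the same three entry points with, e.g., OS processes and TCP/IP sockets instead of an in-memory queue.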
The CAP extension supports the specification of the pipeline/parallel applications required for PS2 architectures (section 1), as well as the specification of more traditional parallel applications such as linear algebra algorithms (LAPACK). Its semantics is based on Directed Acyclic Graphs (DAGs). DAGs are general, i.e. they make it possible to specify arbitrary schedulings of sequential operations. DAGs consist of operations, split-points and merge-points. For example, the DAG of Figure 3 consists of:
The acyclicity of DAGs guarantees that CAP programs are deadlock-free. CAP specifications of symmetric DAGs, i.e. DAGs where all operations starting at a given split-point finish at the same merge-point, are acyclic by construction, and therefore produce parallel programs that are deadlock-free ; in the case of asymmetric DAGs, deadlocks are reported clearly by the runtime, which names the parallel operation causing the deadlock and shows the set of sequential operations leading to the deadlock situation. The DAG of Figure 3 is a symmetric DAG. DAGs are also used in prototyping systems such as Rapide [3] and in parallel programming languages such as Mentat [2].
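The shape of a symmetric split/merge DAG can be mimicked in plain C++ (an illustrative analogue of the SplitPlaneParameters/MergePlaneParts pair, not CAP-generated code): every token produced at the split-point is consumed by the single matching merge-point, so no cycle can arise.

```cpp
#include <numeric>
#include <vector>

// Illustrative analogue of a symmetric DAG: one split-point, one merge-point.
std::vector<int> splitPoint(int tokenCount) {           // emits one token per branch
    std::vector<int> tokens(tokenCount);
    std::iota(tokens.begin(), tokens.end(), 0);         // tokens 0, 1, ..., n-1
    return tokens;
}

int operationBody(int token) { return token * token; }  // stand-in sequential operation

int mergePoint(const std::vector<int>& results) {       // all branches end here
    return std::accumulate(results.begin(), results.end(), 0);
}
```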
The performance of CAP programs results from three characteristics of the CAP language :
The 20-byte overhead is very low compared to the latency of even the fastest networks. The language is hierarchical and compositional, i.e. it is possible to reuse parallel operations. For example, the parallel LU factorization of a matrix requires a parallel matrix multiplication ; in CAP, reusing the pipeline/parallel matrix multiplication simply involves calling the parallel multiplication operation. The CAP language supports overlapped communication and computation, making it possible to overcome the high network latencies typical of the TCP/IP protocol stack. The development of a CAP program requires three steps:
Debugging a CAP program is done in three steps, none of which requires recompilation :
The CAP language extension improves the productivity of the parallel-application developer for four reasons :