
Under consideration for publication in J. Functional Programming

# Compilation of a Specialized Functional Language for Massively Parallel Computers

Pascal FRADET and Julien MALLET
IRISA, Campus de Beaulieu, 35042 Rennes, France
(e-mail: {fradet,mallet}@irisa.fr)

## Abstract

We propose a parallel specialized language that ensures portable and cost-predictable implementations on parallel computers. The language is basically a first-order, recursion-less, strict functional language equipped with a collection of higher-order functions or skeletons. These skeletons apply on (nested) vectors and can be grouped in four classes: computation, reorganization, communication, and mask skeletons. The compilation process is described as a series of transformations and analyses leading to spmd-like functional programs which can be directly translated into real parallel code. The language restrictions enforce a programming discipline whose benefit is to allow a static, symbolic, and accurate cost analysis. The parallel cost takes into account both load balancing and communications, and can be statically evaluated even when the actual size of vectors or the number of processors is unknown. It is used to automatically select the best data distribution among a set of standard distributions. Interestingly, this work can be seen as a cross-fertilization between techniques developed within the Fortran parallelization, skeleton, and functional programming communities.

## 1 Introduction

A good parallel programming model must be portable and cost predictable. General purpose languages such as Fortran achieve portability but cost estimations are often very approximate. A precise cost analysis is especially important in this context since the goal of parallelization is efficiency and its impact on the overall cost is, at best, a division by a constant.
So, orders of magnitude or maximum complexities are insufficient to guide parallel implementation choices.

The approach described in this paper is based on a restricted pure functional language that is portable and allows us to design an automatic and accurate cost analysis. The language restrictions can be seen as enforcing a programming discipline that ensures a predictable performance on the target parallel computer (there will be no "performance bugs"). General recursion and conditionals are replaced by skeletons that encapsulate control and data flow in the sense of (Cole, 1988) or (Darlington et al., 1993). The skeletons, which act on (potentially nested) vectors, can be grouped into four classes: the computation skeletons (classical data parallel functions), the reorganization skeletons (creating and structuring vectors), the communication skeletons (data motion over vectors), and the mask skeletons (conditional data parallel functions). The skeletons and data structures have been chosen with scientific computing in mind. Matrix computations and nested for loops are easy to describe and many standard numerical algorithms have been expressed easily in our kernel language. Concerning the target parallel architecture, we aimed at mimd (Multiple Instructions Multiple Data) computers with shared or distributed memory, but simd (Single Instruction Multiple Data) computers could be accommodated as well.

The compilation process is described as a series of program transformations leading to spmd-like (Single Program Multiple Data) functional programs which can be directly translated into true parallel code. Each compilation step transforms a skeleton-based language into another closer to a code with explicit parallelism. The main compilation steps consist of a size analysis, an update-in-place analysis, a transformation making all communications explicit, transformations implementing data distribution, and a symbolic code analysis.
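To give a concrete feel for the four skeleton classes, here is a hypothetical Python miniature. The names `map_skel`, `split`, `brdcast`, and `mask_map` are illustrative stand-ins, not the paper's actual skeletons (which are introduced in Section 3):

```python
# Illustrative stand-ins for the four skeleton classes (hypothetical names).

def map_skel(f, v):
    """Computation skeleton: apply f to every element of a vector."""
    return [f(x) for x in v]

def split(p, v):
    """Reorganization skeleton: restructure a vector into p contiguous blocks."""
    b = len(v) // p
    return [v[i * b:(i + 1) * b] for i in range(p)]

def brdcast(x, n):
    """Communication skeleton: replicate a value across an n-element vector."""
    return [x for _ in range(n)]

def mask_map(pred, f, v):
    """Mask skeleton: apply f only at positions where the predicate holds."""
    return [f(x) if pred(x) else x for x in v]

v = [1, 2, 3, 4]
print(map_skel(lambda x: x + 1, v))                          # [2, 3, 4, 5]
print(split(2, v))                                           # [[1, 2], [3, 4]]
print(brdcast(7, 3))                                         # [7, 7, 7]
print(mask_map(lambda x: x % 2 == 0, lambda x: x * 10, v))   # [1, 20, 3, 40]
```

Note how each skeleton fixes the control structure once and for all; only the element-level functions vary, which is what makes a static cost model tractable.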
Note that using a functional language avoids the dependence analysis needed by imperative languages to determine the computations which can be executed in parallel. A key task in the compilation of data parallelism is the choice of the distribution since it determines both the load balancing and the communications between processors. In our approach, the restrictions imposed by the source language make it possible to choose automatically the globally best data distribution among a set of standard distributions. This choice relies on the cost analysis which evaluates accurate parallel execution times. One of the main challenges of our work was to tackle this problem without knowing the actual size of vectors or the number of processors. In other words, the analysis ought to be symbolic.

The contributions of this work are both technical and methodological.

The compilation is efficient and original: it integrates very different analysis techniques in a sequence of program transformations. Maybe the most important contribution lies in the definition of the specialized language. We establish the necessary language restrictions to ensure the accuracy of a symbolic cost analysis. The compilation takes advantage of the analysis to choose the best parallel implementation among a set of standard implementations. The analyses for Fortran nested loops (Gupta Feautrier, 1994) do not evaluate precisely the cost of the communications. In particular, collective communications (diffusion, translation, etc.) cannot be automatically detected in Fortran programs. In our approach, the analysis takes into account both load balancing and communication costs since the collective communications appear explicitly through communication skeletons. Most existing skeleton implementations (Blelloch et al., 1994; Darlington et al., 1995) are based on fixed implementations for each skeleton. This local view may lead to redistributing data before each skeleton application and therefore be very inefficient.
Finding the best global distribution is more complex but more efficient since it takes into account compromises (i.e. trading a local cost increase for a global improvement).

At the methodology level, this work is a rare example of cross-fertilization between three different research fields: the automatic parallelization of Fortran, skeleton-based languages, and functional programming. The Fortran community has worked on symbolic complexity analyses for subsets of Fortran. We adapt and extend this work to define the cost analysis in our context. Skeleton-based languages with their collection of data parallel functions provide a framework to define specialized parallel languages. We build upon this approach and provide two new collections of skeletons to the programmer: the communication skeletons and the mask skeletons. The functional programming community has produced a large body of work on typing, program transformation, and analysis. Our compilation process, made of a collection of program analyses and transformations, relies on this work.

The article is structured as follows. Section 2 is an overview of the compilation process. Section 3 presents the source language and the target parallel language. Sections 4 to 8 describe each compilation step in turn. We report some experiments done on mimd distributed memory computers in Section 9. Section 10 justifies a posteriori the source language restrictions, reviews related work, and suggests directions for further research.

## 2 Overview

The compilation process consists of a series of program transformations:

```
L1 --GL--> L2 --EC--> L3 --ABS--> L4 --DIST--> L5 --OPT--> L6 --TRA--> Parallel Code
```

Each arrow represents a transformation compiling a particular task by mapping skeleton programs from one intermediate language (Li) into another (Li+1). The source language (L1) is composed of a collection of higher-order functions (skeletons) acting on vectors (see Section 3.1).
It is primarily designed for a particular domain where high performance is a crucial issue: numerical algorithms. L1 is best viewed as a parallel kernel language embedded in a general sequential language (e.g. C). Only parts of programs written in L1 will be executed in parallel whereas other parts will be executed sequentially, for example, on the host computer of the parallel machine.

The first compilation step is the type/size analysis of L1 programs. The analysis computes the shape (size) of all the vectors occurring in the program (Section 4). It must infer symbolic sizes because the sizes of input vectors may be unknown at compile time. As a byproduct, the analysis infers conditions ensuring that no vector access error occurs at runtime.

The first transformation (L1 → L2) deals with in-place updating, a standard problem in functional programming with aggregates (Section 5). The program is analyzed to check that all vectors can be safely modified in place. If the program does not pass the analysis, it must be transformed by inserting explicit vector copies. This can be done automatically (the analysis indicates the places where to insert copies) or manually (the programmer may want to restructure the program to insert fewer copies). This step is also used to guarantee that vectors are either returned as result or are explicitly deallocated (i.e. a garbage collector is not needed).

The transformation EC (L2 → L3) makes all communications explicit (Section 6). Intuitively, in order to execute an expression such as map (λx. x + y) in parallel, y must be broadcast to every processor before applying the function. The transformation makes this kind of communication explicit. In the language L3, all communications are expressed through skeletons.

The transformations from L3 to L6 concern automatic data distribution (Section 7). First, λ-abstractions and variables are removed by threading an explicit environment throughout the program (transformation ABS, Section 7.1).
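The intuition behind the EC step can be sketched in Python. The helpers `brdcast` and `map_skel` below are illustrative, not the paper's actual L3 skeletons: the free variable y of the mapped function is first replicated explicitly, so that the subsequent map is purely local:

```python
# Hypothetical sketch of making the communication in `map (\x. x + y)` explicit.

def brdcast(y, n):
    # Communication: replicate the free value y to all n element positions.
    return [y] * n

def map_skel(f, v):
    # Computation: purely local; f's body has no free variables.
    return [f(p) for p in v]

v, y = [1, 2, 3], 10

# Before EC: the mapped function captures y (an implicit communication).
before = [x + y for x in v]

# After EC: broadcast y explicitly, then map over (element, copy-of-y) pairs.
ys = brdcast(y, len(v))
after = map_skel(lambda p: p[0] + p[1], list(zip(v, ys)))

assert before == after == [11, 12, 13]
```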
This transformation, reminiscent of abstraction algorithms, prepares the distribution transformation DIST (L4 → L5). We consider a set of standard distributions of the input vectors. A vector can be distributed cyclically, by contiguous blocks, or allocated to a single processor. For a matrix (vector of vectors), this gives 9 possible distributions (cyclic-cyclic, block-cyclic, row-cyclic, etc.). Distribution transforms programs so that they act on a single vector whose elements represent the processors. This implies, in particular, changing all vector accesses according to the distribution (Section 7.2). Finally, distributed programs are optimized (OPT: L5 → L6) using a set of local transformations (Section 7.3). After distribution, some vector copies become useless. They are removed in order to improve the sequential execution time. L6 programs apply on a vector of processors and simulate an spmd code.

In order to choose the best distribution, an L4 program is transformed according to all the possible distributions of its input parameters, leading to a set of L6 programs. The symbolic cost of each version is evaluated and the smallest one chosen (Section 8). For most numerical algorithms, the number of input vectors is small and this approach is practical. In other cases, we would have to rely on the programmer to prune the search space.

The transformation TRA (L6 → Parallel Code) is a straightforward translation of the spmd skeleton program to an imperative program with calls to a standard communication library. We currently use C with the mpi (Message Passing Interface) library along with the C compiler of the host machine (Section 9 presents experiments on an Intel Paragon XP/S and a Cray T3E).

All the transformations are automatic. Nevertheless, the user can interact with the compiler, for example to insert explicit copies in the L1 → L2 step, or to guide the choice of the best distribution in L5.

For each transformation Ti: Li → Li+1, three correctness properties must be proved.
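The block and cyclic distributions used by the DIST step above can be sketched as owner/local-index maps. This is a hypothetical illustration, assuming n elements spread over p processors with block size ⌈n/p⌉:

```python
from math import ceil

# Hypothetical owner / local-index maps for two of the standard distributions
# (block, cyclic); a matrix distribution composes one such map per dimension.

def block_owner(i, n, p):
    """Contiguous blocks: element i lives on processor i // ceil(n/p)."""
    b = ceil(n / p)
    return i // b, i % b          # (processor, local index)

def cyclic_owner(i, n, p):
    """Cyclic: element i lives on processor i mod p."""
    return i % p, i // p          # (processor, local index)

n, p = 8, 4
print([block_owner(i, n, p)[0] for i in range(n)])   # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_owner(i, n, p)[0] for i in range(n)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```

A matrix (vector of vectors) then yields the 9 combinations mentioned above by pairing one of {block, cyclic, single processor} per dimension.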
First, it must be shown that the transformation Ti transforms programs of Li into programs of Li+1, formally:

**Property 1** ∀Prog ∈ Li . Ti[[Prog]] ∈ Li+1

Second, the transformation Ti must preserve the semantics of programs, formally:

**Property 2** ∀Prog ∈ Li . Ti[[Prog]] = Prog

Third, it must be checked that the update-in-place property still holds on transformed programs.

**Property 3** ∀Prog ∈ Li (i ≥ 1) . UP(Prog) ⇒ UP(Ti[[Prog]])

Since transformations are defined on the structure of expressions, the proofs of these properties usually boil down to a routine inspection of the different cases. Due to their number and length, we do not describe them in this paper. A few examples of proofs are sketched in appendix B.

The source language comprises 16 skeletons, plus a number of other constructions (pairs, operators, affine expressions, etc.). Further, new skeletons are added as the language gets closer to an spmd language. Presenting the 6 compilation steps (analyses and transformations) for the whole language would be lengthy and tiresome. After presenting the complete source and target languages in the next section, we chose to focus on a tiny sublanguage for the rest of the presentation. A simple example (written in the sublanguage) is taken throughout the paper and illustrates the different steps.

The treatment of a more complete language (having at least one skeleton of each type) can be found in appendix A. The interested reader will find a description of transformations, analyses, and proofs for the whole language in (Mallet, 1998a)¹. A previous conference paper (Mallet, 1998b) focuses on the cost analysis and can be seen as a short introduction to this work.

¹ Actually, the language presented here is slightly different (it considers a larger class of mask skeletons) than the one in (Mallet, 1998a).

## 3 Source and Target Languages

### 3.1 The source language L1

The source language L1 is basically a strict, pure, first-order, recursion-less functional language, extended with a collection of higher-order functions (the skeletons). We have restricted ourselves to a reasonable number of standard skeletons; new ones, especially among the computation and reorganization classes, could be integrated as well. The main data structure is the vector, which can be nested to model multidimensional arrays.

The syntax of L1 is defined in Figure 1 (its type system will be described in Section 4). A program is a main expression followed by definitions. A definition is either a function definition or the declaration of input variables with their types. The types of input vectors bear their numerical or symbolic size. An expression (nonterminal Exp1) is either an application, a tuple, a variable, or a constant. Functions are unary λ-abstractions, unary operators, or the predefined iterator iterfor. It can be defined in Haskell (Hudak et al., 1992) as follows:

```
Prog1  ::= Exp1 where Decl1
Decl1  ::= Decl1 Decl1 | f = Fun1 | x :: Type1
Type1  ::= (Type1, ..., Type1) | Vect LinF1 Type1 | Int | Float | Bool
Exp1   ::= Fun1 Exp1 | (Exp1, ..., Exp1) | x | k
Fun1   ::= iterfor LinF1 Fun1 | Op1 | λ(x1, ..., xn).Exp1 | f
         | CompSkel1 | ReorgSkel1 | CommSkel1 | MaskSkel1
Op1    ::= + | − | × | div | exp | log | cos | ...
LinF1  ::= LinF1 + LinF1 | LinF1 − LinF1 | k LinF1 | x

Fig. 1. Syntax of L1
```