# Exact and approximate aggregation in constraint query languages.pdf

Exact and Approximate Aggregation in ConstraintQuery LanguagesMichael BenediktBell Laboratories263 Shuman BlvdNaperville, IL 60566E-mail: benedikt@research.bell-labs.comLeonid LibkinBell Laboratories600 Mountain AvenueMurray Hill, NJ 07974E-mail: libkin@bell-labs.comAbstractWe investigate the problem of how to extend constraint query languages with aggregate oper-ators. We deal with standard relational aggregation, and also with aggregates speci c to spatialdata, such as volume. We study several approaches, including the addition of a new class of ap-proximate aggregate operators which allow an error tolerance in the computation. We show howtechniques based on VC-dimension can be used to give languages with approximation operators,but also show that these languages have a number of shortcomings. We then give a set of resultsshowing that it is impossible to get constraint-based languages that admit de nable aggregationoperators, both for exact operators and for approximate ones. These results are quite robust, inthat they show that closure under aggregation is problematic even when the class of functionspermitted in constraints is expanded.This motivates a di erent approach to the aggregation problem. We introduce a language FO+Poly+Sum, which permits standard discrete aggregation operators to be applied to the outputs ofrange-restricted constraint queries. We show that this language has a number of attractive closureand expressivity properties, and that it can compute volumes of linear-constraint databases.1 IntroductionNew applications of database technology, such as Geographical Information Systems, have spurred aconsiderable amount of research into generalizations of the standard relational model to deal with themanipulation of geometric or spatial data. One common approach to modeling spatial database is toconsider input databases as given by a set of well-behaved relations in euclidean space { for example,by a set of semi-linear or semi-algebraic sets. There are a number of proposed query languages thatextend classical relational algebra to this setting, languages that allow the use of various geometricoperations in manipulating spatial databases. One of the most well-developed models for spatialqueries is the constraint database model [23]. In this model, spatial databases are represented as setsof linear or polynomial constraints. Databases are queried using standard relational calculus withlinear (resp. polynomial) inequalities as selection criteria, see [4, 5, 6, 19, 20, 32, 38]. These languages,denoted by FO + Lin and FO + Poly, have become the dominant ones in the constraint databaseliterature. They have a very important closure property: the application of a FO + Lin query to alinear constraint set yields a new set of linear constraints; similarlyFO+Poly queries on sets de nablewith polynomial constraints produce sets that can still be de ned with polynomial constraints.1ConstraintQueryLanguages, then, give a naturalanalog of relationalcalculusinthe geometric context.A crucialquestion, though, concerns how to extend standardaggregation constructs from the relationalmodel to the geometric setting. This question has two components. First, we would like our languagesto be able to apply standard SQL operators such as TOTAL and AVG to spatial queries, wheneverthese operators make sense. Since the output of queries in constraint query languages (and in otherspatial query languages) may be merely nitely representable (that is, representable by some nitemeans, e.g., a nite set of constraints) and not nite, the aggregation operators cannot be allowed tobe applied to any constraint query output. One problem then, is to design a language that allows thesafe application of classical aggregates.The second component of the `aggregation question concerns aggregation notions that are speci cto the spatial databases. Most commonly, given a database and the output of a query over it, onewishes to form new queries about the volume of this output. One may also extend standard aggregatessuch as AVG, and ask for the average value of a polynomial over a spatial object. Such aggregatesarise both from practical concerns of GIS, and also as the natural continuous analogs of classicalaggregation queries. Thus, we would like to extend constraint query languages to incorporate theability to calculate volumes and other aggregates arising in the spatial setting.In attempting to add aggregation to constraint query languages, one immediately encounters somedaunting obstacles. While standard constraint databases are closed under rst-order operations suchas join and projection, they are clearly not closed under taking of volumes. This fact is well-knownin the literature [23, 27, 12], and stems from the fact that neither the semi-linear nor semi-algebraicsets are closed under integrals. To take an example from the semi-algebraic setting, a query askingfor the volume of initial slices of the epigraph of 1=x outputs the graph of the ln function, whileiterating volume queries in this fashion would give as output transcendental functions that are noteven expressible using eld operations, logarithms and exponents. Thus, one cannot hope to add ageneral volume operator to existing rst-order constraint query languages such as FO+Poly and geta closed language while still remaining within the domain of polynomial constraint databases.There are several approaches to the volume problem mentioned above. First, one could weaken therequirement that volumes be computed exactly and instead aim only to compute approximate volumes.Thus a query might have a tolerance associated with each instance of a volume operator, with outputrequired only to be correct within the given tolerance. There are a number of practical and theoreticalmotivations for this approach. While it is known that computing volumes of even simple geometricobjects (convex polytopes) is a hard problem (#P-hard, see [14]), approximation of volumes, atleast of convex sets, can be done in polynomial time by a randomized algorithm [15]. Moreover, incontrast to the well-known fact that semi-algebraic and semi-linear sets are not closed under volumeoperators, the papers [24, 25, 26] show that volumes of sets de nable with polynomial constraints canbe approximated, for any given 0, by a rst-order formula with polynomial constraints [24, 25, 26].By givingupexact volumeandsettlingfor an approximation, one might hopeto retain desirableclosureproperties.A second approach to the aggregation problem would be to expand out of the domain of polynomialconstraints, and add new functions to the signature of both the constraints and the query language.This would give the possibility of retaining a constraint-based representation of databases, whilegaining closure under volume operators. Of course, in this approach one should expand the constraintset so that it still de nes only topologically well-behaved objects.A third approach to the volume problem is to search for languages which can compute or approximatethe volumes of important classes of sets, but which may not be closed under iterative application of2volume operators. For example, one could allow volume and other aggregation operators to be appliedonly to a subclass of the input queries. Restrictions on the nesting of volume operators would thenhave to be imposed.An example of this last approach in the existing literature is [11], where it is shown that polynomialconstraint query languages can express the (exact) volume for any set that admits a special conditioncalled `variable independence . This condition means, informally, that in the constraint speci cationof sets in, say, R2, there is no interaction between x and y. Unfortunately, this condition is toorestrictive: it excludes many of the sets that arise most often in spatial applications.In this paper, we analyze the feasibility of each of the above approaches in detail. For the rstapproach, we show that techniques based on VC dimension, coming out of the work of [24, 25, 26] giveus approximate volume operators that give semi-algebraic output on semi-algebraic input. However,we show a number of shortcomings of such an approach. Not only are the approximate volumeoperators obtained according to the technique of [24, 25, 26] sensitive to the input representation,but the blow-up in the size of the constraint databases produced in query evaluation precludes anypossible use of these operators in practice.Turning to the second approach, we show that it is completely infeasible. No rst-order constraintlanguage based on any reasonably well-behaved class of functions can express, or even approximate,volume. In the process of showing this, we develop a new set of techniques for proving inexpressibilityresults, techniques not based on the usual method of reduction to generic queries.We then considersolutionsthat give upfullclosureundervolume, and give a numberof positiveresults.We present a higher-order language that allows one to calculate the volume of arbitrary semi-linearsets. Speci cally, we give a language, called FO+Poly+Sum, that has attractive closure properties,remains within the domain of polynomial constraint databases, and allows the exact calculation ofvolumes for linear-constraint input databases. This language also has the pleasant feature that it isclosed under the classical aggregation operators SUM and AVG. Since FO+Poly+Sum includes SQLaggregation, contains FO+Poly, and also allows one to make use of standard aggregation evaluationtechniques in calculating volumes, it seems to be a good candidate for the constraint analog of classicalaggregation languages.We remark that another approach to the aggregation problem was considered in [12], which gave anew aggregate operator , under which FO +Lin is closed. However, (X) = 0 for any bounded setX; thus, this operator cannot be used to deal with volumes.Organization Section 2 introducesthe notation. ApproximabilityisstudiedinSection 3. The methodof de ning approximate volumes of [24, 25, 26] is analyzed, and the main diculties in applying theapproximation operators coming from this work are outlined. Section 4 shows that approximatevolume operators cannot be de ned in rst-order constraint languages, even when the signature isexpanded. Section 5 de nes an extension of FO +Poly with SQL-like aggregation (summation overnite sets) and shows that this extension can express volumes of semi-linear databases.2 NotationStructures, instances, queries Most notations are fairly standard in the literature on constraintdatabases, cf. [5, 6, 32, 19]. Let M = hU; i be an in nite structure, where U is an in nite set, calleda universe (in the database literature often called the domain), and is a set of interpreted functions,3constants, and predicates. In the eld of constraint databases, most examples have U = R, the setof real numbers. Examples of signatures (and corresponding classes of constraints) that have beenconsidered are:Dense Order Constraints: hR; 0. We shall consider both nite and nitely representable instances. GivenM, an nite instance of SC over M is a family of nite sets, fR1;:::;Rlg, where Ri Upi. Thatis, each schema symbol Siof arity piis interpreted as a nite pi-ary relation over U. Given a niteinstance D, adom(D) denotes its active domain, that is, the set of all elements that occur in therelations in D.A nitely-representable (f.r.) instance of SC over M is a family of sets fX1;:::;Xlg, with Xi Upi,such that for each Xithere exists a quanti er-free formula i(x1;:::;xpi) in the language of M withXi= f~a2 Upij M j= i(~a)g. Most applications of constraint databases consider f.r. instances de nedover Rlin(these are called semi-linear sets) or over R (called semi-algebraic sets). For example, in thespatial setting, a f.r. instance interprets the schema predicates as semi-linear or semi-algebraic sets.As our basic query language, we consider relational calculus, or rst-order logic, FO, over the un-derlying structure and the database schema. In what follows, L(SC; ) stands for the language thatcontains all symbols of SC and ; FO(SC; ) is the class of all rst-order formulae built up from theatomic SC and formulae by using Boolean connectives _;^;: and quanti ers 8;9.Regardless of whether we are in the `classical setting, where these queries are applied to nitedatabases, or in the constraint query setting, we will refer to the above syntactic query languages asrelational calculus with constraints. This will be denoted by FO + . When is (+;?;0;1; 0,one can nd a formula (~x;z) that gives -approximation of volumes of sets (~a;R) = VolI(f~b j j= (~a;~b)g), see [24, 25, 26].We have to explain what we mean by approximating volume in this context. Clearly, we cannot hopeto nd (~x;z) with z de ning an -interval around the real value of the volume { then the volumeitself would be de nable as the center of the interval! Thus, we settle for less. Similar to [24, 25, 26],we say for every 0, that an operator Volis an -approximation operator if for every f.r., over M,set A 2 RnRm, given by a formula (~x;~y), Volreturns a f.r. set in RnR, given by (~x;z)such that :1. For every ~a2 Rn, (~a;) must be satis able (that is, M j= 9z: (~a;z));2. If M j= (~a;v), then jv?Vol( (~a;R))j 0 there is a L-query (~x;z) such that for anydatabase D, and any ~a, we have1) D j= 9z: (~x;z), and2) D j= (~a;v) implies jv?Vol( (~a;D))jag (we assume that 0, and let (~x;~y) be a FO + Poly query. Then for every semi-algebraic (resp.semi-linear) database instance D there exists a formula D(~x;z) over the real ordered eld (resp.group) such that D(~a;) is satis able for all ~a, and j= D(~a;v) implies jv ? VolI( (~a;D)) j 0, for R and Rlin6Since we want to examine those operators with regard to their eciency, we now review the ideas of[24, 25, 26] that lead to this theorem.Pre-requisites from Computational Learning Randomized methods for computing volumeshave been known and used for a long time. If we have a set X In Rn, take k points x1, ., xkfrom the uniform distribution on In. Then Vol(X) can be approximated as vX=Pki=1X(xi)=k,where Xis the characteristic function of the set X: X(x) = 1 if x2X and X(x) = 0 if x62X. Itis well known that for 0,P(jvX?Vol(X)j )) 0. Let M 0 be given, and assume that an M-point sample C = f~c1;:::;~cMgis randomly chosen in Im. For each ~a, let v(~a;C) be the fraction of C that falls into (~a;R) \Im.Then one wants to achievejv(~a;X)?VolI( (~a;R))jmax(4log2;8dlog13), wheredis the ( nite) VC dimension of the de nable familyF (R) = f (~a;R) j~a2 Rng 2Rm.In the construction of approximating formulae, we shall use the following corollary of this result, thatstates the existence of so-called -nets:Fact 1 (-nets) Let (~x;~y) be a rst-order formula over the real eld R, with j~yj= m, and let 0.Let d = VCdim(F (R)). If M 8dlog13, then there exists an M-element set C = f~c1;:::;~cMg Imsuch that for every ~a wit