
Available online at www.sciencedirect.com

The Journal of Systems and Software 81 (2008) 1931–1943 www.elsevier.com/locate/jss

Benchmarking temporal database models with interval-based and temporal element-based timestamping ☆

Seo-Young Noh a,*, Shashi K. Gadia b

a LG Electronics Advanced Research Institute, 16 Woomyeon, Seocho, Seoul 137-724, Republic of Korea
b Department of Computer Science, Iowa State University, Ames, Iowa 50010, USA

Received 1 September 2007; received in revised form 12 January 2008; accepted 15 January 2008
Available online 1 February 2008

Abstract

Since the mid-1980s, there has been a debate about which data model is most appropriate for temporal databases. A fundamental choice one has to make is whether to use intervals of time or temporal elements to timestamp objects and events with their periods of validity. The advantage of using interval timestamps is that Start and End columns can be added to relations so that they can be treated within the framework of classical databases, leading to quick implementation. Temporal elements are finite unions of intervals. The advantage of temporal elements is that timestamps become implicitly associated with values, tuples, and relations. Furthermore, since temporal elements are, by design, closed under set-theoretic operations such as union, intersection, and complementation, they lead to query languages that are natural. Here, we investigate the ease of use as well as the system performance of the two approaches to help settle the debate.

© 2008 Elsevier Inc. All rights reserved.

Keywords: Temporal element-based data model; Interval-based data model; Model comparisons; Performance evaluations; XML-based temporal DBMS

1. Introduction

Temporal data models are relatively complicated because they must capture the temporal characteristics of data. Researchers have developed many types of temporal data models, which can be categorized by two criteria: domain representation and timestamping. Domain representation distinguishes temporal data models by the way domains for values are represented. There are three types of domain representation: point-based, interval-based, and temporal element-based. Timestamping, the other criterion, is the way temporal domains are associated with values. There are two timestamping methods: tuple-level and attribute-level timestamping.

☆ This work has been sponsored in part by the Information Infrastructure Institute and a grant from the Baker Foundation at Iowa State University.
* Corresponding author. E-mail address: rsyoung@lge.com (S.-Y. Noh).

0164-1212/$ - see front matter © 2008 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2008.01.015

Despite extensive research on temporal data models, comparisons of models in terms of ease of use and system performance are hard to find. In this paper, we compare interval-based and temporal element-based data models, evaluate their usability, and measure system performance in terms of disk block accesses. The former data model uses intervals for temporal domains and tuple-level timestamping, while the latter uses temporal elements and attribute-level timestamping. We introduce two temporal query languages, ISQL (Interval-based Structured Query Language) and ParaSQL (Parametric Structured Query Language), which are query languages for the interval-based data model and the temporal element-based data model, respectively.¹ To compare the two data models for ease of use, we develop a query suite that extends the preliminary version of the queries introduced in Jensen et al. (1993). The

1 ISQL is a hypothetical query language for interval-based data models.



query suite is stated in plain English; thus, it is independent of data models and query languages. In addition, we implement the two temporal data models. In our implementation, XML is used as the main implementation platform. For the interval-based data model, we follow the industry-standard binary structure. However, tuples are encapsulated into XML elements by a primitive iterator. Because of this, the query execution engine of the data model treats tuples as XML elements so that it can seamlessly utilize XML for query evaluation. For the temporal element-based data model, we utilize XML for the parse tree, the expression tree, and the query execution engine, as in the interval-based data model. However, it stores relations in XML as multiple small, self-contained XML pages on disk.

Our benchmarks essentially compare the complexity of ISQL and ParaSQL queries and the system performance of the two data models. They reveal that ParaSQL is more user-friendly than ISQL for the query suite. The ISQL queries frequently invoke self-joins to combine the scattered tuples of an object, resulting in complex queries. One interesting outcome is that although self-joins increase the query complexity of ISQL, they do not significantly degrade system performance if a special treatment is used. This treatment is a one-pass algorithm and gives ISQL the maximum benefit of the doubt. However, it must be emphasized that ISQL queries perform marginally better than those in ParaSQL where only literal execution is considered. In addition to the performance evaluation, we remark that the interval-based data model depends upon the order properties of instants of time, whereas the temporal element-based data model is based upon set operations on domains, which can seamlessly extend to spatiotemporal databases. In spatial databases, the luxury of using intervals is not even available; spatial domains are far more complex.

The rest of this paper is organized as follows. Section 2 briefly reviews related work on model comparisons. Sections 3 and 4 introduce the interval-based and temporal element-based data models, respectively. Section 5 compares the two query languages based on the query suite. Section 6 presents an overview of the system implementation. Section 7 discusses the system performance of the two data models. Section 8 concludes this paper with our observations.

2. Related work

There are three different types of domain representation in the literature on temporal data models. The point-based method views the time domain as a discrete, linearly ordered set without end points. The interval-based method represents the domain of an object as the continuous maximum time interval. The temporal element-based scheme represents the domain of an object as a finite union of time intervals.

In the temporal database community, there has been some debate between point-based and interval-based data models for temporal data. Toman (1996) compared point-based and interval-based data models and pointed out two problems caused by interval-based data models. First, the query languages are complicated because they explicitly refer to intervals rather than to individual time instants. Second, the formulation of queries in an interval-based query language is not natural. On the other hand, Böhlen et al. (1998) advocated interval-based temporal data models. They indicated that queries in point-based models treat snapshot-equivalent argument relations identically, which renders point-based models insensitive to coalescing.

Despite the debate between point-based and interval-based data models, there is agreement on system implementation from a practical viewpoint. Both sides have acknowledged the advantages of interval-based data models in system implementations.² Toman (1997) and Böhlen (2004) have provided different approaches to unifying point-based and interval-based temporal data models.

2 As Toman (1996) and Böhlen et al. (1998) have indicated, the implementation of point-based models is impractical.

Temporal element-based data models were introduced in the mid-1980s (Gadia and Vaishnav, 1985; Gadia, 1988; Tansel et al., 1993; Gadia and Chopra, 1993). The purpose of temporal elements is to obtain timestamps that are closed under the set-theoretic operations of union, intersection, and complementation. Temporal element-based frameworks model an object in terms of a single tuple rather than multiple tuples.

Although there are broadly three different approaches to modeling temporal data, very little work has been done on comparing interval-based and temporal element-based data models. The focus of this paper is on the use of intervals vs. temporal elements.

Most implemented temporal databases fall into two groups: temporal database systems over relational databases and temporal database systems over object-oriented databases. Using relational databases is the most popular approach to implementing temporal databases (Hubler and Edelweiss, 2000). In addition to these conventional approaches, there is an emerging trend of incorporating XML technology into database research. Wang and Zaniolo (2004) used XML for their bitemporal data model and showed that XML and XQuery, the general-purpose native query language for XML, can be used to query historical data.

3. Interval-based models

3.1. Overview

We consider an interval-based data model that uses time intervals and tuple-level timestamping. Such a data model uses Start and End attributes to specify the period of validity of the information in the tuple (Snodgrass, 1987; Navathe and Ahmed, 1993).


ISQL :=
    SELECT <attribute list>
    [RESTRICTED TO <interval> | <instant>]
    FROM <relation list>
    [WHERE <boolean expression>]

Fig. 3. Simplified BNF of ISQL.

Fig. 1. Interval-based Emp relation.

Fig. 1 shows an interval-based temporal relation. The Emp relation maintains the history of employees with name, salary, and department information. As shown in Fig. 1, every tuple has Start and End attributes that give the temporal domain of the tuple.

XML is the base platform for the interval-based data model. Although the storage handles binary pages, iterators retrieve tuples and encapsulate them into XML elements because the upper layer handles elements for query execution. The storage layer is transparent to the upper layers, which treat it as an XML storage. Fig. 2 shows how tuples in the Emp relation of the interval-based data model can be encapsulated in XML. Note that the physical view is binary pages containing tuples, as in relational databases. As shown in Fig. 2, a Relation element consists of multiple tuple elements. Every tuple element has five attributes. XML-tagged tuples are generated by the storage manager and passed to the query execution engine, which processes XML elements.

<Relation name="Emp">
  <tuple>
    <name>Tom</name>
    <salary>45000</salary>
    <dept>Hardware</dept>
    <start>0</start>
    <end>20</end>
  </tuple>
  <tuple>
    <name>Tom</name>
    <salary>50000</salary>
    <dept>Sales</dept>
    <start>41</start>
    <end>51</end>
  </tuple>
  <!-- the rest tuples are omitted -->
</Relation>

Fig. 2. XML representation of Emp relation.

3.2. Interval-based structured query language

For the sake of simplicity, we define ISQL (Interval-based Structured Query Language), which is a hypothetical query language and a placeholder for many interval-based data models. Because it is impractical to implement and compare all interval-based query languages, a hypothetical query language is desirable for our purposes. Fig. 3 shows the skeleton of the ISQL select statement.

All clauses in ISQL have the same functionality as in classical SQL except the RESTRICTED TO clause. This clause captures specific time domain information. Note that the clause can contain only a single time interval or a single time instant. It does not allow a nested form because not every domain of a nested query is an interval. For example, some tuples of an object can be screened out by a Boolean expression, so that the domain of the resulting relation is a union of intervals, which is not an interval.

In ISQL, nested queries are allowed in the WHERE clause as in classical SQL. A Boolean expression determines whether a given tuple satisfies the Boolean condition. It has the same functionality as in classical SQL in that it either qualifies or disqualifies a tuple.

4. Temporal element-based data model

4.1. Overview

In the temporal element-based data model, attributes are defined as functions. One important requirement is that the domains of the functions should be closed under set-theoretic operations such as union, intersection, and complementation. To obtain timestamps that are closed under such operations, the concept of temporal elements was introduced in Gadia and Vaishnav (1985), Gadia (1988), Tansel et al. (1993), and Gadia and Chopra (1993). A temporal element is defined as a finite union of time intervals. Fig. 4 shows the Emp relation of the temporal element-based data model.

Fig. 4. Temporal element-based Emp relation.
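The closure of temporal elements under union, intersection, and complementation can be illustrated with a small sketch. It is not the authors' implementation: it assumes integer time instants, represents a temporal element as a sorted list of closed intervals, and assumes a bounded universe of time [0, 100] so that the complement is finite. The function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]          # closed interval [start, end] over integer instants
TemporalElement = List[Interval]    # finite union of intervals

UNIVERSE: Interval = (0, 100)       # assumed bounded time line (not from the paper)


def normalize(intervals: TemporalElement) -> TemporalElement:
    """Sort intervals and merge those that overlap or are adjacent."""
    merged: TemporalElement = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def union(a: TemporalElement, b: TemporalElement) -> TemporalElement:
    return normalize(a + b)


def intersection(a: TemporalElement, b: TemporalElement) -> TemporalElement:
    out: TemporalElement = []
    for s1, e1 in a:
        for s2, e2 in b:
            lo, hi = max(s1, s2), min(e1, e2)
            if lo <= hi:
                out.append((lo, hi))
    return normalize(out)


def complement(a: TemporalElement, universe: Interval = UNIVERSE) -> TemporalElement:
    """Instants of the universe that are not covered by the temporal element."""
    lo, hi = universe
    out: TemporalElement = []
    cursor = lo
    for start, end in intersection(a, [universe]):  # clip to the universe first
        if start > cursor:
            out.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= hi:
        out.append((cursor, hi))
    return out


tom = [(0, 20), (41, 51)]   # a temporal element with two intervals
other = [(11, 60)]          # a temporal element with a single interval

print(union(tom, other))         # one merged interval
print(intersection(tom, other))  # two disjoint intervals
print(complement(tom))           # the gaps within the assumed universe
```

Every result is again a list of intervals, that is, a temporal element. By contrast, the intersection of `tom` and `other` is not a single interval, which is why an interval-only model cannot represent it as one timestamp.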


It is worth noting that information about an object (or event) is modeled in a single tuple, whereas in the interval-based data model an object is modeled in multiple tuples. In Fig. 4, examples of temporal elements are $[0,20] \cup [41,51]$ and $[11,60]$. Note that an instant, such as 11, can be represented by the interval $[11,11]$, which is also a temporal element.

Since attributes are functions in the temporal element-based data model, they capture the changing values of attributes. An example of a temporal value of the Salary attribute in the first tuple of the Emp relation is $\langle [0,20]\ 45000,\ [41,51]\ 50000 \rangle$. If $\xi$ is a temporal value, $[\![\xi]\!]$ denotes its domain. Thus, the domain of Salary in the first tuple is as follows:

$$[\![\langle [0,20]\ 45000,\ [41,51]\ 50000 \rangle]\!] = [0,20] \cup [41,51]$$

Fig. 5 shows an XML representation of the temporal element-based Emp relation. It is worth noting that, physically, XML documents are paginated and all pages are represented in binary format, which has the same effect as compression.³ The XML style shown in Fig. 5 does not follow the style of the Emp relation shown in Fig. 4. In the Emp relation, an attribute value contains its domain. If this were a constraint to be followed in our XML representation, the name attribute would have to be expressed as mixed content, as shown in Fig. 6.

3 In the practical implementation, element names were abbreviated to reduce database sizes.

<Relation name="Emp">
  <tuple>
    <attribute name="Name">
      <attr>Tom</attr>
      <pdom>
        <dunit>[0,20]</dunit>
        <dunit>[41,51]</dunit>
      </pdom>
    </attribute>
    <attribute name="Salary">
      <attr>45000</attr>
      <pdom><dunit>[0,20]</dunit></pdom>
      <attr>50000</attr>
      <pdom><dunit>[41,51]</dunit></pdom>
    </attribute>
    <attribute name="Dept">
      <attr>Hardware</attr>
      <pdom><dunit>[0,20]</dunit></pdom>
      <attr>Sales</attr>
      <pdom><dunit>[41,51]</dunit></pdom>
    </attribute>
    <pdom>
      <dunit>[0,20]</dunit>
      <dunit>[41,51]</dunit>
    </pdom>
  </tuple>
  <!-- the second tuple has been omitted -->
</Relation>

Fig. 5. Temporal element-based XML representation of Emp relation.

<attribute name="Name">
  <attr>Tom
    <pdom>
      <dunit>[0,20]</dunit>
      <dunit>[41,51]</dunit>
    </pdom>
  </attr>
</attribute>

Fig. 6. Attribute representation with mixed content.

Although Fig. 6 follows the style of the parametric temporal relation, it is generally considered poorly designed XML because there is no way to validate XML values using DTDs or XML Schemas. Therefore, we use the XML representation shown in Fig. 5.

4.2. Parametric structured query language

ParaSQL (Parametric Structured Query Language) is a query language for the temporal element-based model. Fig. 7 shows the simplified select statement of ParaSQL.

ParaSQL :=
    SELECT <attribute list>
    [RESTRICTED TO <domain expression>]
    FROM <relation list>
    [WHERE <boolean expression>]

Fig. 7. Simplified BNF of ParaSQL.

ParaSQL has the same clauses as ISQL. However, ParaSQL's RESTRICTED TO clause is more versatile than ISQL's. ParaSQL allows a domain expression, which can be an instant, an interval, a temporal element, or a nested select statement, while ISQL allows only a time interval or an instant in the clause. Since temporal elements are used for temporal domains, it is possible to nest another select statement in a domain expression. Note that the set of temporal elements is closed under union, intersection, and complementation.

A domain expression returns a domain, and it can be used in the RESTRICTED TO clause to restrict the domain of tuples (or objects) filtered by a Boolean expression. A Boolean expression has the same functionality as in classical SQL, but it differs from classical SQL because it can be constructed from domain expressions with set operations (Noh and Gadia, 2004). For example, the Boolean expression E.DName='Software' is an abbreviation of $[\![\text{E.DName} = \text{'Software'}]\!] \neq \emptyset$, which means that at some time the employee worked in the Software department.

5. Query suite and query comparisons

5.1. Query suite development

Developing a query suite is important in model comparisons and system evaluations. In the literature of temporal



databases, we can find many different types of query suites. Jensen et al. (1993) have provided a benchmark query suite for temporal databases. The query suite is dependent on TSQL. Wang and Zaniolo (2004) have introduced a query suite that is categorized into five types: relation scan, history, snapshot, interval, and temporal join.

We have developed our own query suite, which extends the preliminary version of the queries introduced in the appendix of Jensen et al. (1993). It includes the categories introduced by Wang and Zaniolo. However, our query suite has some additional types. For example, some query characteristics, such as an instant or an interval, are known before a query is executed (at parsing time, or explicitly), but sometimes the characteristics are only implicit before execution.

5.2. Query comparisons

Our proposed query suite consists of 10 English queries, and the schemas used in our benchmarks are as follows:

ISQL:
    Emp(Name, Salary, DName, Start, End)
    Dept(DName, MName, Start, End)

ParaSQL:
    Emp(Name, Salary, DName)
    Dept(DName, MName)
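To make the difference between the two Emp schemas concrete, the following sketch groups interval-stamped rows into one object per key with attribute-level timestamps. It is only an illustration: the dictionary layout and the function name `to_objects` are assumptions made here, and the sample rows are the two Tom tuples shown in Fig. 2.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]

# Interval-based rows following the ISQL schema Emp(Name, Salary, DName, Start, End).
interval_rows = [
    ("Tom", 45000, "Hardware", 0, 20),
    ("Tom", 50000, "Sales", 41, 51),
]


def to_objects(rows) -> Dict[str, Dict[str, Dict[object, List[Interval]]]]:
    """Group rows by Name so that each object becomes a single tuple.

    Each attribute maps a value to the list of intervals during which the
    attribute held that value (attribute-level timestamping).
    """
    objects: Dict[str, Dict[str, Dict[object, List[Interval]]]] = {}
    for name, salary, dname, start, end in rows:
        obj = objects.setdefault(name, {"Name": {}, "Salary": {}, "DName": {}})
        for attr, value in (("Name", name), ("Salary", salary), ("DName", dname)):
            obj[attr].setdefault(value, []).append((start, end))
    return objects


emp_objects = to_objects(interval_rows)
print(emp_objects["Tom"]["Name"])    # one value, timestamped with two intervals
print(emp_objects["Tom"]["Salary"])  # two values, one interval each
```

In this layout the whole history of Tom is reachable from one entry, whereas the interval-based rows have to be recombined by name. The sketch does not merge adjacent intervals; a fuller version would coalesce them.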

Query 1. Give name and salary (history) of all employees.

ISQL:
    SELECT E.Name, E.Salary
    FROM Emp E

ParaSQL:
    SELECT E.Name, E.Salary
    FROM Emp E

ISQL and ParaSQL are the same for this query. However, their internal procedures differ because ISQL retrieves partial information about an object at a time, while ParaSQL retrieves the whole information about the object.

Query 2. Retrieve the salary history of employee Bob.

ISQL:
    SELECT E.Salary
    FROM Emp E
    WHERE E.Name="Bob"

ParaSQL:
    SELECT E.Salary
    FROM Emp E
    WHERE E.Name="Bob"

The ISQL and ParaSQL queries are identical. However, we must note that ParaSQL's WHERE clause and ISQL's WHERE clause differ in that ParaSQL's WHERE clause potentially considers the whole information about an object, while ISQL's WHERE clause considers one tuple at a time.

It is worth noting why ISQL and ParaSQL have the same structure for Query 2. It is because the key of the Emp relation is used in the Boolean condition. If the Boolean condition compares a non-key attribute, then we can find the difference between ISQL and ParaSQL, as shown in Query 2*.

Query 2*. Retrieve the history of employees whose salary was (is) greater than $50,000.

ISQL:
    SELECT E1.*
    FROM Emp E1, Emp E2
    WHERE E1.Name = E2.Name
    AND E2.Salary > 50000

ParaSQL:
    SELECT *
    FROM Emp E
    WHERE E.Salary>50000

Since an object in the interval-based data model is scattered over multiple tuples, the ISQL query needs a self-join to combine tuples for a condition on a non-key attribute. The ParaSQL query, however, is the same as in Query 2 except for the Boolean condition.

Query 3. Give name and salary (history) of all employees working currently.

ISQL:
    SELECT E1.Name, E1.Salary
    FROM Emp E1, Emp E2
    WHERE E1.Name = E2.Name
    AND E2.End = NOW

ParaSQL:
    SELECT E.Name, E.Salary
    FROM Emp E
    WHERE NOW subset [[E.Name]]

Since this is a nonpointic query,⁴ a Boolean condition in the WHERE clause should be used. As noted, ISQL introduces a self-join operation, while ParaSQL uses a relation scan. In ParaSQL, the Boolean expression NOW subset [[E.Name]] is true if the current time is a subset of the tuple's domain, because Name is the key of the Emp relation. The Boolean expression can be rewritten as NOW subset [[E]].

Query 4. Give name and salary at instant 60 of all employees.

ISQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [60,60]
    FROM Emp E

ParaSQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [60,60]
    FROM Emp E

The ISQL and ParaSQL queries are identical because Query 4 requests information for a single time instant, which is an atomic granule. However, the fundamental difference between the two data models becomes visible if Query 4 is changed to Query 4*, which requests information for two disjoint time instants.

4 A nonpointic query extracts all information about an object while a pointic query extracts a part of information about an object.



Query 4*. Give name and salary at instant 60 or 70 of all employees.

ISQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [60,60]
    FROM Emp E
    UNION
    SELECT E.Name, E.Salary
    RESTRICTED TO [70,70]
    FROM Emp E

ParaSQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [60,60] + [70,70]
    FROM Emp E

In ISQL, the RESTRICTED TO clause allows only a single time instant or a single interval, so the union of two time instants cannot be expressed in the clause. Note that ISQL's base model is the interval-based temporal data model. Therefore, two select statements, one per time instant, must be combined by a union. In general, ISQL requires k select statements for k disjoint intervals, which makes ISQL more complex. In contrast, ParaSQL expresses Query 4* in the same way as Query 4, which reflects the natural-language query more directly. The domain expression [60,60] + [70,70] means the union of two time instants, i.e., $[60,60] \cup [70,70]$.

Query 5. Find all employees throughout interval [60, 80].

ISQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [60,80]
    FROM Emp E

ParaSQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [60,80]
    FROM Emp E

The query structures of ISQL and ParaSQL are the same, but the internal procedures differ, as discussed for Query 4. If the English query needs multiple disjoint intervals, the ISQL query must be changed to multiple unions of select statements, while the ParaSQL query simply adds multiple unions of intervals in the RESTRICTED TO clause.

Query 6. Retrieve names of employees and their managers.

ISQL:
    SELECT E.Name, D.MName
    RESTRICTED TO [[E]]*[[D]]
    FROM Emp E, Dept D
    WHERE E.DName = D.DName

ParaSQL:
    SELECT E.Name, D.MName
    RESTRICTED TO [[E.DName=D.DName]]
    FROM Emp E, Dept D

ISQL restricts joined tuples to the intersection of the domains of the tuples from the Emp and Dept relations. The expressions [[E]] and [[D]] can be replaced with [[E.Name]] and [[D.DName]], respectively, because all attributes in the same tuple share the same interval. Note that [[E]]*[[D]] is legal because the intersection of two intervals is an interval. In ParaSQL, the RESTRICTED TO clause is used to join the two relations.

Query 7. Give name and salary history during the time John was a manager of Software department.

ISQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [[E]]*[[D]]
    FROM Emp E, Dept D
    WHERE E.DName = D.DName
    AND D.MName="John"
    AND D.DName="Software"

ParaSQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [[ SELECT *
                     RESTRICTED TO [[D.DName="Software"]]
                     FROM Dept D
                     WHERE D.MName="John" ]]
    FROM Emp E

ISQL needs to join two relations, while ParaSQL scans a relation with the nested select statement. ParaSQL is closer to natural languages in that a user does not think about the query in terms of tuples, but in terms of objects. In the ParaSQL query, it should be emphasized that the nested query is separate from the outer query, that is, there is no relationship between the Dept and Emp relations. Therefore, a smart query executor can process them independently to reduce disk accesses.

Query 8. Give name and salary history during the time John was not a manager of Software department.

ISQL:
    SELECT E.Name, E.Salary
    FROM Emp E
    DIFFERENCE
    SELECT E.Name, E.Salary
    RESTRICTED TO [[E]]*[[D]]
    FROM Emp E, Dept D
    WHERE E.DName = D.DName
    AND D.MName="John"
    AND D.DName="Software"

ParaSQL:
    SELECT E.Name, E.Salary
    RESTRICTED TO [[ SELECT *
                     RESTRICTED TO [[D.DName="Software"]]
                     FROM Dept D
                     WHERE D.MName="John" ]]
    FROM Emp E

The notation, ‘’, in the RESTRICTED TO clause represents the complementation of a domain expression. This ParaSQL query is exactly the same as that of Query 7 except that it has the complementation of a nested select statement whose output is a



temporal element. It is legal because the set of temporal elements is closed under complementation. Therefore, ParaSQL expresses Query 8 naturally. In contrast, the ISQL query needs two steps. First, it retrieves all name and salary information of employees while John was a manager of the Software department. Second, it subtracts the tuples of the first step from the Emp relation.⁵ This does not follow the way the query is stated in natural language.

Query 9. Give current departments of employees who have worked in Hardware or Software department.

ISQL:
    SELECT E1.DName
    RESTRICTED TO NOW
    FROM Emp E1, Emp E2
    WHERE E1.Name=E2.Name
    AND (E1.DName="Hardware"
         OR E1.DName="Software")
    AND E2.End=NOW

ParaSQL:
    SELECT E.DName
    RESTRICTED TO NOW
    FROM Emp E
    WHERE E.DName="Hardware"
    OR E.DName="Software"

In ParaSQL, the domain over which a condition is satisfied and the domain to be retrieved can be independent of each other because both domains are in the same tuple. In ISQL, however, the domains reside in different tuples, and self-joins are inevitable.

Query 10. Give current departments of employees who have worked in Hardware and Software departments.

ISQL:
    SELECT E1.DName
    RESTRICTED TO NOW
    FROM Emp E1, Emp E2, Emp E3
    WHERE E1.Name = E2.Name
    AND E2.Name=E3.Name
    AND E1.DName="Hardware"
    AND E2.DName="Software"
    AND E3.End=NOW

ParaSQL:
    SELECT E.DName
    RESTRICTED TO NOW
    FROM Emp E
    WHERE E.DName="Hardware"
    AND E.DName="Software"

This natural-language query only changes the "or" condition of Query 9 to an "and" condition. ParaSQL mimics the behavior of the natural language, so the complexity of the query does not change. ISQL needs a 3-way self-join for 2 conjunctive conditions. In general, the interval-based data model needs an $(n+1)$-way self-join for a conjunction of $n$ conditions (Gadia et al., 1993).

5 In this query, we assume that $[a,b] - [c,d]$ is equal to $[a,b] \cap [c,d]^{c}$.
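The contrast between an $(n+1)$-way self-join and a single scan can be sketched for the English statement of Query 10. This is an illustrative sketch rather than an ISQL or ParaSQL evaluator: the sample rows, the value of NOW, and the function names are invented for the example, and both functions report the department of whichever tuple or value is current, which is one reading of the English query.

```python
from itertools import product

NOW = 60  # assumed current instant (hypothetical)

# Hypothetical interval-based rows: (Name, Salary, DName, Start, End).
rows = [
    ("Tom", 45000, "Hardware", 0, 20),
    ("Tom", 50000, "Software", 21, 60),
    ("Ann", 40000, "Sales", 0, 30),
    ("Ann", 42000, "Hardware", 31, 60),
]


def query10_interval(rows):
    """3-way self-join: one alias per condition plus one for the current tuple."""
    result = set()
    for e1, e2, e3 in product(rows, repeat=3):
        same_object = e1[0] == e2[0] == e3[0]
        if (same_object and e1[2] == "Hardware"
                and e2[2] == "Software" and e3[4] == NOW):
            result.add((e3[0], e3[2]))
    return result


def group_by_name(rows):
    """Build one department history per employee (attribute-level timestamps)."""
    objects = {}
    for name, _salary, dname, start, end in rows:
        objects.setdefault(name, {}).setdefault(dname, []).append((start, end))
    return objects


def query10_temporal_element(objects):
    """Single scan: both conditions are checked inside the same object."""
    result = set()
    for name, dept_history in objects.items():
        # Each condition holds when its domain in the object is non-empty.
        if "Hardware" in dept_history and "Software" in dept_history:
            for dname, intervals in dept_history.items():
                if any(start <= NOW <= end for start, end in intervals):
                    result.add((name, dname))
    return result


print(query10_interval(rows))
print(query10_temporal_element(group_by_name(rows)))
```

Adding a third conjunct would require a fourth alias in `query10_interval`, while `query10_temporal_element` would only gain one more membership test.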


6. System implementation

6.1. System architecture

We have developed two temporal database systems, named the I-based (interval-based) system and the TE-based (temporal element-based) system. Because the two systems have similar high-level components except for storage (binary relation storage vs. XML storage), we discuss an abstract-level system architecture that combines the two temporal database systems. Fig. 8 shows the system architecture.

To execute an ISQL query, the I-based system uses the ISQL parser to transform the query into a parse tree represented in XML. The ISQL expression tree generator then transforms the parse tree into an expression tree, which is also an XML document. The expression tree is executed by the ISQL query execution engine. As shown in the figure, when the system generates XML documents and transforms them into other XML documents, it uses the DOM API (W3C, 2007). The TE-based system executes a ParaSQL query in a similar manner. However, the ParaSQL query execution engine uses the DiskDOM API (Ma, 2004) instead of directly accessing W3C's DOM API. The DiskDOM API is a customized API for processing XML pages, which carry additional information added when XML is paginated into smaller pages. The API is transparent to clients, and clients such as the query execution engine use it as if it were the DOM API.

The I-based system uses a binary relation storage, while the TE-based system uses an XML-based storage. Although the two storages are internally different, they follow the general functions of database storages at a high level. They communicate with their space managers in terms of pages. The space managers communicate with the buffer managers. The two storages have their own iterators, which behave differently. The iterator of the I-based system retrieves one tuple at a time in binary format, wraps the tuple in an XML element by adding the necessary tags, and returns it to the ISQL query execution engine. The iterator of the TE-based system works similarly, but it returns a node (or element) requested by the ParaSQL query execution engine. In general, a tuple in the TE-based system may exceed the size of a page because the temporal element-based data model accumulates history information in a single tuple. Therefore, instead of returning the entire data of a tuple, the TE-based system returns only the node that is needed. If the query engine needs to access a specific part of a tuple, the buffer manager provides the corresponding page, which is a self-contained XML document. In this way, the system can overcome the memory restrictions on loading huge XML documents.

6.2. Expression trees

All queries introduced in Section 5.2 are transformed into expression trees, which, coincidentally, are represented in XML. For example, Fig. 9(a) shows an expression tree



Fig. 8. System architecture.

Fig. 9. Expression tree and its XML representation. (a) An expression tree and (b) XML representation.

of ParaSQL's Query 6 in Section 5.2.⁶ The expression tree is an abstract-level description and is transformed to an XML document. The XML representation of the expression tree is shown in Fig. 9(b). In the XML representation, the projection, restriction, and where-condition nodes are at the same level, which differs from the abstract expression tree. However, the representation is analyzed by the query executors in the same order in which the nodes appear in the abstract expression tree (Noh and Gadia, 2005a, 2006).

6 All ISQL queries are similarly represented in XML.

Representing an expression tree as a text-based XML document instead of a traditional pointer-based data structure is more appropriate. Such a document is human-readable, and it can be processed with a DOM parser using very high-level code that is also easy to comprehend. This use of XML is independent of the main theme of this paper. Note that an expression tree is a main-memory data structure and efficiency issues are of marginal significance in databases where most time is spent in query processing.

6.3. Query execution

The query execution layer provides execution results for expression trees. The ISQL query executor analyzes, navigates, and executes ISQL expression trees by using the DOM API. Like the ISQL query executor, the ParaSQL query executor uses the DOM API to analyze and navigate ParaSQL queries, but it uses the DiskDOM API when it retrieves disk blocks containing tuples from disk. The DiskDOM API is capable of navigating nodes (or elements) that reside in different pages as a result of pagination.

Algorithm 1 shows the query execution procedure, which applies abstractly to both the ISQL and ParaSQL query executors.⁷ It determines which iterator should be used to process the given expression tree. The expression tree has information on iterators, and the information is extracted by the query executor. Once an iterator is determined, the iterator retrieves and qualifies tuples. A qualified tuple is passed to the RESTRICTION function as an argument, together with the expression tree, to restrict the domain of the tuple.

Algorithm 1. Query execution algorithm

procedure QUERYEXECUTION(e)                ▷ e: expression tree
    if e has join condition then
        it ← Join(e)                       ▷ it: iterator
    else if e has a relation scan then
        it ← RelationScan(e)
    end if
    while it.hasNext() = true do
        tuple ← it.getNext()               ▷ retrieve a tuple
        tuple ← Restriction(e, tuple)      ▷ restrict domain of a tuple
        if tuple ≠ null then
            output(tuple)                  ▷ write a tuple
        end if
    end while
end procedure

7 The details of the algorithm can be found in Noh and Gadia (2006, in press).
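The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the expression tree is assumed to be a plain dict, and the key names (`join`, `left`, `right`, `relation`, `restriction`), the nested-loop `join`, and the `restrict_to` helper are all hypothetical stand-ins for what the DOM-based executors would extract from the XML expression tree.

```python
from typing import Callable, Iterator, Optional

Row = dict  # a tuple is modeled as a plain dict in this sketch


def relation_scan(relation: list[Row]) -> Iterator[Row]:
    """Iterator over the tuples of a single relation."""
    yield from relation


def join(left: list[Row], right: list[Row],
         condition: Callable[[Row, Row], bool]) -> Iterator[Row]:
    """Naive nested-loop join; attribute names are assumed to be distinct."""
    for l_row in left:
        for r_row in right:
            if condition(l_row, r_row):
                yield {**l_row, **r_row}


def query_execution(e: dict) -> list[Row]:
    """Pick an iterator from the expression tree, then restrict each tuple."""
    if "join" in e:                       # e has a join condition
        it = join(e["left"], e["right"], e["join"])
    elif "relation" in e:                 # e has a relation scan
        it = relation_scan(e["relation"])
    else:
        raise ValueError("expression tree names no iterator")

    output = []
    for row in it:                        # plays the role of hasNext()/getNext()
        row = e["restriction"](row)       # restrict the domain of the tuple
        if row is not None:               # a tuple with an empty domain is dropped
            output.append(row)
    return output


def restrict_to(lo: int, hi: int) -> Callable[[Row], Optional[Row]]:
    """Hypothetical restriction: clip [Start, End) to the window [lo, hi)."""
    def restriction(row: Row) -> Optional[Row]:
        start, end = max(row["Start"], lo), min(row["End"], hi)
        return {**row, "Start": start, "End": end} if start < end else None
    return restriction
```

For example, `query_execution({"relation": emp, "restriction": restrict_to(4, 10)})` scans `emp` once and returns only the tuples whose clipped interval is non-empty. A Python generator stands in for the explicit iterator interface of the pseudocode.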


6.4. Storage management

The storage management layer handles page requests from its upper layer, providing each requested page from disk. The storage manager of the I-based system manages pages that contain tuples in a binary format, while that of the TE-based system manages paginated XML data. Although the two storage managers work differently, their space and buffer managers have the same functionality. On each request, they retrieve one page at a time from disk. They provide specialized iterators so that the upper layers can retrieve nodes (or elements) from loaded pages. Both storage managers have buffer managers to reduce the number of disk accesses for repeatedly used pages.

The storage manager of the I-based system follows the general design of a storage manager in conventional relational database systems, except that it returns each tuple encapsulated in an XML element. Therefore, we focus on the storage of the TE-based system. To model the temporal element-based data model efficiently using XML, we have developed our own storage technology for XML called CanStoreX (Canonical Storage for XML) (Ma, 2004). To facilitate pagination of an XML document, our storage technology adds some auxiliary nodes to the document, and pages are self-contained XML documents in their own right. To a client of our DOM API, auxiliary nodes and page boundaries are transparent.

Fig. 10 shows an XML document and a corresponding paginated XML document. In the paginated XML document, a c-node contains a page ID pointing to a child node that resides in another page, while an f-node groups a sequence of one or more child nodes that are pointed to by a c-node. CanStoreX uses a pagination algorithm called dynamic pagination, which is based on depth-first search. In Fig. 10(b), the black-colored nodes are XML elements that are visited twice by the depth-first search. A detailed explanation of CanStoreX is beyond the scope of this paper.
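The idea that c-nodes and f-nodes stay invisible to a DOM client can be illustrated with a toy in-memory model. This sketch is not CanStoreX: the `Node` class, the `pages` dictionary standing in for disk pages, and the assumption that the root of a referenced page is an f-node are all hypothetical and chosen only to show the traversal pattern.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Node:
    tag: str
    children: list[Node] = field(default_factory=list)
    page_id: Optional[int] = None  # set only on c-nodes (hypothetical field)


def logical_children(node: Node, pages: dict[int, Node]) -> Iterator[Node]:
    """Yield the children a DOM client would see, hiding c-nodes and f-nodes."""
    for child in node.children:
        if child.tag == "c-node":
            # Follow the page ID; the page root is assumed to be an f-node
            # grouping the children that were moved to that page.
            f_node = pages[child.page_id]
            yield from logical_children(f_node, pages)
        else:
            yield child
```

In this model a page lookup is a dictionary access; in a real storage manager it would be a buffer-manager request, which is where the disk block requests counted in the benchmark would arise.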

7. Performance evaluation

7.1. Test data configuration

Although there are many synthetic XML document generators, it is hard to generate XML data that has temporal features. Therefore, we define our own test data generation scheme, which has the following characteristics: For all queries, each tuple should have a valid domain such that a variety of Boolean or domain expressions are satisfied. Each employee tuple has salary and department history information spanning more than 30 years. Salary increases by $100 every year, and department information is updated every five years.
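The two representations of such a history can be sketched as follows. This is an illustration only and does not reproduce the authors' generator or its tuple counts: the attribute names follow the Emp relation used in the queries, while the starting salary, department names, year range, and the dict-of-intervals encoding of a temporal element-based tuple are assumptions made for the example.

```python
def interval_tuples(name, start_year, years, base_salary, depts):
    """One interval-stamped tuple per year: salary +100 yearly, department every 5 years."""
    rows = []
    for k in range(years):
        rows.append({
            "Name": name,
            "Salary": base_salary + 100 * k,
            "DName": depts[(k // 5) % len(depts)],
            "Start": start_year + k,
            "End": start_year + k + 1,   # half-open interval [Start, End)
        })
    return rows


def to_temporal_element_tuple(rows):
    """Fold one employee's interval tuples into a single tuple whose attribute
    values carry their own timestamps (value -> list of coalesced intervals)."""
    result = {"Name": rows[0]["Name"], "Salary": {}, "DName": {}}
    for row in rows:  # rows are assumed to be in chronological order
        for attr in ("Salary", "DName"):
            intervals = result[attr].setdefault(row[attr], [])
            if intervals and intervals[-1][1] == row["Start"]:
                # Adjacent period with the same value: extend the last interval.
                intervals[-1] = (intervals[-1][0], row["End"])
            else:
                intervals.append((row["Start"], row["End"]))
    return result
```

With `interval_tuples("Ann", 1970, 30, 30000, ["Toys", "Shoes"])`, the interval-based form holds 30 rows for one employee, whereas the folded form is a single tuple in which `DName["Toys"]` is a union of three intervals.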



Fig. 10. XML document and paginated XML document (Noh and Gadia, 2005a, 2006). (a) An XML document and (b) paginated XML document.

Sixty tuples in the interval-based data model are mapped to a single tuple in the temporal element-based data model, so the 372,385 employees yield 22,343,100 interval-based tuples. Each database (interval-based and temporal element-based) contains information on 372,385 employees, which is approximately 1 GB. Table 1 shows the data used in our benchmark.

Table 1
Data setup information

             Interval-based    Temporal element-based
Employees    372,385           372,385
Tuples       22,343,100        372,385
Pages        259,814           282,407
DB size      1,039,256 KB      1,129,628 KB

* DB size includes dept relation and catalog.
* Page size: 4 KB. Header size: 20 bytes. Page utilization: 70%.

7.2. Special treatment for self-join

In conventional databases, joins are important but are among the most expensive operations. Joins are an even more serious concern in temporal databases because temporally varying data dramatically increases the size of a database, depending on time duration as well as time granularity. Some methodologies for temporal join operations can be found in the literature. Gao et al. (2005) introduced and summarized join operations in temporal databases, including nested-loop join, sort-merge join, and partition-based join. These algorithms are all based on relation scans. However, previous work in the temporal database literature concentrates on join operations over heterogeneous relations rather than over identical relations, which we call self-joins.

Soo et al. (1994) introduced a linear-order valid-time natural join algorithm called the partition-based algorithm. This algorithm partitions relations $r$ and $s$ into $n$ partitions. The join $r \bowtie s$ is computed by unioning the joins $r_i \bowtie s_i$, where $1 \le i \le n$. However, this algorithm cannot guarantee linear-order time complexity if $r$ contains many tuples with long intervals. For example, if $|\mathit{buffer}| - 2 < |r_i|$, the entire partition $r_i$ cannot be loaded into the buffer. Even if $r_i$ fits in the buffer, the partition-based algorithm needs $n$ scans of $r_i$ for an $n$-way self-join. If the partitions are small, scanning multiple times may not be problematic. However, temporal data accumulates over time, so the partition size can easily exceed the buffer size. In this subsection, we introduce a self-join algorithm that is unaffected by partition sizes and is implemented as a single scan rather than as a union of multiple scans.



As we discussed in Section 5.2, the interval-based data model frequently introduces self-join operations, increasing query complexity. Since join operations are expensive, introducing multiple joins in a query is critical to system performance. It seems obvious that the system performance of the interval-based data model may be affected by self-join operations. However, a self-join operation can be implemented as a relation scan by means of a special treatment. The special treatment for self-join is based on the following observations (Noh and Gadia, 2005b):

(1) We can construct a relation by clustering tuples that represent an object.
(2) In a self-join, there is only a one-to-one object mapping. Therefore, an object $o_i$ is not related to an object $o_j$ if $i \ne j$.

These observations provide clues for implementing a self-join operation without joining multiple identical relations. We create n pointers to implement an n-way self-join. These pointers move forward, pointing to tuples in a buffer. The treatment constructs a condition table to check Boolean conditions. As an illustration, revisit ISQL Query 10 discussed in Section 5.2. In the query, the Boolean expression in the WHERE clause is as follows:

    E1.Name = E2.Name
    AND E2.Name = E3.Name
    AND E1.DName = 'Hardware'
    AND E2.DName = 'Software'
    AND E3.End = NOW

Fig. 11 illustrates how the special treatment avoids multiple scans for the 3-way self-join operation. Each page in the Emp relation is allocated to a frame in a buffer pool when it is requested. Since it is a 3-way join, there are three pointers that point to tuples in the buffer. Each pointer moves forward, and the condition table checks the Boolean conditions related to the pointer. Note that each pointer stops if its corresponding Boolean conditions are satisfied. Whenever the Boolean value of a predicate in the condition table changes, the algorithm checks whether all Boolean conditions in the table have been set to true. If so, tuples for the object are returned in streaming fashion. By using this special treatment for self-join, we can avoid multiple ordinary joins and reduce them to simple relation scans, significantly improving system performance for the interval-based temporal data model.
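A simplified sketch of this single-scan idea is given below. It is not the algorithm of Noh and Gadia (2005b): instead of n explicit pointers over buffer frames, it uses `itertools.groupby` over tuples that are assumed to be clustered by `Name`, and it keeps one condition-table entry per predicate of the Query 10 WHERE clause. The function name, the dictionary layout, and the sample predicates are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

NOW = "NOW"  # placeholder for the current-time marker


def single_scan_self_join(emp, conditions):
    """Scan a Name-clustered relation once; yield names whose tuples
    jointly satisfy every predicate in the condition table."""
    for name, cluster in groupby(emp, key=itemgetter("Name")):
        table = {label: False for label in conditions}   # condition table
        for tup in cluster:
            for label, predicate in conditions.items():
                if not table[label] and predicate(tup):
                    table[label] = True                  # this predicate is settled
        if all(table.values()):
            yield name


# Predicates corresponding to E1, E2, and E3 in the WHERE clause above.
QUERY_10 = {
    "E1": lambda t: t["DName"] == "Hardware",
    "E2": lambda t: t["DName"] == "Software",
    "E3": lambda t: t["End"] == NOW,
}
```

Because the name-equality conditions are implied by the clustering, only the three remaining predicates need entries in the table, and the relation is read exactly once regardless of how many aliases the query uses.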

Fig. 11. An example of special treatment for 3-way self-join.

7.3. Experimental results

Table 2 shows the disk block requests and actual block accesses measured from the two temporal database systems. From Table 2, we can note that for simple relation scans (e.g., Q1) the interval-based data model made about 8% fewer disk requests and actual disk accesses than the parametric data model. Because this advantage is marginal, the performance of the two systems is comparable. If a query contains self-join operations, the benchmarks show that the interval-based model needs more than 900 times as many block requests and more than 800 times as many actual accesses as the parametric data model for Q3 and Q9. The gap is far larger still for Q10. However, if the special algorithm for self-joins is used, the block requests and accesses of the self-joins drop to roughly those of a relation scan, showing performance similar to the temporal element-based data model.

It should be emphasized that for Q7 and Q8 the TE-based system shows much better performance than the I-based system. For such queries, ParaSQL separates inner and outer queries and executes them independently, so that performance is similar to a simple relation scan. ISQL, however, requires a join of two relations and therefore needs more disk accesses.

Table 2
Experimental results

         Interval-based data model          Parametric data model
Query    Request           Access           Request      Access
Q1       519,608           261,143          564,815      283,441
Q2       519,608           261,143          564,815      283,441
Q3       527,402,492       229,311,284      564,815      283,441
Q3a      1,072,147         261,143          -            -
Q4       519,608           261,143          564,815      283,441
Q5       519,608           261,143          564,815      283,441
Q6       850,526           427,485          1,319,019    662,478
Q7       850,526           427,485          564,816      283,442
Q8       1,370,134         602,329          564,816      283,442
Q9       527,402,492       229,311,284      564,815      283,441
Q9a      1,072,147         261,143          -            -
Q10      31,617,544,810    15,808,772,405   564,815      283,441
Q10a     1,072,147         261,143          -            -

a The special treatment for self-join is applied.
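The comparisons drawn from Table 2 can be recomputed directly from its figures. The short sketch below copies the Q1, Q3, and Q3a entries from the table; the helper names are ours and purely illustrative.

```python
# (block requests, actual accesses) copied from Table 2
INTERVAL = {
    "Q1": (519_608, 261_143),
    "Q3": (527_402_492, 229_311_284),
    "Q3a": (1_072_147, 261_143),   # special treatment for self-join applied
}
PARAMETRIC = {
    "Q1": (564_815, 283_441),
    "Q3": (564_815, 283_441),
}


def percent_fewer(smaller: int, larger: int) -> float:
    """How many percent fewer `smaller` is, relative to `larger`."""
    return 100.0 * (larger - smaller) / larger


def times_as_many(larger: int, smaller: int) -> float:
    """Ratio of `larger` to `smaller`."""
    return larger / smaller


if __name__ == "__main__":
    for kind, idx in (("requests", 0), ("accesses", 1)):
        scan = percent_fewer(INTERVAL["Q1"][idx], PARAMETRIC["Q1"][idx])
        join = times_as_many(INTERVAL["Q3"][idx], PARAMETRIC["Q3"][idx])
        print(f"{kind}: Q1 {scan:.1f}% fewer, Q3 {join:.0f} times as many")
```

With these inputs the relation-scan gap for Q1 comes out at about 8% for both measures, and the Q3 self-join ratios come out as multiples of roughly 934 (requests) and 809 (accesses). With the special treatment, the Q3a access count equals that of the Q1 relation scan.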

8. Conclusion

Because it satisfies the closure property, the temporal element-based data model reduces query complexity at the user level: it avoids invoking self-joins to gather information about an object. The temporal element-based data model also allows more versatile expressions in the RESTRICTED TO clause. As we discussed, for some queries ISQL required self-joins while ParaSQL needed only simple relation scans. We have shown that self-joins are inevitable in ISQL if queries have Boolean expressions containing multiple conjunctions. Such conjunctions frequently appear in temporal queries. The ParaSQL query for Query 8 can be expressed without significant changes from Query 7 (adding a negation was enough), while in ISQL it requires the union of two select statements.

In the disk block request and access tests, the two data models showed similar performance, although the interval-based data model performed slightly better than the parametric data model for relation scans. To achieve similar performance, the special treatment for self-joins was required in the interval-based data model. Without the special treatment, the data model was unable to compensate for its disadvantages due to multi-way joins.

Our query suite is conservative in that it only hints at the difficulties to be faced by users of the interval-based approach. In practice, the situation is expected to be far more serious. For example, let us reconsider Query 10, where we were seeking employees who had worked in both the Software and Hardware departments. What if we wanted to ensure that such employees have had at least 10 years of cumulative experience in each of the respective departments? In ISQL, it becomes even more difficult to express this query because the domains to be considered collectively are scattered across multiple tuples. ISQL may require complex use of SQL-style aggregate operations.
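The cumulative-experience condition amounts to measuring the total length of a union of intervals. The following sketch shows that computation under simple assumptions: time is measured in whole years, intervals are half-open `(start, end)` pairs, and an employee's history is a hypothetical dict from department name to a list of such intervals. It is not ParaSQL or ISQL code.

```python
def coalesce(intervals):
    """Merge overlapping or adjacent half-open intervals into a disjoint union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def duration(intervals):
    """Total length of the union of the given intervals."""
    return sum(end - start for start, end in coalesce(intervals))


def qualifies(history, departments, min_years):
    """True if the employee spent at least min_years in every listed department."""
    return all(duration(history.get(d, [])) >= min_years for d in departments)
```

For instance, `duration([(1980, 1986), (1990, 1995)])` is 11, so two separate stints in the same department count together. When the periods are already held as one temporal element, this is a single measurement; when they are scattered across interval-stamped tuples, they must first be gathered per employee.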
Aggregate operations are also available in ParaSQL, but this query can be expressed without invoking aggregates as follows:

    |⟦E.DName = Software⟧| ≥ 10 AND |⟦E.DName = Hardware⟧| ≥ 10

It should be emphasized that the temporal element-based data model extends seamlessly to spatiotemporal databases. In spatial databases, the luxury of using intervals is not even available; spatial domains are far more complex. Our query suite can be readily extended to spatiotemporal data with the advantage that the user complexity of queries will remain similar. For example, crop production depends on extended periods of hot days. One may like to query for such periods in different counties in a state. So a county may be required to satisfy a where clause such as |⟦Temp ≥ 70 degrees⟧| > 20 days. This is a simplified version of queries that in practice can be considerably more difficult.

One may raise a question about the usability of native XML database systems for temporal data because they also use XML-based storage. To answer this question, we compared the XML-based parametric temporal database system with native XML database systems. Our experiments (Noh and Gadia, 2006) showed that the XML-based parametric temporal database system is more efficient and makes temporal queries easier to express.

In this paper, we have investigated the ease of use as well as the system performance of the two temporal data models. We hope that our findings help settle a debate, ongoing since the mid-1980s, about which data model is most appropriate for temporal databases.

References

Böhlen, M.H., 2004. Toward a unifying view of point and interval temporal data model. In: Proceedings of the 11th International Symposium on Temporal Representation and Reasoning, Tatihou Island, Normandie, France, pp. 3–4.
Böhlen, M.H., Busatto, R., Jensen, C.S., 1998. Point-versus interval-based temporal data models. In: Proceedings of the Fourteenth International Conference on Data Engineering, pp. 192–200.
Gadia, S.K., 1988. A homogeneous relational model and query languages for temporal databases. ACM Transactions on Database Systems 13 (4), 418–448.
Gadia, S.K., Chopra, V., 1993. A relational model and SQL-like query language for spatial databases. In: Advanced Database Systems. Lecture Notes in Computer Science, vol. 759. Springer, pp. 213–225.
Gadia, S.K., Vaishnav, J.H., 1985. A query language for a homogeneous temporal database. In: Proceedings of the 4th ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, Portland, Oregon, USA, pp. 51–56.
Gadia, S.K., Chopra, V., Tim, U.S., 1993. An SQL-like seamless query language for spatio-temporal data. In: Proceedings of the International Workshop on an Infrastructure for Temporal Databases, Arlington, Texas, USA, pp. Q1–Q20.
Gao, D., Jensen, C.S., Snodgrass, R.T., Soo, M.D., 2005. Join operations in temporal databases. VLDB Journal 14 (1), 2–29.
Hubler, P., Edelweiss, N., 2000. Implementing a temporal database on top of a conventional database: mapping of the data model and data definition management. In: Proceedings of XVth Brazilian Symposium on Databases, Joao Pessoa, Paraiba, Brazil.
Jensen, C.S. et al., 1993. The TSQL benchmark. In: Proceedings of the International Workshop on an Infrastructure for Temporal Databases, Arlington, Texas, USA, pp. QQ1–QQ28.
Ma, S., 2004. Implementation of a Canonical Native Storage for XML, Master’s thesis, Department of Computer Science, Iowa State University, Ames, Iowa (December).
Navathe, S.B., Ahmed, R., 1993. Temporal extensions to the relational model and SQL. In: Temporal Databases: Theory, Design, and Implementation. Benjamin/Cummings, pp. 92–109.
Noh, S.-Y., Gadia, S.K., 2004. A parametric framework for implementing spatiotemporal databases. In: Proceedings of the 2nd International Conference on Computer Science and its Applications, San Diego, California, USA, pp. 312–319.
Noh, S.-Y., Gadia, S.K., 2005a. An XML-based framework for temporal database implementation. In: Proceedings of the 12th International Symposium on Temporal Representation and Reasoning, Burlington, Vermont, USA, pp. 180–182.


Noh, S.-Y., Gadia, S.K., 2005b. Efficient self-join algorithm in interval-based temporal data models. Tech. Rep. 05-22, Department of Computer Science, Iowa State University.
Noh, S.-Y., Gadia, S.K., 2006. A comparison of two approaches to utilizing XML in parametric databases for temporal data. Information & Software Technology 48 (9), 807–819.
Noh, S.-Y., Gadia, S.K., Ma, S., in press. An XML-based methodology for parametric temporal database model implementation. Journal of Systems and Software.
Snodgrass, R.T., 1987. The temporal query language TQuel. ACM Transactions on Database Systems 12 (2), 247–298.
Soo, M.D., Snodgrass, R.T., Jensen, C.S., 1994. Efficient evaluation of the valid-time natural join. In: Proceedings of the 10th International Conference on Data Engineering, Houston, Texas, USA, pp. 282–292.
Tansel, A.U., Clifford, J., Gadia, S.K., Segev, A., Snodgrass, R.T., 1993. A glossary of temporal database concepts. In: Temporal Databases: Theory, Design, and Implementation. Benjamin/Cummings, pp. 92–109.
Toman, D., 1996. Point vs. interval-based query languages for temporal databases. In: Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 58–67.


Toman, D., 1997. Point-based temporal extension of temporal SQL. In: Proceedings of the 5th International Conference on Deductive and Object-Oriented Databases, pp. 103–121.
W3C, 2007. Document object model. <http://www.w3.org/DOM> (August).
Wang, F., Zaniolo, C., 2004. XBit: An XML-based bitemporal data model. In: Proceedings of the 23rd International Conference on Conceptual Modeling, Shanghai, China, pp. 810–824.

Seo-Young Noh received his Ph.D. degree in Computer Science from Iowa State University, USA, in 2006. He is currently a senior research engineer at the LG Electronics Advanced Research Institute. His research interests include spatiotemporal databases, XML, database system implementation, embedded database systems, and natural language processing in database systems.

Shashi K. Gadia received the B.S. and M.S. degrees in mathematics from the Birla Institute of Technology and Science, Pilani, in 1970 and 1971, respectively, the Ph.D. degree in mathematics from the University of Illinois at Urbana-Champaign in 1978, and the M.S. degree in computer science from Ohio State University in 1980. He is currently an Associate Professor at Iowa State University at Ames. His interests include temporal, spatiotemporal, security, and XML databases, and access methods.

