CSC553 Lab 3: SimpleDB Extended Operators and Efficiency

In this lab assignment, you will write the code for the aggregate operator. This along with other operators will complete the needed SQL operators and provide you with a database system that can perform simple queries over multiple tables. We will then test the system on larger amounts of data, which may require you to iterate and improve the algorithms you chose to implement in the first place.

1. Getting started

You will need to add these new files to your release. The easiest way to do this is to untar the new code in the same directory as your top-level simpledb directory, as follows:

1.3. Implementation hints

As before, we strongly encourage you to read through this entire document to get a feel for the high-level design of SimpleDB before you write code.

As before, we will grade your assignment by looking at your code and verifying that you have passed the test for the ant targets test and systemtest. See Section 3.3 for a complete discussion of grading and list of the tests you will need to pass.

Here's a rough outline of one way you might proceed with your SimpleDB implementation; more details on the steps in this outline, including exercises, are given in Section 2 below.

2. SimpleDB Architecture and Implementation Guide

2.1. Aggregates

In order to calculate aggregates, we use an Aggregator interface which merges a new tuple into the existing calculation of an aggregate. The Aggregator is told during construction what operation it should use for aggregation. Subsequently, the client code should call Aggregator.mergeTupleIntoGroup() for every tuple in the child iterator. After all tuples have been merged, the client can retrieve a DbIterator of aggregation results. Each tuple in the result is a pair of the form (groupValue, aggregateValue), unless the value of the group by field was Aggregator.NO_GROUPING, in which case the result is a single tuple of the form (aggregateValue).

Note that this implementation requires space linear in the number of distinct groups. For the purposes of this lab, you do not need to worry about the situation where the number of groups exceeds available memory.

Exercise 1. Implement the skeleton methods in:

src/simpledb/IntegerAggregator.java
src/simpledb/StringAggregator.java
src/simpledb/Aggregate.java

At this point, your code should pass the unit tests IntegerAggregatorTest, StringAggregatorTest, and AggregateTest. Furthermore, you should be able to pass the AggregateTest system test.

2.2. Query Parser and Efficiency

The first step is to create some data tables and a catalog. Suppose you have a file data.txt with the following contents:

Finally, invoke the parser. You must run java from the command line (ant doesn't work properly with interactive targets.) From the simpledb/ directory, type:

Exercise 2: Please execute the three queries below using your SimpleDB prototype and report the times in your lab write-up.

Contest (Optional)

We have built a SimpleDB-encoded version of the DBLP database; the needed files are located at: https://dice.cs.depaul.edu/courses/553/labs/lab3/dblp_data.tar.gz

You should download the file and unpack it. It will create four files in the dblp_data directory. Move them into the simpledb directory. The following commands will acomplish this, if you run them from the simpledb directory:

wget http://dice.cs.depaul.edu/courses/553/labs/lab3/dblp_data.tar.gz
tar xvzf dblp_data.tar.gz
mv dblp_data/* .
rm -r dblp_data.tar.gz dblp_data

You can then run the parser with the following command. Make sure to first compile (i.e., run ant dist):

java -jar dist/simpledb.jar parser dblp_simpledb.schema

We will start a thread on the course message board inviting anyone interested to post their runtimes for the following three queries (please run the queries on a lab machine and indicate which one you used so it becomes easier to compare runtimes). The contest is just for fun. It will not affect your grade:

SELECT p.title
FROM papers p
WHERE p.title LIKE 'selectivity';

SELECT p.title, v.name
FROM papers p, authors a, paperauths pa, venues v
WHERE a.name = 'E. F. Codd'
AND pa.authorid = a.id
AND pa.paperid = p.id
AND p.venueid = v.id;

SELECT a2.name, count(p.id)
FROM papers p, authors a1, authors a2, paperauths pa1, paperauths pa2
WHERE a1.name = 'Michael Stonebraker'
AND pa1.authorid = a1.id 
AND pa1.paperid = p.id 
AND pa2.authorid = a2.id 
AND pa1.paperid = pa2.paperid
GROUP BY a2.name
ORDER BY a2.name;

Note that each query will print out its runtime after it executes.

You may wish to create optimized implementations of some of the operators; in particular, a fast join operator (e.g., not nested loops) will be essential for good performance on queries 2 and 3.

There is currently no optimizer in the parser, so the queries above have been written to cause the parser to generate reasonable plans. Here are some helpful hints about how the parser works that you may wish to exploit while running these queries:

The table on the left side of the joins in these queries is passed in as the first DbIterator parameter to Join.
Expressions in the WHERE clause are added to the plan from top to bottom, such that first expression will be the bottom-most operator in the generated query plan. For example, the generated plan for Query 2 is:
```
Project(Join(Join(Filter(a),pa),p))
```

Our reference implementation can run Query 1 in about .35 seconds, Query 2 in about 10.5 seconds, and Query 3 in about 40 seconds. We implemented a special-purpose join operator for equality joins but did little else to optimize performance.

3. Logistics

3.1. Collaboration

3.2. Submitting your assignment

To submit your code, please create a CSC553-lab2.tar.gz and submit it to the class D2Lsubmissions folder. You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11pm on the due date). Please also submit your individual writeup as a PDF, plain text file (.txt), .doc, or .docx.

Make sure your code is packaged so the instructions outlined in section 3.3 work.

Please also submit your runtimes for the three queries in the contest. While the contest is optional, it is mandatory that your SimpleDB prototype be capable of executing these queries. You can also post on the class message boardif you feel you have run into a bug.

3.3 Grading

50% of your grade will be based on whether or not your code passes the system test suite we will run over it. These tests will be a superset of the tests we have provided. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both ant test and ant systemtest.

Important: before testing, we will replace your build.xml, HeapFileEncoder.java, and the entire contents of the test/ directory with our version of these files! This means you cannot change the format of .dat files! You should therefore be careful changing our APIs. This also means you need to test whether your code compiles with our test programs. In other words, we will untar your tarball, replace the files mentioned above, compile it, and then grade it. It will look roughly like this:

An additional 50% of your grade will be based on the quality of your writeup and our subjective evaluation of your code.