DEADLINES
Due: Sunday, May 15, 2022Version History:
In this lab assignment, you will write the code for the aggregate operator. This along with other operators will complete the needed SQL operators and provide you with a database system that can perform simple queries over multiple tables. We will then test the system on larger amounts of data, which may require you to iterate and improve the algorithms you chose to implement in the first place.
You will need to add these new files to your release. The easiest way to do this is to untar the new code in the same directory as your top-level simpledb directory, as follows:
$ cp -r CSC553-lab2 CSC553-lab3
$ wget http://dice.cs.depaul.edu/courses/553/labs/lab3/CSC553-lab3-supplement.tar.gz
tar -xvzf CSC553-lab3-supplement.tar.gz
As before, we strongly encourage you to read through this entire document to get a feel for the high-level design of SimpleDB before you write code.
As before, we will grade your assignment by looking at your code and verifying that you have passed the test for the ant targets test
and systemtest
. See Section 3.3 for a complete discussion of
grading and list of the tests you will need to pass.
Here's a rough outline of one way you might proceed with your SimpleDB implementation; more details on the steps in this outline, including exercises, are given in Section 2 below.
IntegerAggregator
and StringAggregator
. Here, you will write the
logic that actually computes an aggregate over a particular field across
multiple groups in a sequence of input tuples. Use integer division for
computing the average, since SimpleDB only supports integers. StringAggegator
only needs to support the COUNT aggregate, since the other operations do not
make sense for strings.
Aggregate
operator. As with other
operators, aggregates implement the DbIterator
interface
so that they can be placed in SimpleDB query plans. Note that the
output of an Aggregate
operator is an aggregate value of an
entire group for each call to next()
, and that the
aggregate constructor takes the aggregation and grouping fields.
At this point you should be able to pass all of the tests in the ant
systemtest
target, which is the goal of this lab.
GROUP BY
clause. You should implement the five SQL aggregates
(COUNT
, SUM
, AVG
, MIN
,
MAX
) and support grouping. You only need to support aggregates
over a single field, and grouping by a single field.
In order to calculate aggregates, we use an Aggregator
interface which merges a new tuple into the existing calculation of an
aggregate. The Aggregator
is told during construction what
operation it should use for aggregation. Subsequently, the client code
should call Aggregator.mergeTupleIntoGroup()
for every tuple in the child
iterator. After all tuples have been merged, the client can retrieve a
DbIterator of aggregation results. Each tuple in the result is a pair of
the form (groupValue, aggregateValue)
, unless the value
of the group by field was Aggregator.NO_GROUPING
, in which
case the result is a single tuple of the form (aggregateValue)
.
Note that this implementation requires space linear in the number of distinct groups. For the purposes of this lab, you do not need to worry about the situation where the number of groups exceeds available memory.
The first step is to create some data tables and a catalog. Suppose
you have a file data.txt
with the following contents:
1,10 2,20 3,30 4,40 5,50 5,50You can convert this into a SimpleDB table using the
convert
command (make sure to type ant first!):
java -jar dist/simpledb.jar convert data.txt 2 "int,int"This creates a file
data.dat
. In addition to the table's
raw data, the two additional parameters specify that each record has
two fields and that their types are int
and
int
.
Next, create a catalog file, catalog.txt
,
with the follow contents:
data (f1 int, f2 int)This tells SimpleDB that there is one table,
data
(stored in
data.dat
) with two integer fields named f1
and f2
.
Finally, invoke the parser.
You must run java from the
command line (ant doesn't work properly with interactive targets.)
From the simpledb/
directory, type:
java -jar dist/simpledb.jar parser catalog.txtYou should see output like:
Added table : data with schema INT(f1), INT(f2), SimpleDB>Finally, you can run a query:
SimpleDB> select d.f1, d.f2 from data d; Started a new transaction tid = 1221852405823 ADDING TABLE d(data) TO tableMap TABLE HAS tupleDesc INT(d.f1), INT(d.f2), 1 10 2 20 3 30 4 40 5 50 5 50 6 rows. ---------------- 0.16 seconds SimpleDB>The parser is relatively full featured (including support for SELECTs, INSERTs, DELETEs, and transactions), but does have some problems and does not necessarily report completely informative error messages. Here are some limitations to bear in mind:
Exercise 2: Please execute the three queries below using your SimpleDB prototype and report the times in your lab write-up.
Contest (Optional)
We have built a SimpleDB-encoded version of the DBLP database; the needed files are located at: https://dice.cs.depaul.edu/courses/553/labs/lab3/dblp_data.tar.gz
You should download the file and unpack it. It will create four files in the
dblp_data
directory. Move them into
the simpledb
directory. The following commands will acomplish
this, if you run them from the simpledb
directory:
wget http://dice.cs.depaul.edu/courses/553/labs/lab3/dblp_data.tar.gz tar xvzf dblp_data.tar.gz mv dblp_data/* . rm -r dblp_data.tar.gz dblp_data
You can then run the parser with the following command. Make sure to first compile (i.e., run ant dist):
java -jar dist/simpledb.jar parser dblp_simpledb.schema
We will start a thread on the course message board inviting anyone interested to post their runtimes for the following three queries (please run the queries on a lab machine and indicate which one you used so it becomes easier to compare runtimes). The contest is just for fun. It will not affect your grade:
SELECT p.title FROM papers p WHERE p.title LIKE 'selectivity';
SELECT p.title, v.name FROM papers p, authors a, paperauths pa, venues v WHERE a.name = 'E. F. Codd' AND pa.authorid = a.id AND pa.paperid = p.id AND p.venueid = v.id;
SELECT a2.name, count(p.id) FROM papers p, authors a1, authors a2, paperauths pa1, paperauths pa2 WHERE a1.name = 'Michael Stonebraker' AND pa1.authorid = a1.id AND pa1.paperid = p.id AND pa2.authorid = a2.id AND pa1.paperid = pa2.paperid GROUP BY a2.name ORDER BY a2.name;
Note that each query will print out its runtime after it executes.
You may wish to create optimized implementations of some of the operators; in particular, a fast join operator (e.g., not nested loops) will be essential for good performance on queries 2 and 3.
There is currently no optimizer in the parser, so the queries above have been written to cause the parser to generate reasonable plans. Here are some helpful hints about how the parser works that you may wish to exploit while running these queries:
DbIterator
parameter to Join
.
Project(Join(Join(Filter(a),pa),p))
Our reference implementation can run Query 1 in about .35 seconds, Query 2 in about 10.5 seconds, and Query 3 in about 40 seconds. We implemented a special-purpose join operator for equality joins but did little else to optimize performance.
To submit your code, please create a CSC553-lab2.tar.gz and submit it to the class D2Lsubmissions folder. You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11pm on the due date). Please also submit your individual writeup as a PDF, plain text file (.txt), .doc, or .docx.
Make sure your code is packaged so the instructions outlined in section 3.3 work.
Please also submit your runtimes for the three queries in the contest. While the contest is optional, it is mandatory that your SimpleDB prototype be capable of executing these queries. You can also post on the class message boardif you feel you have run into a bug.
50% of your grade will be based on whether or not your code passes the system test suite we will run over it. These tests will be a superset of the tests we have provided. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both ant test and ant systemtest.
Important: before testing, we will replace your build.xml, HeapFileEncoder.java, and the entire contents of the test/ directory with our version of these files! This means you cannot change the format of .dat files! You should therefore be careful changing our APIs. This also means you need to test whether your code compiles with our test programs. In other words, we will untar your tarball, replace the files mentioned above, compile it, and then grade it. It will look roughly like this:
$ gunzip CSC553-lab3.tar.gz $ tar xvf CSC553-lab3.tar $ cd ./CSC553-lab3 [replace build.xml, HeapFileEncoder.java, and test] $ ant test $ ant systemtest [additional tests]If any of these commands fail, we'll be unhappy, and, therefore, so will your grade.
An additional 50% of your grade will be based on the quality of your writeup and our subjective evaluation of your code.