CSC553 Lab 3: SimpleDB Extended Operators and Efficiency

DEADLINES

Due: Sunday, May 15, 2022

Version History:

 

Lab Description

In this lab assignment, you will write the code for the aggregate operator. This along with other operators will complete the needed SQL operators and provide you with a database system that can perform simple queries over multiple tables. We will then test the system on larger amounts of data, which may require you to iterate and improve the algorithms you chose to implement in the first place.

1. Getting started

You should begin with the code you submitted for Lab 2. If you did not submit code for Lab 2, or your solution didn't work properly then you skip this lab completly and return your focus on Lab 2.

You will need to add these new files to your release. The easiest way to do this is to untar the new code in the same directory as your top-level simpledb directory, as follows:

1.3. Implementation hints

As before, we strongly encourage you to read through this entire document to get a feel for the high-level design of SimpleDB before you write code.

As before, we will grade your assignment by looking at your code and verifying that you have passed the test for the ant targets test and systemtest. See Section 3.3 for a complete discussion of grading and list of the tests you will need to pass.

Here's a rough outline of one way you might proceed with your SimpleDB implementation; more details on the steps in this outline, including exercises, are given in Section 2 below.

2. SimpleDB Architecture and Implementation Guide

2.1. Aggregates

An additional SimpleDB operator implements basic SQL aggregates with a GROUP BY clause. You should implement the five SQL aggregates (COUNT, SUM, AVG, MIN, MAX) and support grouping. You only need to support aggregates over a single field, and grouping by a single field.

In order to calculate aggregates, we use an Aggregator interface which merges a new tuple into the existing calculation of an aggregate. The Aggregator is told during construction what operation it should use for aggregation. Subsequently, the client code should call Aggregator.mergeTupleIntoGroup() for every tuple in the child iterator. After all tuples have been merged, the client can retrieve a DbIterator of aggregation results. Each tuple in the result is a pair of the form (groupValue, aggregateValue), unless the value of the group by field was Aggregator.NO_GROUPING, in which case the result is a single tuple of the form (aggregateValue).

Note that this implementation requires space linear in the number of distinct groups. For the purposes of this lab, you do not need to worry about the situation where the number of groups exceeds available memory.

Exercise 1. Implement the skeleton methods in: At this point, your code should pass the unit tests IntegerAggregatorTest, StringAggregatorTest, and AggregateTest. Furthermore, you should be able to pass the AggregateTest system test.

2.2. Query Parser and Efficiency

We've provided you with a query parser for SimpleDB that you can use to write and run SQL queries against your database once you have completed the implementation of all your operators in this lab.

The first step is to create some data tables and a catalog. Suppose you have a file data.txt with the following contents:

1,10
2,20
3,30
4,40
5,50
5,50
You can convert this into a SimpleDB table using the convert command (make sure to type ant first!):
java -jar dist/simpledb.jar convert data.txt 2 "int,int"
This creates a file data.dat. In addition to the table's raw data, the two additional parameters specify that each record has two fields and that their types are int and int.

Next, create a catalog file, catalog.txt, with the follow contents:

data (f1 int, f2 int)
This tells SimpleDB that there is one table, data (stored in data.dat) with two integer fields named f1 and f2.

Finally, invoke the parser. You must run java from the command line (ant doesn't work properly with interactive targets.) From the simpledb/ directory, type:

java -jar dist/simpledb.jar parser catalog.txt
You should see output like:
Added table : data with schema INT(f1), INT(f2), 
SimpleDB> 
Finally, you can run a query:
SimpleDB> select d.f1, d.f2 from data d;
Started a new transaction tid = 1221852405823
 ADDING TABLE d(data) TO tableMap
     TABLE HAS  tupleDesc INT(d.f1), INT(d.f2), 
1       10
2       20
3       30
4       40
5       50
5       50

 6 rows.
----------------
0.16 seconds

SimpleDB> 
The parser is relatively full featured (including support for SELECTs, INSERTs, DELETEs, and transactions), but does have some problems and does not necessarily report completely informative error messages. Here are some limitations to bear in mind:

Exercise 2: Please execute the three queries below using your SimpleDB prototype and report the times in your lab write-up.

Contest (Optional)

We have built a SimpleDB-encoded version of the DBLP database; the needed files are located at: https://dice.cs.depaul.edu/courses/553/labs/lab3/dblp_data.tar.gz

You should download the file and unpack it. It will create four files in the dblp_data directory. Move them into the simpledb directory. The following commands will acomplish this, if you run them from the simpledb directory:

wget http://dice.cs.depaul.edu/courses/553/labs/lab3/dblp_data.tar.gz
tar xvzf dblp_data.tar.gz
mv dblp_data/* .
rm -r dblp_data.tar.gz dblp_data

You can then run the parser with the following command. Make sure to first compile (i.e., run ant dist):

java -jar dist/simpledb.jar parser dblp_simpledb.schema

We will start a thread on the course message board inviting anyone interested to post their runtimes for the following three queries (please run the queries on a lab machine and indicate which one you used so it becomes easier to compare runtimes). The contest is just for fun. It will not affect your grade:

  1. SELECT p.title
    FROM papers p
    WHERE p.title LIKE 'selectivity';
    
  2. SELECT p.title, v.name
    FROM papers p, authors a, paperauths pa, venues v
    WHERE a.name = 'E. F. Codd'
    AND pa.authorid = a.id
    AND pa.paperid = p.id
    AND p.venueid = v.id;
     
  3. SELECT a2.name, count(p.id)
    FROM papers p, authors a1, authors a2, paperauths pa1, paperauths pa2
    WHERE a1.name = 'Michael Stonebraker'
    AND pa1.authorid = a1.id 
    AND pa1.paperid = p.id 
    AND pa2.authorid = a2.id 
    AND pa1.paperid = pa2.paperid
    GROUP BY a2.name
    ORDER BY a2.name;
     
    

Note that each query will print out its runtime after it executes.

You may wish to create optimized implementations of some of the operators; in particular, a fast join operator (e.g., not nested loops) will be essential for good performance on queries 2 and 3.

There is currently no optimizer in the parser, so the queries above have been written to cause the parser to generate reasonable plans. Here are some helpful hints about how the parser works that you may wish to exploit while running these queries:

Our reference implementation can run Query 1 in about .35 seconds, Query 2 in about 10.5 seconds, and Query 3 in about 40 seconds. We implemented a special-purpose join operator for equality joins but did little else to optimize performance.

3. Logistics

You must submit your code (see below) as well as a short (2 pages, maximum) writeup describing your approach. This writeup should:

3.1. Collaboration

All CSC553 labs are to be completed INDIVIDUALLY! However, you may discuss your high-level approach to solving each lab with other students in the class.

3.2. Submitting your assignment

To submit your code, please create a CSC553-lab2.tar.gz and submit it to the class D2Lsubmissions folder. You may submit your code multiple times; we will use the latest version you submit that arrives before the deadline (before 11pm on the due date). Please also submit your individual writeup as a PDF, plain text file (.txt), .doc, or .docx.

Make sure your code is packaged so the instructions outlined in section 3.3 work.

Please also submit your runtimes for the three queries in the contest. While the contest is optional, it is mandatory that your SimpleDB prototype be capable of executing these queries. You can also post on the class message boardif you feel you have run into a bug.

3.3 Grading

50% of your grade will be based on whether or not your code passes the system test suite we will run over it. These tests will be a superset of the tests we have provided. Before handing in your code, you should make sure it produces no errors (passes all of the tests) from both ant test and ant systemtest.

Important: before testing, we will replace your build.xml, HeapFileEncoder.java, and the entire contents of the test/ directory with our version of these files! This means you cannot change the format of .dat files! You should therefore be careful changing our APIs. This also means you need to test whether your code compiles with our test programs. In other words, we will untar your tarball, replace the files mentioned above, compile it, and then grade it. It will look roughly like this:

$ gunzip CSC553-lab3.tar.gz
$ tar xvf CSC553-lab3.tar
$ cd ./CSC553-lab3
[replace build.xml, HeapFileEncoder.java, and test]
$ ant test
$ ant systemtest
[additional tests]
If any of these commands fail, we'll be unhappy, and, therefore, so will your grade.

An additional 50% of your grade will be based on the quality of your writeup and our subjective evaluation of your code.