Asaf Meisels
Winter ’09 – Spring ’09

SENIOR PROJECT PROPOSAL

Problem

Databases are a vital way of storing large amounts of information and of viewing that information in whatever way we specify. Probabilistic databases are a special type of database that provides operations needed for probabilistic data that do not exist in regular databases. Probabilistic databases have been a hot topic in computer science lately, but there is still research to be done. Several implementations exist, each with its own advantages and disadvantages.

A Semi-structured Probabilistic Database allows data to be stored in the database without having to follow a definite structure. The database is made up of Semi-structured Probabilistic Objects (SPOs). Each SPO contains four parts: the context, the variables, the probability table, and the conditionals. Beyond those four parts, SPOs may contain completely different fields.

Evan Rosson’s XML implementation is claimed to use memory more efficiently and to execute some operations faster. To validate that this new implementation is in fact more efficient, a Test Suite needs to be created that shows whether it is genuinely faster.

Solution

A series of tests will need to be run on both implementations, with the database filled with “semi-random” data, and the results compared to determine whether the new implementation is more efficient than the existing system design. The information stored in the database for the test runs has to be generated randomly so that the results represent more than just a specific case. However, the data will have to be somewhat structured so that we can test each implementation under certain conditions. Because there are infinitely many possibilities for the data stored in the database, the Test Suite will need to be thorough: I will need to test each operation several times with different test cases to ensure that the results are genuine.
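
To make the idea of “semi-random” data concrete, the following is a minimal Python sketch of how one SPO might be generated: the four parts are always present (the structured side), while the values and probabilities are drawn from a seeded random generator (the random side). The element names (spo, item, row, given), the function name generate_spo, and its parameters are assumptions made for illustration only; they do not come from the existing system or from the XML implementation under test.

```python
import itertools
import random
import xml.etree.ElementTree as ET


def generate_spo(rng, num_variables=2, domain_size=2,
                 num_context=1, num_conditionals=1):
    """Build one illustrative SPO element with the four parts of an SPO.

    All element and attribute names here are hypothetical.
    """
    spo = ET.Element("spo")

    # Context: modeled here as name/value pairs (an assumption).
    context = ET.SubElement(spo, "context")
    for i in range(num_context):
        ET.SubElement(context, "item", name=f"c{i}",
                      value=str(rng.randint(0, 9)))

    # Variables: the random variables the probability table ranges over.
    names = [f"v{i}" for i in range(num_variables)]
    variables = ET.SubElement(spo, "variables")
    for name in names:
        ET.SubElement(variables, "variable", name=name)

    # Probability table: one row per joint assignment of the variables,
    # with random weights normalized so the probabilities sum to 1.
    domain = [f"d{j}" for j in range(domain_size)]
    assignments = list(itertools.product(domain, repeat=num_variables))
    weights = [rng.random() + 1e-9 for _ in assignments]
    total = sum(weights)
    table = ET.SubElement(spo, "table")
    for assignment, weight in zip(assignments, weights):
        row = ET.SubElement(table, "row", p=repr(weight / total))
        for name, value in zip(names, assignment):
            ET.SubElement(row, "value", variable=name).text = value

    # Conditionals: modeled here as name/value pairs (an assumption).
    conditionals = ET.SubElement(spo, "conditionals")
    for i in range(num_conditionals):
        ET.SubElement(conditionals, "given", name=f"k{i}",
                      value=rng.choice(domain))
    return spo


if __name__ == "__main__":
    # A fixed seed makes a test run repeatable across both implementations.
    rng = random.Random(2009)
    print(ET.tostring(generate_spo(rng), encoding="unicode"))
```

Passing the random generator in as a parameter, rather than using the module-level one, means the same seed reproduces the same data set, so both implementations can be tested on identical input.
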
The test cases will be designed to be as efficient as possible, covering every case of interest without overlapping one another.

Schedule

This project will take me two quarters to complete. For the first half of Winter ’09, I will do background research to fully understand the requirements I need to fulfill. The remainder of the quarter will be spent designing the XML Data Generator and the Test Suite. All of Spring ’09 will be spent on implementation: the Test Suite, so that it provides valuable data for measuring performance, and the XML Data Generator, so that it creates data based on user-defined input.

Meeting Minimum Criteria

Independence

All of the work to create the XML Data Generator and the Test Suite will be done on my own.

Background Research

This project requires me to study Semi-structured Probabilistic Databases and how to manage sets of SPOs (Semi-structured Probabilistic Objects) in an XML database. I will also need to research the topic of testing databases to ensure that my Test Suite will actually provide useful data.

Creativity

My XML Data Generator will need to be creative so that the data it generates for testing is “semi-random”. On the other hand, the Test Suite will need to be similar to other test suites so that the data it outputs is valuable. Currently there is no way to tell how efficient the new implementation is compared to the existing system design.
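
As a rough illustration of the kind of comparison the Test Suite is meant to make possible, the Python sketch below times the same operation in two implementations on identical data and reports the median of several runs. The function names time_operation and compare, and the two stand-in operations in the usage example, are hypothetical placeholders; the real operations would be those of the existing system and of the XML implementation.

```python
import statistics
import time


def time_operation(operation, data, repeats=5):
    """Return the median wall-clock time, in seconds, of operation(data)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation(data)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(implementations, data, repeats=5):
    """Time each named implementation of an operation on the same data."""
    return {name: time_operation(operation, data, repeats)
            for name, operation in implementations.items()}


if __name__ == "__main__":
    # Stand-ins only: neither function is a real database operation.
    def existing_operation(data):
        return sorted(data)

    def new_operation(data):
        return sorted(data, reverse=True)

    results = compare(
        {"existing": existing_operation, "new": new_operation},
        list(range(100_000)),
    )
    for name, seconds in results.items():
        print(f"{name}: {seconds:.6f} s")
```

Taking the median of repeated runs, rather than a single measurement, reduces the effect of one-off delays on the reported time, which matches the proposal's aim of testing each operation several times.
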