Create a high-level Python query framework

The AiiDA project exposes a high-level Python API that lets scientists ask questions about the results of their calculations. At a lower level, these queries are executed either as SQL queries or through the MongoDB query engine.

A big-data query engine framework needs to be chosen (MapReduce, Pig, Hive, Spark, ...) that allows the same questions (or even more) to be answered efficiently, while still supporting the same high-level Python API as the current implementation.

Another improvement could be to find a way to remember sub-computations (for instance, by hashing the inputs and operations and storing the output) so that they need not be recomputed when another computation requires them, or when part of a computation fails.
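
As a rough illustration of that idea, a sub-computation could be cached under a key derived from its inputs and the name of the operation. This is only a minimal sketch: the cache layout, the hashing scheme, and all function names below are assumptions for illustration, not part of AiiDA.

import hashlib
import json

# In-memory cache mapping a content hash to a stored result.
# A real system would persist this, e.g. in the database or on disk.
_cache = {}

def _key(operation, inputs):
    # Hash the operation name together with a canonical form of the inputs.
    payload = json.dumps({"op": operation, "in": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def memoized(operation, inputs, compute):
    """Return a cached result if the same operation already ran on the
    same inputs; otherwise compute it and remember the output."""
    key = _key(operation, inputs)
    if key not in _cache:
        _cache[key] = compute(inputs)
    return _cache[key]

# Example: the second identical call returns the stored output
# without recomputing it.
result = memoized("square_sum", {"values": [1, 2, 3]},
                  lambda i: sum(v * v for v in i["values"]))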


Initial state:

Currently, the AiiDA project has two defined ways of querying the database. The first one is the Django ORM. It is intended for people with knowledge of the underlying database and its schema (typically, the project's developers). The other one is the Query Tool. It is an abstraction over the Django ORM and the inner workings of AiiDA's storage mechanism. It works by adding filters on the attributes of the Node before querying the database.
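
To make the contrast concrete, the two styles might look roughly as follows. The model, class, and method names are illustrative sketches of the two interfaces, not the exact AiiDA API.

# Django ORM style: the caller must know the tables and the schema.
# 'DbNode' and the filter field are placeholders for the real models.
calcs = DbNode.objects.filter(type__startswith="calculation.")

# Query Tool style: the caller filters on node attributes and lets the
# tool translate that into the underlying database queries.
q = QueryTool()
q.set_class(Calculation)                 # restrict to a node class
q.add_attr_filter("energy", "<=", 0.0)   # filter on a stored attribute
for node in q.run_query():
    print(node.uuid)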

Milestones:

Milestone 1: Have a working development environment. Port the Query Tool to work with SparkSQL. Further define the querying needs with the AiiDA project's team.

Milestone 2: Develop a simpler version of the Django ORM that works with SparkSQL. The aim of this milestone is to provide an implementation complete enough to handle most of the needs of the default AiiDA user. This milestone will be further defined in discussion with the AiiDA team.
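
As a sketch of the direction this milestone could take, a thin ORM-like wrapper could translate attribute filters into Spark DataFrame operations. Everything below (class name, Parquet path, column names) is an assumption for illustration, not the actual design.

from pyspark import SparkContext
from pyspark.sql import SQLContext

class NodeQuery(object):
    """Hypothetical ORM-like facade over a Spark DataFrame of nodes."""

    def __init__(self, sqlContext, path):
        # The Parquet file is assumed to hold one row per node.
        self._df = sqlContext.read.parquet(path)

    def filter(self, column, value):
        # Chainable filter, loosely mimicking Django's .filter().
        self._df = self._df.where(self._df[column] == value)
        return self

    def all(self):
        return self._df.collect()

sc = SparkContext(appName="NodeQuerySketch")
sqlContext = SQLContext(sc)
rows = NodeQuery(sqlContext, "/data/aiida/nodes.parquet") \
    .filter("type", "calculation").all()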

Milestone 3: Integrate the query engine with AiiDA, with simplicity of use and deployment for end users in mind.


Weekly Schedule:

Week 16.3.15 - 22.3.15: Set up a local installation of HBase and make it work with Spark. Implement some toy examples to get a feel for how HBase can be queried with Spark.
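
One such toy example might read an HBase table into a Spark RDD from PySpark, roughly in the style of the hbase_inputformat.py example shipped with Spark 1.x. The table name and ZooKeeper host are placeholders, and the converter classes come from the Spark examples jar, which is assumed to be on the classpath.

from pyspark import SparkContext

sc = SparkContext(appName="HBaseToyQuery")

conf = {
    "hbase.zookeeper.quorum": "localhost",        # assumed local HBase
    "hbase.mapreduce.inputtable": "aiida_nodes",  # hypothetical table
}
rdd = sc.newAPIHadoopRDD(
    "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
    "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
    "org.apache.hadoop.hbase.client.Result",
    keyConverter="org.apache.spark.examples.pythonconverters."
                 "ImmutableBytesWritableToStringConverter",
    valueConverter="org.apache.spark.examples.pythonconverters."
                   "HBaseResultToStringConverter",
    conf=conf,
)
print(rdd.take(5))  # peek at a few rows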

Week 23.3.15 - 29.3.15: Discover that HBase isn't really suited for the queries performed by the Query Tool: joins appear to be very frequent, and HBase supports joins very poorly. Discussion with the team lead and decision to move to another technology for storing the data, in order to facilitate the implementation and extension of the Query Tool.

Week 30.3.15 - 5.4.15: Familiarisation with the newly chosen technology (SparkSQL and Parquet). Start implementing the Query Tool.
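
A first familiarisation step might load node data from Parquet and query it with plain SQL. The file path, table name, and columns below are illustrative, not the actual AiiDA schema.

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="ParquetFamiliarisation")
sqlContext = SQLContext(sc)

# Load a hypothetical export of node data; Parquet stores the schema
# alongside the data, so no separate schema definition is needed.
nodes = sqlContext.read.parquet("/data/aiida/nodes.parquet")
nodes.registerTempTable("nodes")

# The Query Tool's attribute filters map naturally onto SQL predicates.
calcs = sqlContext.sql("SELECT uuid FROM nodes WHERE type = 'calculation'")
for row in calcs.collect():
    print(row.uuid)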

Week 6.4.15 - 12.4.15: Vacation. The query team nevertheless agreed to invest time in the Query Tool to recover the time lost due to the change in technology.

Week 13.4.15 - 19.4.15: Deliver the first milestone: the full implementation of the Query Tool.