Transform existing database model to HDFS

Currently the AiiDA project stores its computed data in a relational database (MySQL) and in MongoDB.

These storage formats are not optimal for future analysis once larger volumes of data accumulate. The aim of this part of the project is to understand the entities currently in use and to find a way to store them so that they remain efficiently queryable.

This project is composed of multiple parts. From the current usage of AiiDA there is already a considerable amount of data stored in MySQL, which can be exported in a JSON format.

Currently the AiiDA team experiences a bottleneck when inserting new data; to alleviate this, we are moving to a cluster-backed storage format.

It is of major importance that the current high-level interface of the framework remains transparently supported. The storage option should only have to be specified once, at setup time; afterwards the user can use the framework in the same way as before and does not need to know whether the underlying data storage is the big-data cluster version or the locally optimised one.

For the cluster version we initially planned to use HBase as our database model. HBase is a wide-column store which, after a first inspection of the data, appeared to be well suited. After further research we discovered that the work of the query team would become very complex, because there is no framework on top of HBase that facilitates relational operations. The main advantage of HBase is its write optimisation, but AiiDA data is largely immutable (once written, the data mostly does not allow modification), so the store should be read-optimised instead. An efficient join on HBase requires a deep understanding of its architecture and should only be attempted by experts; even then, joins are hard to implement so that they work in all possible combinations. Joins are usually avoided by de-normalisation, which is not suitable for this project given the nature of the queries required by AiiDA users.

We therefore decided to meet with a database expert to find out whether we could use a more suitable format to store and query the data on the cluster.

Instead of HBase we will use Parquet to store the data. Parquet is also a columnar format, but it can be queried with an SQL-like syntax through Spark and allows joins to be performed efficiently. In addition it uses optimised per-column compression, which should allow faster reads, since this kind of experimental data most likely compresses very well column by column.
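As an illustration, below is a minimal sketch of the intended workflow; the table layout, column names and HDFS paths are placeholders rather than the final schema, and the API shown is the SparkSession-based one (older Spark releases expose the same operations through SQLContext).

```python
# Sketch only: write two placeholder tables as Parquet on HDFS and join them
# through Spark SQL instead of hand-written wide-column scans.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aiida-parquet-sketch").getOrCreate()

structures = spark.createDataFrame(
    [(1, "GaAs"), (2, "Si")], ["structure_id", "formula"])
calculations = spark.createDataFrame(
    [(10, 1, -42.0), (11, 2, -31.5)], ["calc_id", "structure_id", "energy"])

# Columnar, compressed storage on the cluster (paths are placeholders).
structures.write.mode("overwrite").parquet("hdfs:///aiida/structures.parquet")
calculations.write.mode("overwrite").parquet("hdfs:///aiida/calculations.parquet")

# Register the Parquet files as SQL tables and express the join in plain SQL.
spark.read.parquet("hdfs:///aiida/structures.parquet").createOrReplaceTempView("structures")
spark.read.parquet("hdfs:///aiida/calculations.parquet").createOrReplaceTempView("calculations")
joined = spark.sql("""
    SELECT s.formula, c.energy
    FROM structures s JOIN calculations c ON s.structure_id = c.structure_id
""")
joined.show()
```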

As the data created by the framework is highly structured, SparkSQL will provide an efficient way to query it. Queries return the data as RDDs, on which analytics can then be run in a parallelised way using Spark's core implementation.
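Continuing the placeholder schema from the sketch above, the pattern would look roughly like this: a structured SQL query first, then the underlying RDD handed to Spark core for parallel aggregation (the table and column names are assumptions, not the final schema).

```python
# Sketch: query the Parquet data via SparkSQL, then run parallel analytics
# on the resulting RDD with plain Spark operations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aiida-analytics-sketch").getOrCreate()

# Register the (hypothetical) Parquet table written in the previous sketch.
spark.read.parquet("hdfs:///aiida/calculations.parquet") \
    .createOrReplaceTempView("calculations")

# Structured query through SparkSQL ...
low_energy = spark.sql(
    "SELECT structure_id, energy FROM calculations WHERE energy < -35.0")

# ... then drop to the RDD and aggregate in parallel with Spark core.
counts = (low_energy.rdd
          .map(lambda row: (row.structure_id, 1))
          .reduceByKey(lambda a, b: a + b))
print(counts.collect())
```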

What will be worked on:

The query and data-mining teams need some data structures as soon as possible to test their implementations. The data migration team is therefore on the critical path and needs to provide them with a dataset as quickly as possible.

There is a huge dataset of structures created by a collaborator from the same lab where the AiiDA project was born. Unfortunately, this data is stored in a different format than that of the normal AiiDA project.

Therefore the team will first focus on creating a parser that takes the external dataset (created for a PhD thesis) and imports it into the Parquet format. As soon as this is finished, the other teams will have enough structures to start implementing their parts -> Milestone 1
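A hedged sketch of what such a converter could look like, assuming the external dataset is available as newline-delimited JSON records with hypothetical fields uuid, cell and sites (the real field names will depend on the exported format):

```python
# Sketch of the planned converter: read the external structure records as JSON
# and write them out as a Parquet table on HDFS. Field names and paths are
# assumptions for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("external-dataset-import").getOrCreate()

raw = spark.read.json("hdfs:///import/external_structures.json")  # placeholder path

structures = raw.select(
    col("uuid").alias("structure_id"),
    col("cell"),    # lattice vectors, kept as a nested array column
    col("sites"),   # atomic sites, kept as a nested array of structs
)

structures.write.mode("overwrite").parquet("hdfs:///aiida/external_structures.parquet")
```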

The next step will be to create a parser that imports the data created by the AiiDA project itself (JSON format). This will give the query and data-mining teams even more raw data to work with. In particular, there will be new data types to store: the first dataset contains only structures, whereas the AiiDA dataset contains structures, calculations, codes and more. These nodes are not isolated but are linked together in a graph structure, which also needs to be stored -> Milestone 2
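One possible layout, sketched below under assumed field names, is to store the nodes and the links between them as two separate Parquet tables and reassemble the graph with joins:

```python
# Sketch: nodes (structures, calculations, codes, ...) in one Parquet table,
# edges in a separate "links" table, so the graph can be traversed with joins.
# All field names, types and paths are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("aiida-graph-import").getOrCreate()

nodes = spark.read.json("hdfs:///import/aiida_nodes.json")   # assumed fields: id, type, ...
links = spark.read.json("hdfs:///import/aiida_links.json")   # assumed fields: input_id, output_id, label

nodes.write.mode("overwrite").parquet("hdfs:///aiida/nodes.parquet")
links.write.mode("overwrite").parquet("hdfs:///aiida/links.parquet")

# Example traversal: which nodes are produced from a structure as input.
nodes.createOrReplaceTempView("nodes")
links.createOrReplaceTempView("links")
spark.sql("""
    SELECT n_out.id, n_out.type
    FROM links l
    JOIN nodes n_in  ON l.input_id  = n_in.id
    JOIN nodes n_out ON l.output_id = n_out.id
    WHERE n_in.type = 'structure'
""").show()
```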

Once all the data is stored and ported, the team needs to work on the full integration with the rest of the AiiDA framework; this means that the storage procedure must now work transparently no matter which underlying technology (MySQL or Parquet) is used.
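A rough sketch of the kind of abstraction this integration could introduce; the class and function names below are illustrative and not the actual AiiDA API:

```python
# Sketch: the backend is chosen once, from the option given at installation
# time; the rest of the framework only talks to the common interface.
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Common interface the framework code programs against."""

    @abstractmethod
    def save_node(self, node: dict) -> None: ...

    @abstractmethod
    def query(self, sql: str):
        """Run a query and return an iterable of result rows."""


class MySQLBackend(StorageBackend):
    def save_node(self, node):
        ...  # write through the existing relational models

    def query(self, sql):
        ...  # delegate to the existing SQL layer


class ParquetBackend(StorageBackend):
    def save_node(self, node):
        ...  # append to the Parquet tables on HDFS via Spark

    def query(self, sql):
        ...  # run the statement through SparkSQL


def get_backend(profile: dict) -> StorageBackend:
    """Pick the backend once, based on the configured storage option."""
    return ParquetBackend() if profile.get("storage") == "parquet" else MySQLBackend()
```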

Additionally, well-written documentation needs to be provided, explaining how the new system can be installed and configured during the installation of AiiDA -> Milestone 3

Tentative Schedule

Milestone 1:

16.3.15 - 22.3.15: Migrate the data structures from the exported JSON format (from MongoDB) to an HBase-compatible format. Create a parser that takes the JSON documents and populates the wide-column store with the structures.

21.04.15 skonecki - added a script that pulls documents from smallexports and pushes them into HBase: https://github.com/BIGDATA2015-AIIDA-EXTENSION/data_migration/blob/master/import_mongo/import_json.py. If you have an HBase setup (I use https://github.com/dajobe/hbase-docker) you can try it.

23.3.15 - 29.3.15: Discovered that the HBase format is not well suited to the query team's needs. Restarted the search for a better solution to facilitate the query team's implementation.

30.3.15 - 5.4.15: Split the work for the two data sets (Mongo and MySQL data); the team works in parallel on both sets and provides the schema description of the Parquet files to the query team.

6.4.15 - 12.4.15: Vacation -> the team agreed to work during the vacation to implement the parsers for the two milestones.

6.4.15 - 12.4.15: Deliver Milestones 1 and 2 together (finish the parsers to import the data); Milestone 1 was delayed due to the change in the underlying technology.

13.4.15 - 19.4.15: ...