Quick overview of AiiDA for computer scientists


1 Introduction to AiiDA

AiiDA is a Python layer built on top of a database. It provides an abstraction for running computations on specific crystal structures and querying the results.

Its main goal is to data-mine crystal structures, so that interesting ones can be discovered in an automated and less supervised way.

AiiDA is mainly aimed at materials scientists and physicists, which is why it exposes a fairly high-level language.

2 Data generation

All calculations are done on high-performance computing (HPC) clusters, which are accessed remotely.

AiiDA is a framework that submits the computations to the HPC clusters for you. It is fully automated: it takes care of running the calculations and retrieving the output data. It also manages the database and provides a high-level query language, so you can search the results easily without extensive knowledge of the backend mechanism.

AiiDA keeps track of the full history of every computation, together with properties such as its state or the machine on which it was executed.

3 Queries

In AiiDA, queries are executed through Django, on top of which the AiiDA developers have built their own query tool.

The query tool was made so that queries can be written without extensive knowledge of database languages or of the internal storage mechanism. It is an abstraction layer over the “real” queries, built specifically for scientists without an extensive understanding of databases. Note that this feature is under constant development and is highly likely to change and to offer more options in the near future.

A very common type of query is to find out whether two nodes are linked through a path in the AiiDA graph database, regardless of how many nodes lie in between. The Transitive Closure Table (TCT) was built to enable this feature. Under the hood, the TCT adds a direct edge between nodes that are already connected through a longer path, so that a query does not have to descend through the tree.

Redundancy buys speed

These additional edges are redundant, since the nodes were already linked. It turns out, however, that this overly connected graph is easier to query, and so we observe better performance.
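The idea of trading redundant edges for lookup speed can be sketched in a few lines of Python. This is a minimal illustration only, not AiiDA's implementation: the function names `transitive_closure` and `is_linked` and the tiny example graph are assumptions made for this sketch, and the closure is kept as a plain in-memory set of (ancestor, descendant) pairs rather than a database table.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return every (ancestor, descendant) pair connected by one or more edges."""
    children = defaultdict(set)
    for src, dst in edges:
        children[src].add(dst)

    closure = set()
    for start in list(children):
        # Walk everything reachable from `start` and record a direct pair for it.
        stack = list(children[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(children.get(node, ()))
    return closure


def is_linked(closure, src, dst):
    """With the closure precomputed, a path query is a single membership test."""
    return (src, dst) in closure


# Hypothetical chain a -> b -> c -> d (3 original edges).
edges = [("a", "b"), ("b", "c"), ("c", "d")]
closure = transitive_closure(edges)
redundant = closure - set(edges)  # pairs that exist only in the closure
```

In this example the three original edges become six stored pairs: the three extra ones, such as ("a", "d"), carry no new information, but they let `is_linked(closure, "a", "d")` answer without any traversal.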

Speed gain

With this feature, a query can use a breadth-first search (BFS) instead of the usual depth-first search (DFS), which greatly reduces its execution time. This optimization is the core reason for AiiDA’s performance, and it works very well on small datasets.

However, using the TCT has a big drawback: it takes up far more space, and the database grows exponentially with the number of elements.

It is not hard to imagine that this solution stops being viable once we have more data. In fact, in our tests we managed to hit the maximum number of TCT edges (which is bounded by a 32-bit addressable space) simply by generating “too much” data, which is clearly not scalable. In our opinion, some work is needed to cope with this limitation and to bring true scalability to AiiDA.
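To give a feeling for how quickly a 32-bit limit can be reached, here is a rough back-of-the-envelope sketch. It rests on simplifying assumptions that are not taken from AiiDA itself: the limit is treated as a signed 32-bit integer, the closure stores one row per reachable (ancestor, descendant) pair, and the graph is a single linear chain of n nodes, which yields n(n-1)/2 pairs. The helper names are hypothetical. A table that stores more than one row per pair would hit the limit sooner.

```python
# Largest value of a signed 32-bit integer (assumed form of the limit).
INT32_MAX = 2**31 - 1


def closure_rows_for_chain(n):
    """Closure rows for a linear chain of n nodes, one row per reachable pair."""
    return n * (n - 1) // 2


def smallest_chain_exceeding(limit):
    """Smallest chain length whose closure no longer fits under `limit` rows."""
    n = 1
    while closure_rows_for_chain(n) <= limit:
        n += 1
    return n


nodes_needed = smallest_chain_exceeding(INT32_MAX)
```

Under these assumptions a chain of roughly 65,000 nodes is already enough to exceed the limit, even though it contains only about 65,000 original edges.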

If we look carefully at what AiiDA’s query tool is trying to implement, we notice that it is in fact reimplementing a graph query language on top of a relational database system. From there, we asked ourselves whether it would be preferable to use a real graph database, which would already provide such a language by default and would support graph-specific queries natively.
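As an illustration of what a path query looks like when it is expressed directly on a relational database, without a closure table, the sketch below uses Python's built-in `sqlite3` module and a recursive common table expression. The table name `links`, its columns and the helper `is_reachable` are assumptions made for this example; they do not reflect AiiDA's actual schema or query tool.

```python
import sqlite3


def is_reachable(conn, src, dst):
    """Return True if `dst` can be reached from `src` by following links."""
    query = """
        WITH RECURSIVE reachable(node) AS (
            SELECT target FROM links WHERE source = ?
            UNION
            SELECT links.target
            FROM links JOIN reachable ON links.source = reachable.node
        )
        SELECT 1 FROM reachable WHERE node = ? LIMIT 1
    """
    # UNION (not UNION ALL) drops duplicates, so the recursion also ends on cycles.
    return conn.execute(query, (src, dst)).fetchone() is not None


# Hypothetical data: a chain 1 -> 2 -> 3 -> 4 stored as plain rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (source INTEGER, target INTEGER)")
conn.executemany("INSERT INTO links VALUES (?, ?)", [(1, 2), (2, 3), (3, 4)])
```

Here the traversal is recomputed on every call instead of being stored, which saves space at the cost of query time. A graph database exposes this kind of traversal as a native operation of its query language.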

4 Backend

AiiDA currently supports multiple backends.

SQLite

This database is offered only as a way to try out AiiDA and is not suitable for production use. Even though this solution works, it is neither scalable nor performant.

PostgreSQL

This is the database that the AiiDA developer community recommends. It is powerful and supports many different features.

MySQL

Yet another possible backend. It has some drawbacks that make us prefer PostgreSQL, such as a maximum length on text columns and limited time precision. Nevertheless, it is still a good option and has decent response times.

As of today, these backends are the only ones supported. However, the AiiDA developers are looking for and testing new ones and comparing their implementations, so new backends are likely to appear. Currently, the team is considering the use of a graph database such as Titan or Neo4j.

5 Future?

As we have seen, the Transitive Closure Table is going to cause problems in the future, when AiiDA has to scale and is brought into production.

The next challenge will therefore be to find a way to make everything scalable and easily maintainable.

Bring AiiDA to the BIG world

The main concern is that AiiDA will have to face big data, so why not prepare for it today? We should already start thinking about tomorrow. With that in mind, and knowing that huge clusters are awaiting AiiDA, we started looking for a suitable database system that would be strong enough to face tomorrow’s challenges.

As stated before, we investigated whether a graph database system could be better suited to AiiDA, since it would have most of the functions that the AiiDA developers are trying to implement enabled by default. As of today, we have run benchmarks for two different graph databases, namely Titan and Neo4j.

We selected them after reviewing over 20 different database systems (DBS). These two were, in our view, the best suited for this purpose.

Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan runs on three different storage backends (Apache Cassandra, HBase and Oracle BerkeleyDB).

Neo4j is an open-source graph database implemented in Java. It is mainly used in the marketing industries for high-speed trading. Its database can be accessed remotely, either through a shell or through a web browser, which makes it particularly easy to maintain.

Finally, we benchmarked the two DBS against AiiDA using sample queries that we selected ourselves. The dataset that was queried was also chosen by us and does not resemble a real AiiDA dataset (irrelevant tables were dropped for testing purposes).