Quick overview of AiiDA for computer scientists
1 Introduction to AiiDA
- What is AiiDA?
AiiDA is a Python layer built on top of a database. It is meant as an abstraction for computing and querying specific crystal structures.
- What mission does it accomplish?
Its main goal is to data mine crystal structures, so that we can discover interesting ones in an automated and less supervised way.
- To whom is it relevant?
AiiDA is mainly meant for materials scientists and physicists, which is why it exposes a fairly high-level language.
2 Data generation
- Supercomputers
All calculations are done on high-performance computing (HPC) clusters, which are accessed remotely.
- AiiDA does the job for you!
AiiDA is a framework that submits computations to the HPC clusters for you. It is fully automated: it takes care of running the calculations and retrieving the output data. It also manages the database and provides a high-level query language, so that you can easily search the results without extensive knowledge of the backend mechanism.
- Computation history
AiiDA keeps track of the full computation history, together with the properties of each computation, such as its state or the machine on which it was executed.
3 Queries
In AiiDA, queries are expressed through Django, on top of which the AiiDA developers have built their own query tool.
- Query tool: an abstraction for ease of use
The query tool was made so that queries can be written without extensive knowledge of database languages or of the internal storage mechanism. It is an abstraction layer over the “real” queries, built specifically for scientists without a deep understanding of databases. Note that this feature is under constant development and is highly likely to change and to offer more options in the near future.
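As a rough illustration of what such an abstraction layer does, the sketch below turns keyword filters into a parameterized SQL query. Everything in it is hypothetical: the `node` table, its columns and the `NodeQuery` helper are invented for this example and are not AiiDA's schema or its query tool API.

```python
import sqlite3

# Hypothetical schema for illustration only; this is NOT AiiDA's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, type TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO node (id, type, state) VALUES (?, ?, ?)",
    [
        (1, "structure", "stored"),
        (2, "calculation", "finished"),
        (3, "calculation", "failed"),
    ],
)


class NodeQuery:
    """Hypothetical helper that hides SQL behind keyword filters."""

    ALLOWED = {"id", "type", "state"}  # columns the user may filter on

    def __init__(self, connection):
        self.connection = connection
        self.filters = {}

    def filter(self, **conditions):
        # Reject unknown column names so they never reach the SQL string.
        unknown = set(conditions) - self.ALLOWED
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self.filters.update(conditions)
        return self

    def all(self):
        # Build "SELECT ... WHERE col = ? AND ..." with bound parameters.
        sql = "SELECT id FROM node"
        if self.filters:
            sql += " WHERE " + " AND ".join(f"{col} = ?" for col in self.filters)
        sql += " ORDER BY id"
        rows = self.connection.execute(sql, list(self.filters.values()))
        return [row[0] for row in rows]


# The user never writes SQL: only the filters are specified.
finished = NodeQuery(conn).filter(type="calculation", state="finished").all()
print(finished)
```

The user-facing call reads like a description of the result, while the SQL is generated behind the scenes. This is the general idea of the abstraction, not a description of how AiiDA implements it.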
- Transitive Closure Table (TCT)
A very common type of query asks whether two nodes are linked by a path in the AiiDA graph database, regardless of how many nodes lie in between. The TCT was built to enable this. Under the hood, the TCT creates additional direct edges between nodes that are already connected through a path, so that a query does not have to traverse the tree in depth.
Redundancy buys the speed
These additional edges are redundant, since the nodes were already linked. But this more densely connected graph turns out to be easier to query, and we observe better performance as a result.
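The idea of trading redundant edges for speed can be sketched in a few lines. The example below is a minimal in-memory illustration, not AiiDA's implementation: the function `transitive_closure` and the node names are assumptions made for this sketch, and the closure is recomputed from scratch rather than maintained inside the database.

```python
from collections import defaultdict


def transitive_closure(edges):
    """Return every (ancestor, descendant) pair reachable via directed edges."""
    children = defaultdict(set)
    for parent, child in edges:
        children[parent].add(child)

    closure = set()
    for start in list(children):
        # Walk everything reachable from `start` and record one pair per node.
        stack = list(children[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(children.get(node, ()))
    return closure


# A simple chain A -> B -> C -> D with three direct edges.
direct_edges = [("A", "B"), ("B", "C"), ("C", "D")]
closure = transitive_closure(direct_edges)

# The redundant edges are the pairs that were only connected indirectly.
redundant = sorted(closure - set(direct_edges))
print(redundant)

# With the closure stored, "is D reachable from A?" is a single lookup.
print(("A", "D") in closure)
```

Once the closure is stored, a reachability question no longer requires walking the graph: it becomes a membership test on the table, at the cost of storing the three extra pairs shown above.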
Gain of speed
With this feature, we can run a breadth-first search (BFS) instead of the usual depth-first search (DFS), which greatly reduces the execution time of a query. This optimization is the core reason for AiiDA’s performance and works very well on small datasets.
But using the TCT has a big drawback: it uses far more space, and the database grows exponentially with the number of elements.
It is not hard to see that this solution stops being viable as the amount of data grows. In fact, in our tests we managed to hit the maximum number of TCT edges (which are stored in a 32-bit addressable space) simply by generating “too much data”, which clearly does not scale. In our opinion, some work is needed to cope with this limitation and to bring true scalability to AiiDA.
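To get a feeling for the space cost, the sketch below counts closure rows for the simplest possible graph, a single chain of n nodes, under the most favourable assumption that only one row is stored per reachable pair. The signed 32-bit limit `MAX_INT32` is an assumption about how the 32-bit space is used; the actual layout of AiiDA's table is not described here.

```python
def closure_rows_for_chain(n):
    """Rows needed for a chain of n nodes, one row per reachable pair."""
    return n * (n - 1) // 2


# Assumption: rows are addressed by a signed 32-bit integer.
MAX_INT32 = 2**31 - 1

# Compare direct edges (n - 1) with closure rows for a few chain lengths.
for n in (10, 1_000, 100_000):
    print(n, n - 1, closure_rows_for_chain(n))

# Find the shortest chain whose closure no longer fits in the 32-bit space.
n = 2
while closure_rows_for_chain(n) <= MAX_INT32:
    n += 1
smallest_overflowing_chain = n
print(smallest_overflowing_chain)
```

Even in this best case, the closure grows quadratically while the direct edges grow linearly, so under these assumptions a chain of a few tens of thousands of nodes is enough to exhaust the space. Graphs with many branching paths can only make this worse.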
- Re-implementation of a graph database
If we look carefully at what AiiDA’s query tool is trying to implement, we notice that it is in fact rewriting a graph query language on top of a relational database system. From there, we asked ourselves whether it would be preferable to use a real graph database, which would already provide such a language by default and would support graph-specific queries natively.
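To make the point concrete, here is what a path query looks like when it is written directly against a relational database without a closure table. This is a sketch using SQLite's recursive common table expressions; the `link` table and the helper names are hypothetical and do not reflect AiiDA's schema.

```python
import sqlite3

# Hypothetical edge table for illustration; not AiiDA's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE link (parent INTEGER, child INTEGER)")
conn.executemany(
    "INSERT INTO link (parent, child) VALUES (?, ?)",
    [(1, 2), (2, 3), (3, 4), (5, 4)],
)

# Recursive CTE: start from the direct children, then keep following links.
# UNION (not UNION ALL) removes duplicates, so the recursion terminates.
DESCENDANTS_SQL = """
WITH RECURSIVE descendants(id) AS (
    SELECT child FROM link WHERE parent = :start
    UNION
    SELECT link.child FROM link JOIN descendants ON link.parent = descendants.id
)
SELECT id FROM descendants ORDER BY id
"""


def descendants(connection, start):
    """Return all nodes reachable from `start`, at any depth."""
    rows = connection.execute(DESCENDANTS_SQL, {"start": start})
    return [row[0] for row in rows]


def is_reachable(connection, start, target):
    """Check whether a directed path exists from `start` to `target`."""
    return target in descendants(connection, start)


print(descendants(conn, 1))
print(is_reachable(conn, 1, 4), is_reachable(conn, 5, 1))
```

The relational engine can answer the question, but the traversal has to be spelled out by hand for every kind of path query. A graph database offers this kind of traversal as a built-in construct of its query language.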
4 Backend
AiiDA currently supports multiple backends.
- SQLite
This database is offered only for trying out AiiDA and is not suitable for production use. Even though this solution works, it is neither scalable nor performant.
- PostgreSQL
This is the database that the AiiDA developer community recommends. It’s powerful and supports many different features.
- MySQL
Yet another possible backend. It has some drawbacks that make us prefer PostgreSQL, such as the maximum length of text columns or its time precision. Nevertheless, it is still a good option and has decent response times.
- Alternatives?
As of today, these backends are the only ones supported. But the AiiDA developers are looking for and testing new ones and comparing their implementations, so new backends are likely to appear. Currently, the team is considering the possibility of using a graph database such as Titan or Neo4j.
5 Future?
As we have seen, the Transitive Closure Table is going to cause problems in the future, when AiiDA has to scale and is brought into production.
- One word: scalability
So the next challenge will be to make everything scalable and easily maintainable.
Bring AiiDA to the BIG world
The main concern is that AiiDA will have to face big data, so why not prepare for it today? We should already start thinking about tomorrow. With that said, we know that huge clusters are awaiting AiiDA, so we started looking for a suitable database system that would be strong enough to face tomorrow’s challenges.
- Our research
As stated before, we investigated whether a graph database system could be better suited to AiiDA, since it would have most of the functions that the AiiDA developers are trying to implement enabled by default. As of today, we have run benchmarks for two different graph databases, namely Titan and Neo4j.
We selected them after reviewing over 20 different DBS. These two were, in our view, the best suited for this purpose.
Titan is a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster. Titan runs on 3 different backends (Apache Cassandra, HBase and Oracle Berkeley DB).
Neo4j is an open-source graph database implemented in Java. It is mainly used in the marketing industries for high-speed trading. Its database can be accessed remotely, either from a shell or from a web browser, which makes it particularly easy to maintain.
In the end, we benchmarked the two DBS against AiiDA using sample queries that we selected ourselves. The dataset that was queried was also chosen by us and does not resemble a real AiiDA dataset (irrelevant tables were dropped for testing purposes).