Bigdata2015-AiiDA-extension

AiiDA Extension Project

 

This is NOT the official AiiDA project page; please follow this link to get to the official page.

Abstract: what is AiiDA?

 

Technological breakthroughs are often associated with the discovery of new materials. Just to mention a couple of examples, think about lithium batteries, new materials for efficient solar cells, multiferroics for improved data storage devices, new materials for microelectronics, ... Discovering a new material therefore means being able to make a significant impact on the daily life of billions of people.

 

Until a few years ago, discovery followed a slow, time-consuming process of trial and error in the laboratory, where hundreds of new materials had to be synthesized and measured to possibly find a promising one. Nowadays, however, materials properties can be accurately simulated by computers using advanced quantum-mechanical simulation tools that run on thousands of computer cores in big clusters or supercomputers. Therefore, a new approach called "High-throughput computational design of new materials" is emerging as the tool for the discovery of novel and more efficient materials for the most disparate applications (energy storage, energy harvesting, thermoelectrics, ...). Thousands of simulations can be launched to predict the properties even of materials that have not been synthesized yet.

 

The scientific community, however, has only now started to adapt to this new "Big Data" approach to computational science, and up to now calculations are typically automated only with a handful of custom scripts rewritten for each specific problem, using heterogeneous languages (shell scripts, awk, ...) and with the results sparsely stored in poorly organized files and directory structures.

 

To bridge the gap between quantum-mechanical simulations and Big Data, we have developed the Python software framework AiiDA (Automated Interactive Infrastructure and Database for Atomistic simulations, www.aiida.net), which provides an integrated solution for the calculation of materials properties and the analysis of results, and whose philosophy stands on four main pillars.

The source code has been released here.

A quick overview of AiiDA for computer scientists, giving the gist of the framework in a few lines, is available here.

 

Goals of this Project

All the code written for the project is available here: https://github.com/BIGDATA2015-AIIDA-EXTENSION

Part 1: Improving the existing AiiDA framework

AiiDA uses Django's ORM to execute queries. The framework exposes a simpler `Querytool` to help the end user ask fundamental questions about the data generated by the simulations.

 

We analyzed the current implementation, the database schema, and the most frequent queries. We improved the implementation by replacing the Django ORM with SQLAlchemy and implementing prefetching of nodes.
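
To give an idea of what this prefetching looks like, here is a minimal SQLAlchemy sketch; the `Node`/`Attribute` models below are hypothetical stand-ins, not AiiDA's actual schema.

```python
# Minimal SQLAlchemy sketch of node/attribute prefetching.
# Node and Attribute are hypothetical stand-ins, not AiiDA's real schema.
from sqlalchemy import Column, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, joinedload, relationship, sessionmaker

Base = declarative_base()

class Node(Base):
    __tablename__ = "node"
    id = Column(Integer, primary_key=True)
    type = Column(String)
    attributes = relationship("Attribute", back_populates="node")

class Attribute(Base):
    __tablename__ = "attribute"
    id = Column(Integer, primary_key=True)
    node_id = Column(Integer, ForeignKey("node.id"))
    key = Column(String)
    value = Column(String)
    node = relationship("Node", back_populates="attributes")

engine = create_engine("sqlite://")  # any SQLAlchemy-supported backend works
Base.metadata.create_all(engine)
session = sessionmaker(bind=engine)()

# joinedload pulls nodes and their attributes back in a single query,
# instead of one extra query per node (the "N+1" problem).
nodes = (
    session.query(Node)
    .options(joinedload(Node.attributes))
    .filter(Node.type == "calculation")
    .all()
)
for node in nodes:
    attrs = {a.key: a.value for a in node.attributes}  # no further queries issued
```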

 

Worked on by: Jocelyn Boullier, Souleimane Driss El Kamili

 

Part 2: Propose alternative solutions for the Backend

The current implementation of AiiDA can be used with a multitude of SQL backends (PostgreSQL, MySQL, SQLite). The aim of this part is to find alternative backends which are more specialized for the queries and data used and generated by AiiDA.

We implemented and benchmarked the following alternatives, which might be used in the future:

 

 

Please see the dedicated pages on those alternatives for further details and analysis.

 

Worked on by: Alexander Carlessi, Roger Küng, Souleimane Driss El Kamili

 

Part 3: Provide a simplified procedure to deploy AiiDA

 

Setting up an AiiDA instance can require a lot of work and time.

To simplify the process, we created specific Dockerfiles which allow AiiDA to be deployed with nearly one click.

For further details, please see the dedicated page for the Dockerfile.

 

Worked on by: Arthur Skonecki

 

 

Part 4: Structure Analysis

Query Tool

What has been done after the refocus of the project:

We found that the data provided by the AiiDA team and used in the query framework were nowhere near the size of what we would call “big data”. We therefore decided to focus on improving the existing query tool. The first issue we were asked to solve was the prefetching of attributes. Previously, the tool had to issue one query to get the nodes back, then another query for each node whenever an attribute was needed, which is highly inefficient. This problem was first solved by implementing prefetching with Django's ORM in the original query tool.
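
In Django's ORM this kind of prefetching is typically expressed with `prefetch_related`; the sketch below uses hypothetical model names (not AiiDA's actual ones) and is meant to live inside a Django app rather than run as a standalone script.

```python
# Hypothetical Django models standing in for AiiDA's node/attribute tables.
from django.db import models

class DbNode(models.Model):
    type = models.CharField(max_length=255)

class DbAttribute(models.Model):
    node = models.ForeignKey(DbNode, related_name="attributes", on_delete=models.CASCADE)
    key = models.CharField(max_length=255)
    value = models.TextField()

def calculation_attributes():
    # Without prefetch_related: one query for the nodes, then one query per node
    # when node.attributes is accessed (the "N+1" problem described above).
    # With prefetch_related: two queries in total, whatever the number of nodes.
    nodes = DbNode.objects.filter(type="calculation").prefetch_related("attributes")
    return {n.pk: {a.key: a.value for a in n.attributes.all()} for n in nodes}
```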

During the implementation, limitations of Django's ORM for fine-tuning queries appeared. It was decided to try a more complete ORM, SQLAlchemy. While more complex because it is closer to SQL, it gets in the way much less than Django when trying to optimize queries or translate SQL into Python. The query tool was rewritten with the possibility of switching backends: the new query tool provides a consistent interface to the user, type-checks the parameters and formats them into a common form, and then uses a query builder to create the query. Currently SQLAlchemy is the only builder used, but with the other work done on alternative backends, we can imagine using this mechanism to switch between them easily.
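
One way to picture this split between the user-facing query tool and a pluggable query builder is the rough sketch below; all class and method names are invented for illustration and are not the actual AiiDA API.

```python
# Invented sketch of the split between the user-facing query tool and an
# interchangeable backend query builder (not the actual AiiDA API).

class SqlAlchemyQueryBuilder:
    """Backend-specific part: turns normalized parameters into a real query."""
    def build(self, node_type, filters):
        raise NotImplementedError("would construct a SQLAlchemy query here")

class QueryTool:
    """User-facing part: consistent interface, parameter checking and normalization."""
    def __init__(self, builder):
        self._builder = builder

    def query(self, node_type, **attribute_filters):
        if not isinstance(node_type, str):
            raise TypeError("node_type must be a string")
        # Normalize the filters into a common form before handing them over.
        filters = {str(key): value for key, value in attribute_filters.items()}
        return self._builder.build(node_type, filters)

# Switching backends only means passing a different builder:
tool = QueryTool(SqlAlchemyQueryBuilder())
```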

The reimplementation of the Query Tool was also the occasion to write an API defining the features it should have and how it should expose them to the user. Many additions that go beyond the scope of this project were discussed with Giovanni; some of them still need to be specified.

Finally, two new features were implemented on top of the new API. The first is the ability to query the transitive closure table, that is, to filter nodes based on their children or parents even when they are separated by more than one edge, which was not possible before. The second is the possibility to chain queries together, allowing more complex queries to be created, even though for now this relies on a naïve approach.
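
Filtering through a transitive closure table boils down to an extra join; the SQLAlchemy Core sketch below illustrates the idea with invented table and column names, not AiiDA's actual schema.

```python
# Illustrative SQLAlchemy Core sketch of filtering nodes through a transitive
# closure table. Table and column names are invented, not AiiDA's actual schema.
from sqlalchemy import Column, Integer, MetaData, String, Table, select

metadata = MetaData()

node = Table(
    "node", metadata,
    Column("id", Integer, primary_key=True),
    Column("type", String),
)

# One row per (ancestor, descendant) pair, whatever the path length between them.
closure = Table(
    "node_closure", metadata,
    Column("parent_id", Integer),
    Column("child_id", Integer),
    Column("depth", Integer),
)

parent = node.alias("parent")
child = node.alias("child")

# All "calculation" nodes that have a "structure" node among their ancestors,
# no matter how many edges separate them.
query = (
    select(child.c.id)
    .select_from(
        child.join(closure, closure.c.child_id == child.c.id)
             .join(parent, closure.c.parent_id == parent.c.id)
    )
    .where(child.c.type == "calculation")
    .where(parent.c.type == "structure")
)
```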

We also wrote a graph generation script to benchmark and test the limits of the AiiDA Query Tool. In conjunction with the Titan benchmarks, this gave us insight into the structure of the graph. In particular, AiiDA uses a transitive closure table to handle queries on paths, at the cost of slower and more complex inserts. Our first generator hit the SERIAL integer limit on PostgreSQL with only 40K nodes: the graphs we generated were too densely connected and did not represent the AiiDA database well. Our second version came much closer to the connectivity of the data we were provided with.
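
The core idea of such a generator can be sketched in a few lines of Python; this is only an illustration, not the actual script, and the `max_parents` knob is a made-up parameter for controlling connectivity.

```python
# Illustrative generator of a sparse random DAG with controlled connectivity.
# Not the actual benchmarking script; max_parents is a hypothetical knob.
import random

def generate_dag(num_nodes, max_parents=3, seed=0):
    """Return a list of (parent, child) edges forming a sparse DAG.

    Each new node picks a few parents among previously created nodes, which
    keeps the graph acyclic and keeps the transitive closure from blowing up
    the way our first, overly connected generator did.
    """
    rng = random.Random(seed)
    edges = []
    for child in range(1, num_nodes):
        for parent in rng.sample(range(child), min(child, rng.randint(1, max_parents))):
            edges.append((parent, child))
    return edges

edges = generate_dag(40000)
```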

TODO:

 

Structure Comparison

 

During his PhD, Dr. Martin Uhrin generated millions of virtual structures.

 

We were given access to the MongoDB database containing the structures.

We first exported the database to JSON format and uploaded it to the cluster's HDFS file system.

The file amounts to 40 GB and contains 12 million structures.

 

We then wrote a parser in Scala in order to process the structures via Spark jobs.

 

To do computations on these structures, Dr. Martin Uhrin had developed a C++ tool (more than 40,000 LOC).

 

We ported a subset of the tool to Scala to be able to compare structures with each other.

Basically, the tool computes the distances between each pair of atoms in a structure and sorts them. Two structures are then considered similar if the root mean square of the difference between the two sets of distances is below a tolerance (in practice it is much more complicated).
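
A stripped-down version of that comparison can be written with NumPy; this is a simplification of the real algorithm, which also has to handle periodic images and the normalization discussed below.

```python
# Simplified sketch of the distance-based structure comparison.
# The real comparator also handles periodic boundary conditions and normalization.
import numpy as np

def sorted_pair_distances(positions):
    """Sorted distances between every pair of atoms (positions: N x 3)."""
    positions = np.asarray(positions, dtype=float)
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    i, j = np.triu_indices(len(positions), k=1)
    return np.sort(dists[i, j])

def are_similar(pos_a, pos_b, tolerance=1e-2):
    """Two structures are 'similar' if the RMS difference of their
    sorted distance lists is below the tolerance."""
    da, db = sorted_pair_distances(pos_a), sorted_pair_distances(pos_b)
    if len(da) != len(db):          # different atom counts: not comparable
        return False
    rms = np.sqrt(np.mean((da - db) ** 2))
    return rms < tolerance
```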

 

With the Scala implementation of the structure comparator, we wrote several Spark jobs to analyse the given structures.

 

1) Finding similarity with natural structures

We were given access to a small set of natural structures (a couple of thousand) and were asked to find similarities between the natural and the virtual structures.

 

A structure has the following properties: it has a set of atoms, and each atom belongs to a species (gold, oxygen, ...). For two structures to be comparable, they need to have the same number of atoms and the same species.

The virtual structures only have species A and B, which are not real species.

In order to compare the natural and virtual structures, we renamed the natural structure species as follows:

Let's take the example of the H2O structure. After renaming, we would have two distinct structures, A2B and B2A, which are now comparable with the synthetic structures.

So we map the natural species to all possible permutations of the virtual species.
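
In code, this amounts to trying every permutation of the virtual labels; a small sketch, assuming structures with exactly two species as in the A/B dataset:

```python
# Sketch: relabel a natural structure's species with every permutation of the
# virtual labels, so e.g. H2O yields both an "A2B" and a "B2A" variant.
from itertools import permutations

VIRTUAL_LABELS = ("A", "B")  # the synthetic dataset only uses two species

def renamed_variants(species):
    """species: list like ["H", "H", "O"] -> list of relabelled variants."""
    real = sorted(set(species))
    if len(real) != len(VIRTUAL_LABELS):
        return []  # not comparable with the two-species virtual structures
    variants = []
    for perm in permutations(VIRTUAL_LABELS):
        mapping = dict(zip(real, perm))
        variants.append([mapping[s] for s in species])
    return variants

print(renamed_variants(["H", "H", "O"]))
# [['A', 'A', 'B'], ['B', 'B', 'A']]
```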

 

We also need to normalize the distances between the atoms to have relevant results.

The distances between atoms must be normalized because two similar structures could have been measured in different environments (“unit cells”) or in different states. For instance, two structures could be similar, with one being a repetition of the other. The difference between two structures is therefore defined by a normalized distance between atoms rather than an absolute distance.
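
The exact normalization used by the comparator is more involved, but the idea can be illustrated by rescaling every structure to a common volume per atom before computing distances; this particular choice is an assumption made here for illustration only.

```python
# Illustration only: rescale positions so every structure has unit volume per atom.
# The actual normalization in the comparator is more sophisticated.
import numpy as np

def normalize_positions(positions, cell_volume):
    """Scale coordinates so that the volume per atom equals 1."""
    positions = np.asarray(positions, dtype=float)
    volume_per_atom = cell_volume / len(positions)
    scale = volume_per_atom ** (-1.0 / 3.0)
    return positions * scale
```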



 

The Spark job consists of the following steps (a sketch of the whole job follows the list):

  1. Parse the virtual and natural structures

  2. Normalize the distances between atoms

  3. Rename the natural structures to match the virtual ones

  4. Join the two sets of structures

  5. For every pair of natural and virtual structures, check if they are similar

  6. Group the similar structures together
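
The project's implementation of this pipeline is in Scala, but its overall shape can be sketched in PySpark. The JSON field names assumed by `parse_structure` are illustrative, and `normalize_positions`, `renamed_variants` and `are_similar` are the helpers sketched earlier on this page.

```python
# PySpark sketch of the natural-vs-virtual similarity job (illustrative only;
# the real jobs are written in Scala and the record layout is assumed).
import json
from pyspark import SparkContext

sc = SparkContext(appName="structure-similarity-sketch")

def parse_structure(line):
    """Assumed record layout: id, species list, Cartesian positions, cell volume."""
    d = json.loads(line)
    return d["id"], d["species"], d["positions"], d["cell_volume"]

def formula_key(species):
    """Structures are only comparable if atom count and species counts match."""
    return (len(species), tuple(sorted((s, species.count(s)) for s in set(species))))

natural = sc.textFile("hdfs:///structures/natural.json").map(parse_structure)
virtual = sc.textFile("hdfs:///structures/virtual.json").map(parse_structure)

# Steps 1-3: parse, normalize, and relabel the natural structures (A/B variants).
natural_keyed = natural.flatMap(
    lambda s: [(formula_key(v), (s[0], normalize_positions(s[2], s[3])))
               for v in renamed_variants(s[1])])
virtual_keyed = virtual.map(
    lambda s: (formula_key(s[1]), (s[0], normalize_positions(s[2], s[3]))))

# Steps 4-5: join comparable structures and keep the similar pairs.
similar_pairs = (natural_keyed.join(virtual_keyed)
                 .filter(lambda kv: are_similar(kv[1][0][1], kv[1][1][1]))
                 .map(lambda kv: (kv[1][0][0], kv[1][1][0])))  # (natural id, virtual id)

# Step 6: group the matching virtual structures per natural structure.
groups = similar_pairs.groupByKey().mapValues(list)
```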

 

We have multiple implementations of this Spark job in order to remain efficient in different situations (input sizes, ...). The jobs differ in the type of join and the scheduling of the operations.

 

2) Finding duplicates in the synthetic structures

The Spark job consists of the following steps:

  1. Parse the virtual structures

  2. Normalize the distances between atoms

  3. Join the structures with themselves

  4. For every pair of structures, check if they are similar

  5. Group the similar structures together

 

Another implementation, which is much more efficient but may lead to false positives and negatives, is to cluster similar structures together and then merge clusters in a reduce phase when they share similar structures.
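
One plausible realization of this idea, described here only as an illustration and not as the actual implementation, is to bucket structures by a coarse fingerprint of their sorted distances so that only structures within a bucket need to be compared; overly aggressive rounding is exactly what introduces the false positives and negatives mentioned above.

```python
# Sketch of the approximate de-duplication idea: bucket structures by a coarse
# fingerprint so that only structures falling in the same bucket get compared.
# sorted_pair_distances is the helper sketched earlier on this page.
import numpy as np

def fingerprint(positions, decimals=1):
    """Coarse key derived from the sorted pair distances."""
    return tuple(np.round(sorted_pair_distances(positions), decimals))

# In Spark this becomes a groupByKey on the fingerprint, e.g.:
#   candidate_groups = (virtual.map(lambda s: (fingerprint(s[2]), s[0]))
#                              .groupByKey()
#                              .mapValues(list)
#                              .filter(lambda kv: len(kv[1]) > 1))
```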

 

 

Data aggregation and extraction from Structure collection

 

Data used

Dr. Martin Uhrin generated a large database of structures for his PhD thesis. This database was then used to extract a lot of information and generate plots from it (example figure: 5V8S87M.png).

 

The main problem with this was the time it took to perform the necessary aggregations to get the data used in those plots. It took around 5 hours to build the ones in the thesis.

 

Using Spark, we managed to get better performance at extracting the information for those plots. The algorithm we developed generates maps from a list of input parameters, each element of the input list leading to a whole family of maps. Here is some timing information as a function of the size of the input list (note that these sizes are realistic: what is large is the number of structures aggregated to generate a map, not the number of input parameters):

 

1 element -> 5 minutes 32 seconds

2 elements -> 3 minutes 35 seconds

3 elements -> 5 minutes 18 seconds

4 elements -> 4 minutes 33 seconds

5 elements -> 3 minutes 51 seconds

6 elements -> 3 minutes 44 seconds

7 elements -> 4 minutes 49 seconds

8 elements -> 4 minutes 21 seconds

 

The differences between the results are mainly due to circumstances and the usage of the cluster at the time the jobs were run.

 

On average a job took 4 minutes and 28 seconds, which is around 67 times faster than the previous 5-hour runs.

Detailed documentation of how those maps are generated is available in the code itself.

Clustering


The goal of this part of the project was to detect whether a structure is two-dimensional or not.


To do this, we first expand the structure to a supercell; this means that we periodically repeat the atoms of the structure in all three dimensions. Then we perform hierarchical clustering with minimal (single) linkage on these atoms, using the Euclidean distance between them.
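
Expanding to a supercell just means translating every atom by integer combinations of the lattice vectors; a small NumPy sketch of the idea:

```python
# Sketch: replicate a structure's atoms over neighbouring periodic images.
import numpy as np
from itertools import product

def make_supercell(positions, lattice, repeats=1):
    """positions: N x 3 Cartesian coordinates; lattice: 3 x 3 matrix of cell
    vectors (one per row). Returns the atoms replicated over (2*repeats+1)^3 cells."""
    positions = np.asarray(positions, dtype=float)
    lattice = np.asarray(lattice, dtype=float)
    images = []
    for i, j, k in product(range(-repeats, repeats + 1), repeat=3):
        shift = i * lattice[0] + j * lattice[1] + k * lattice[2]
        images.append(positions + shift)
    return np.vstack(images)
```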



Hierarchical clustering


We used hierarchical clustering in an agglomerative way: each atom starts as its own cluster, and we merge clusters until we have the expected number of clusters. To find the clusters to merge, we use the minimal-linkage approach because it is the one that best fits our need to find layers. We implemented the clustering as follows: first we compute the pairwise distances between all atoms, then we sort these distances in ascending order, and finally we iterate through these distances and merge the corresponding clusters if the two atoms are not already in the same cluster.
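
A compact version of that procedure, using a union-find structure to track cluster membership while merging along the sorted pairwise distances (a simplification of the actual code):

```python
# Simplified single-linkage agglomerative clustering: sort the pairwise
# distances, then merge clusters (union-find) until the target count remains.
import numpy as np

def single_linkage_clusters(positions, n_clusters):
    positions = np.asarray(positions, dtype=float)
    n = len(positions)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise Euclidean distances, sorted in ascending order.
    pairs = [(np.linalg.norm(positions[i] - positions[j]), i, j)
             for i in range(n) for j in range(i + 1, n)]
    pairs.sort()

    remaining = n
    for _, i, j in pairs:
        if remaining == n_clusters:
            break
        ri, rj = find(i), find(j)
        if ri != rj:            # only merge atoms in different clusters
            parent[ri] = rj
            remaining -= 1

    return [find(i) for i in range(n)]

# Example: two well separated "layers" along z come out as two clusters.
pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 5], [1, 0, 5], [0, 1, 5]])
print(single_linkage_clusters(pts, 2))
```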


We chose hierarchical clustering because we needed to cluster layers and not spheres. Using k-means would have failed badly in situations like these, where the layers are close together.




 

Metrics


One problem with the clustering method we used was finding the correct number of clusters.

First we tried the Calinski-Harabasz index, which did not work well because it uses the variance in “3D” and was thus not suited to our needs.

After that, we developed a metric using the rank of a matrix. For each cluster and each atom, we computed a matrix of the distances between the atom and the replicates of this atom inside the cluster. This gave us a list of ranks for each cluster, from which we took the maximum; finally, we took the maximum rank over all clusters as the metric for the clustering.

We also used a very simple metric: the minimal size of a cluster.

The final metric that we use computes a plane for each cluster. This plane minimizes the sum of the distances between the atoms in the cluster and the plane. We use the mean of these distances as a metric, which is very useful to see whether a cluster is 2D or not.
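
A standard way to compute such a metric is a least-squares plane fit (minimizing squared distances, which is close in spirit to what is described above): the best-fit plane passes through the centroid, its normal is the singular vector with the smallest singular value, and the metric is the mean distance of the cluster's atoms to that plane. A sketch:

```python
# Sketch of the planarity metric: fit a least-squares plane to a cluster and
# report the mean distance of its atoms to that plane (close to 0 for a layer).
import numpy as np

def mean_distance_to_best_plane(positions):
    positions = np.asarray(positions, dtype=float)
    centroid = positions.mean(axis=0)
    centered = positions - centroid
    # The plane's normal is the right singular vector with the smallest
    # singular value of the centered coordinates.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]
    distances = np.abs(centered @ normal)
    return distances.mean()

# A flat layer gives a value near 0, a 3D blob does not.
layer = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0.01]])
print(mean_distance_to_best_plane(layer))
```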


By combining the minimal cluster size and the plane metric, we were able to get the optimal number of clusters every time.


 

 

 

Worked on by: Allan Renucci, Jeremy Rabasco, Lukas Kellenberger, Martin Duhem

Team

Amir Shaikhh (TA)

Roger Küng (Team Lead)

Jeremy Rabasco

Artur Skonecki

Alexandre Carlessi

Jocelyn Boullier

Lukas Kellenberger

Martin Duhem

Allan Renucci

Souleimane Driss El Kamili