Neo4j

Neo4j Backend for AiiDA

 

Introduction

 

 

Neo4j is an open-source graph database, implemented in Java. Neo4j was first released in 2010 and its adoption has grown ever since.

 

 

What we implemented

 

 

First, we wrote a Dockerfile that lets anyone automatically build a Neo4j instance on their own machine. This was the easiest way to make sure that the setup and the benchmarks can be reproduced. The Dockerfile does not cover two steps: installing the plugin, which still requires manual intervention because the plugin is at an early stage, and configuring Neo4j. The git repository contains an example of the configuration file that was used for the benchmarking. Note that this configuration file has most security features disabled and should not be used as is.

 

 

To produce our benchmarking results, we created scripts that export data from AiiDA in CSV format. From there, we re-import the data into Neo4j using Gremlin.
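The export step can be illustrated with a minimal sketch. It is not the script from the repository: the column names (`id`, `type`, `label`, `source`, `target`) and the in-memory input format are assumptions chosen for the example, and the real export reads from the AiiDA database.

```python
import csv
import io


def export_nodes_csv(nodes):
    """Serialize node dicts to CSV text with a header row.

    The field names are hypothetical; adapt them to the actual AiiDA columns.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["id", "type", "label"], lineterminator="\n"
    )
    writer.writeheader()
    for node in nodes:
        writer.writerow(node)
    return buf.getvalue()


def export_links_csv(links):
    """Serialize (source, target, label) tuples to CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["source", "target", "label"])
    writer.writerows(links)
    return buf.getvalue()


if __name__ == "__main__":
    # Made-up sample data, only to show the shape of the output.
    sample_nodes = [
        {"id": 1, "type": "data", "label": "structure"},
        {"id": 2, "type": "calculation", "label": "relax"},
    ]
    sample_links = [(1, 2, "input")]
    print(export_nodes_csv(sample_nodes))
    print(export_links_csv(sample_links))
```

Keeping nodes and links in two separate files mirrors the usual split between vertices and edges, so the import side can create all vertices first and then connect them.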

 

 

For Neo4j to support Gremlin, we had to install a plugin (https://github.com/neo4j-contrib/gremlin-plugin), since its native query language is Cypher. Bear in mind that the plugin is at a very early stage and does not work out of the box; some manual adjustments are needed to make it work.

 

 

Further work that can be done

 

 

It is clear that this database needs a lot of optimization, and some effort should be invested in rethinking the data representation. For the moment, the data schema is optimized for relational database systems (such as MySQL) and is not suited at all to graph database systems.

 

 

Another point is that we ran the Gremlin queries through a plugin, and Gremlin is not Neo4j's native language. Removing the plugin and rewriting the queries in Cypher could therefore be expected to improve performance. We originally kept the Gremlin scripts to be certain that the data and the queries were the same for Titan and Neo4j.
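As a rough illustration of what such a rewrite could look like, the sketch below builds a Cypher query that collects all descendants of a node up to a given depth, a typical provenance-style traversal. The `Node` label, the `uuid` property and the query itself are assumptions made for the example; they are not the queries used in the benchmark.

```python
def descendants_query(max_depth):
    """Build a Cypher query returning the descendants of one node.

    Cypher does not accept a query parameter as the bound of a
    variable-length pattern, so the depth is written into the query
    text. The node itself is selected with a parameter ($uuid in
    recent Neo4j versions; older versions use the {uuid} form).
    """
    if not isinstance(max_depth, int) or max_depth < 1:
        raise ValueError("max_depth must be a positive integer")
    return (
        "MATCH (start:Node {uuid: $uuid})-[*1.." + str(max_depth) + "]->(d:Node) "
        "RETURN DISTINCT d.uuid AS uuid"
    )


if __name__ == "__main__":
    print(descendants_query(3))
```

The resulting string would then be sent to Neo4j together with the `uuid` parameter through whichever client is in use.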

 

 

Last but not least, we did not use indexing at all, which prevents Neo4j from reaching its full performance. We wanted to benchmark the worst-case scenario, which is why we left it out. However, much of Neo4j's performance depends on indexing, so by rebuilding the data in a proper graph-oriented schema and indexing it carefully, one should be able to obtain much better results than we did.
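A possible starting point for adding indexes is sketched below: a small helper that generates Cypher `CREATE INDEX` statements for a list of label and property pairs. The `Node` label and the `uuid` and `type` properties are hypothetical, since no graph schema has been settled on. The statement syntax depends on the Neo4j version, so both forms are shown.

```python
def index_statements(indexes, legacy_syntax=True):
    """Return one CREATE INDEX statement per (label, property) pair.

    legacy_syntax=True  -> form used by Neo4j 2.x and 3.x
    legacy_syntax=False -> form used by Neo4j 4.x and later
    """
    statements = []
    for label, prop in indexes:
        if legacy_syntax:
            statements.append("CREATE INDEX ON :{}({})".format(label, prop))
        else:
            statements.append(
                "CREATE INDEX FOR (n:{0}) ON (n.{1})".format(label, prop)
            )
    return statements


if __name__ == "__main__":
    # Hypothetical label/property pairs, chosen only for illustration.
    for statement in index_statements([("Node", "uuid"), ("Node", "type")]):
        print(statement)
```

Indexing the property used to look up the starting node of a traversal is usually the first candidate, because every query has to find that node before it can follow any relationships.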