06.2016 - Switch to PostgreSQL
June 2016 update: Optimizing the lookup phase
Switching from MongoDB to PostgreSQL
After a complete performance review of the lookup phase, we noted several major drawbacks of using MongoDB for Devsearch:
- Grouping a large number of rows is slow.
- Grouping does not leverage the available indexes.
- No way to implement a custom scoring function inside the DB (except with map-reduce, which currently performs poorly).
- 100 MB limit for grouping operations.
- Lack of thorough profiling tools for queries.
- The currently slow queries already rely on rarity-based approximations to speed up processing.
We have therefore switched to PostgreSQL, which gives us finer control over database tuning and a better understanding of query plans, lets us normalize the data in the database, and yields better results, all without approximations.
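To make the idea of grouping with a scoring function inside the database more concrete, here is a minimal sketch. It is not Devsearch's actual schema or query: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server, and the `occurrences` table, its columns, the sample feature strings, and the `score` formula are all hypothetical.

```python
import sqlite3

# Hypothetical schema: one row per (file, feature) occurrence.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE occurrences (file TEXT, feature TEXT, line INTEGER)")
conn.executemany(
    "INSERT INTO occurrences VALUES (?, ?, ?)",
    [
        ("a.scala", "import=scala.io", 1),
        ("a.scala", "call=println", 7),
        ("b.scala", "call=println", 3),
        ("b.scala", "call=println", 9),
        ("c.scala", "import=scala.io", 2),
    ],
)
# Index on the lookup key so matching rows can be found without a full scan.
conn.execute("CREATE INDEX idx_feature ON occurrences (feature)")


def score(distinct_matches, total_matches):
    """Hypothetical scoring: favour files that match many distinct query features."""
    return 10.0 * distinct_matches + total_matches


# Register the scoring function so it can be called from SQL.
conn.create_function("score", 2, score)


def lookup(conn, query_features):
    """Group matching rows by file and rank the files by the custom score."""
    placeholders = ",".join("?" for _ in query_features)
    sql = f"""
        SELECT file, score(COUNT(DISTINCT feature), COUNT(*)) AS s
        FROM occurrences
        WHERE feature IN ({placeholders})
        GROUP BY file
        ORDER BY s DESC, file
    """
    return conn.execute(sql, list(query_features)).fetchall()


results = lookup(conn, ["import=scala.io", "call=println"])
for file, s in results:
    print(file, s)
```

In PostgreSQL the scoring function would be defined in the database itself (for example with `CREATE FUNCTION`), and `EXPLAIN ANALYZE` could be used to inspect the query plan; the grouping query would keep the same overall shape.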
For a complete description of the new system, please refer to the paper that explains the migration process. The first part of that document also explains the inner workings of Devsearch's online part in detail.
Experimenting with a custom-made C++ lookup DB
Alongside the migration to PostgreSQL, we implemented a first draft of a custom-made C++ lookup database. It lets us manage and tune memory allocation manually and take advantage of memory locality during queries.
The implementation uses string dictionaries to reduce the amount of data read during lookup. Data is then stored and read sequentially, and a streamlined scoring aggregation phase minimizes memory allocation. The draft is available on GitHub and is at the proof-of-concept stage.
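The combination of string dictionaries, sequential storage, and a single aggregation pass can be sketched as follows. This is an illustration in Python rather than the C++ draft itself, so it shows the data layout and the scan, not the manual memory management. The class names, the sample data, and the scoring rule (one point per matching occurrence) are assumptions made for the example.

```python
from array import array


class StringDictionary:
    """Maps each distinct string to a small integer ID (dictionary encoding)."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def encode(self, s):
        """Return the ID of s, assigning a new one on first sight."""
        idx = self._ids.get(s)
        if idx is None:
            idx = len(self._strings)
            self._ids[s] = idx
            self._strings.append(s)
        return idx

    def lookup(self, s):
        """Return the ID of s, or None if s was never stored."""
        return self._ids.get(s)

    def decode(self, idx):
        return self._strings[idx]

    def __len__(self):
        return len(self._strings)


class LookupStore:
    """Hypothetical store: two parallel columns of integer IDs, scanned in order."""

    def __init__(self):
        self.files = StringDictionary()
        self.features = StringDictionary()
        self.file_col = array("I")     # contiguous unsigned ints
        self.feature_col = array("I")

    def add(self, file, feature):
        self.file_col.append(self.files.encode(file))
        self.feature_col.append(self.features.encode(feature))

    def query(self, query_features, top=10):
        # Translate the query strings to IDs once; unknown features are dropped.
        wanted = {self.features.lookup(f) for f in query_features}
        wanted.discard(None)
        # One accumulator slot per file, allocated once before the scan.
        scores = [0] * len(self.files)
        # Single sequential pass over the columns, comparing integers only.
        for file_id, feature_id in zip(self.file_col, self.feature_col):
            if feature_id in wanted:
                scores[file_id] += 1
        ranked = sorted(
            (i for i, s in enumerate(scores) if s > 0),
            key=lambda i: (-scores[i], i),
        )
        return [(self.files.decode(i), scores[i]) for i in ranked[:top]]


store = LookupStore()
for file, feature in [
    ("a.scala", "import=scala.io"),
    ("a.scala", "call=println"),
    ("b.scala", "call=println"),
    ("b.scala", "call=println"),
    ("c.scala", "import=scala.io"),
]:
    store.add(file, feature)

ranking = store.query(["import=scala.io", "call=println"])
```

Because the scan compares integer IDs instead of strings and writes into a preallocated score list, the per-row work stays small; a C++ version could go further by controlling the exact memory layout.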
Advantages:
- Faster
- Easier and more flexible implementation of scoring
Disadvantages:
- Prone to bugs
- No support for SQL and no runtime optimization
- Requires implementing a server to accept queries
TODO list:
- Implement a server to accept queries
- Further optimize memory allocation on aggregation
- Implement some form of persistence (on disk)
- Implement more scoring features (most notably: clustering)