See Moodle or this PDF document.


We want to build a YARN application performing transactional key-value storage. In other words, it should be able to run transactions in parallel on different node of the cluster. Each of these transactions consists of key-value operations.The key design principle here is locality, i.e.:

The high-level architecture is illustrated in the schema below. It consists of a client (maybe several), a ResourceManager which has some knowledge of the network, and a bunch of Transaction Managers which runs transactions. We think that each transaction will be given by the user an associated locality hash which determines which Transaction Manager will master the transaction execution (the Resource Manager is aware of which locality hash maps to which Transaction Manager). Then, this would be the responsibility of the master Transaction Manager to communicate with other Transaction Managers if it needs data.

The other key design principle here is the use of hashing. The transaction algorithms (2PL, MVCC) might require exchanges between nodes but we want to keep this number small by having a hash function mapping keys to nodes easily (avoiding indirect exchanges with the Resource Manager to know where the data is).


Transaction manager

The transaction manager is responsible for a bunch of things:

  1. Running transactions (in YARN containers?); and
  2. Dealing with the transactional algorithms (concurrency control, versioning, ...).
To keep the design as modular as possible, we decided to separate the concurrency controller module from the versioning and lockings units. The concurrency controller is responsible for running transaction in the proper way. To do that properly, it in turn interacts with versioning and locking units.


The project might be very ambitious for a 9-weeks project. We therefore consider some simplifications in the design that then can be unsimplified at the end of the project (some of them at least).

Key-values are be stored in non-persistant memory.

The main topic of this project being the transaction manager, not the key-value store, we consider that local in-memory storage is enough.

Dummy machine assignments based on the hash of the key.

At first, we plan to use very simple hard-coded static hashing mechanism even if we hope to come up with a distributed dynamic hashing mechanism by the end of the project.

Transaction Managers do not communicate and no data is exchanged.

The transaction managers only deal with their local data and never access data they don't have locally. This is the simplification that must be unsimplified first.


Deliverable 1: Feasibility Study to Build on Top of YARN

The main goal of this first stage would be to come up with a design inside YARN to show that the design is indeed possible. This mainly consists in studying YARN and setting up the system with dummy components.

How do we build these nodes in YARN? Philémon, Georges
Simple 1-day in-memory key-value store implementation Georges
Client-Transaction Manager communication Ege, Sachin
Concrete definition of interfaces of project's modules Andy, Florian

Estimated duration: 2 weeks - Deadline: March, 23th

Deliverable 2: Initial Prototype

The goal of this second stage is to implement the key modules with all the above-mentioned simplifications.

MVTO implementation, versioning subsystem Florian, Andy
Locking subsystem
Georges, Philémon
2PL, MVCC2PL implementation Sachin, Ege
Transaction Manager infrastructure implementation Georges, Philémon

Estimated duration: 4 weeks - Deadline: April, 29th

Deliverable 3: Make it work well!

The goal of this last phase is to get rid of the maximum number of simplifications discussed earlier. 

Benchmarking system Andy, Florian

Network infrastructure needed for distribution
and deadlock prevention, cluster testing

Georges, Philémon
2-phase commit implementation Sachin
Centralized dead lock prevention Ege
Test framework for concurrent schedule testing Sachin, Andy & Philémon