Map-Reduce API

DRAFT

MapReduce flow on the P2P network:

  1. An edge peer P receives a Job, consisting of Job1(JobId, ResourcesNeeded, MapFunction, ReduceFunction).
  2. Peer P adds a MapFile M1(JobId, ResourcesNeeded, MapFunction, InitiatorPeer) to the index, with all chunks empty. The content of each chunk of the MapFile can be obtained by applying the Map function of Job1 to the Resources: each chunk corresponds to the mapping of one chunk of the resources. These chunks are then further split into several parts, one per key the mapper discovers (KeyChunks).
  3. Peer P advertises that the index has been modified with a new MapFile M1.
  4. Neighbors receive the update. They check whether they can already get chunks of the MapFile from their neighbors. Otherwise, they check which Resources they have and pick some at random to create the chunks (Map).
  5. Each time a mapper finishes mapping a chunk, it sends a ChunkMapped(JobId, ChunkId, List(keys)) message, listing all the keys discovered in the chunk, to the initiator. The initiator keeps track of the keys, the mappers, and the finished mappings to facilitate the work of the reducers.
  6. For each new key the initiator receives, it creates a ReduceFile R1(JobId, Key, ResourcesNeeded, ReduceFunction, List(Mapper Peers), InitiatorPeer) and signals an index update. The ResourcesNeeded are all the KeyChunks corresponding to the Key, so a reducer must obtain the right KeyChunk of every chunk created during the Map phase before starting.
  7. Neighbors that ask for a ReduceFile of a particular key become responsible for the reduction of that key. They obtain the ReduceFile chunks either by grabbing the KeyChunks and reducing them, or by copying them from another source, such as another file (the preferred way).
  8. When a reducer has all the KeyChunks of its key, it applies the Reduce function to them to fill the ReduceFile. When the reduction is complete, the reducer sends a ChunkReduced(JobId, Key) message to the initiator so that it can track progress.
  9. The operation is known to be finished when all ReduceFiles are stabilized, or when the initiator has received a ChunkReduced message for every key in its list from live peers.
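The records exchanged in the flow above could be sketched as plain data types. The names follow the draft; the exact fields and their types are assumptions, not a fixed wire format:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Job:
    """Step 1: the job received by the edge peer."""
    job_id: str
    resources_needed: List[str]          # ids of the resource chunks
    map_function: Callable
    reduce_function: Callable

@dataclass
class ChunkMapped:
    """Step 5: sent by a mapper to the initiator when a chunk is done."""
    job_id: str
    chunk_id: str
    keys: List[str]                      # keys discovered in the chunk

@dataclass
class ReduceFile:
    """Step 6: one per key, advertised by the initiator via an index update."""
    job_id: str
    key: str
    resources_needed: List[str]          # the KeyChunks for this key
    mapper_peers: List[str]

@dataclass
class ChunkReduced:
    """Step 8: sent by a reducer to the initiator when its key is reduced."""
    job_id: str
    key: str
```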
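The initiator's bookkeeping in steps 5, 6 and 9 can be sketched as follows. It records which mapper peers hold KeyChunks for which keys, creates one ReduceFile description per new key, and declares the job finished once a ChunkReduced has arrived for every known key. The class and method names are assumptions for illustration:

```python
class Initiator:
    """Sketch of the initiator's state: keys seen, mapper peers per key,
    ReduceFiles created, and keys already reduced."""

    def __init__(self, job_id):
        self.job_id = job_id
        self.mappers_by_key = {}   # key -> set of mapper peers holding its KeyChunks
        self.reduce_files = {}     # key -> ReduceFile-like description
        self.reduced_keys = set()

    def on_chunk_mapped(self, mapper_peer, keys):
        """Step 5: record the keys a mapper discovered in a chunk."""
        for key in keys:
            if key not in self.mappers_by_key:
                self.mappers_by_key[key] = set()
                # step 6: a new key means a new ReduceFile and an index update
                self.reduce_files[key] = {"job_id": self.job_id, "key": key}
            self.mappers_by_key[key].add(mapper_peer)

    def on_chunk_reduced(self, key):
        """Step 8: a reducer reports that this key is fully reduced."""
        self.reduced_keys.add(key)

    def finished(self):
        """Step 9: ChunkReduced received for every key in the list."""
        return self.reduced_keys >= set(self.mappers_by_key)
```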
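Steps 7 and 8, on the reducer side, amount to collecting the KeyChunks of one key from every mapped chunk and folding them with the Reduce function. A minimal sketch, assuming a `(key, values)` signature for the reduce function (the summing reduce function is a hypothetical example):

```python
def reduce_key(reduce_function, key, key_chunks):
    """Step 8: once the reducer holds the KeyChunk of its key from every
    chunk created during the Map phase, flatten them and reduce. The
    result becomes the content of the ReduceFile; the reducer would then
    send ChunkReduced(JobId, Key) to the initiator."""
    values = [v for chunk in key_chunks for v in chunk]
    return reduce_function(key, values)

def sum_reduce(key, values):
    """Hypothetical example reduce function: sum the values of a key."""
    return key, sum(values)
```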

Sources:

http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html

http://research.google.com/archive/mapreduce.html