 français
 English
Bigdata2015hypercubejoins
This wiki page is about the Big Data course's project Hypercube Joins.
Hypercube Partitioning for nway Joins
The goal:
To implement Hypercube Partitioning for nway joins in Squall/Storm, a distributed online processing engine. In contrast to MapReduce which does batch processing, Squall/Storm [1] is designed for online processing.
Hypercube partitioning
At a high level, the join operator in a distributed setting consists mainly of two tasks:
a) partitioning data tuples among machines (reducers) and
b) performing the join locally on each reducer.
Partitioning schemes
Key partitioning: Traditionally, join processing in a distributed setting relies on key partitioning [2] of the data such that tuples from opposite relations with the same key eventually reside on the same machine. This technique works only for equijoins. Furthermore, even in the case of equijoin, the load will be irregularly distributed amongst machines in the presence of data skew. In particular, a small number of machines processes most of the keys, whereas others are idle most of the time. This prevents load balancing and leads to very inefficient execution plans.
OneBucket partitioning: An alternative partitioning scheme called one bucket partitioning is proposed by Okcan et al. [3]. Instead of partitioning the input tuples by the join key, conceptually the output tuples are partitioned across machines. The output tuples are represented as cells in a matrix called a join matrix. Figure 1 illustrates a band join matrix between two relations R1 and R2, where the dark cells represents output tuples. We cover the matrix with rectangular regions, such that each output cell belongs to exactly one region, and each region is assigned a single machine. An incoming tuple is replicated among all the regions (machines) in a row or a column of a matrix. The objective is to minimize the input tuple replication. We made a short presentation of main concepts in the paper [4].
Hypercube partitioning: Hypercube partitioning is a generalization of the join matrix model from 2way joins to nway joins. Instead of matrices, we have hypercubes, which are divided into smaller hypercubes (instead of regions). Again, the objective is to minimize the input tuple replication size of each sub hypercube (machine load). This is formalized and explained in Section 5.1 of Zhang’s paper [5]. In this paper, the authors are using Hilbert space filling curve
Our current team members:
 Guliyev Khayyam
 Gueniat Patrice
 Nguyen Thanh Tam
 Vu Minh Khue
 Wang Zisi
Supervisor:
Vitorovic Aleksandar
Deliverables and Task Assignment

Familiarize with Squall (1 week)

Implement the hypercube partitioning given the number of reducers (1 week)  Khayyam and Tam

– Evaluate the tradeoff between communication and computation: by trying different number of reducers (kr) for a hypercube join, validate the fact that there is an optimal kr.

– How it depends on the local join implementation?


Integrate it with Squall (1 week)  Khayyam and Tam

Local indexes for more efficient execution of equijoins, bandjoins and inequality joins in Squall (1 week)  Patrice

Experiments & Identify TPCH [6] queries which could be executed with out any communication among the machines of the same operator. Some of the nested queries certainly does not qualify. (3 weeks)  Khue and Zisi

Set up Squall/Storm in Microsoft Azure images, with a tutorial

Identify TPCH [6] queries which could be executed without any communication among the machines of the same operator. Some of the nested queries certainly does not qualify. For each TPCH query that requires communication among the machines of the same operator, precisely explain what needs to be communicated and between which machines.

evaluate the performance of a nway hypercubejoin (n = 3, 4) versus the corresponding pipeline of 2way onebucket joins.

What is the effect of the total number of tuples sent over the network (including the intermediate stages) on the total execution time?

How it depends on the local join implementation (SquallToaster, HashIndexes, Balanced Binary Tree)?

 Writeup (1 week)  All
Note that the assignment is tentatively. We can change during the semester based on the actual work load.
References
[1] “Squall: A distributed online processing system.” https://github.com/ epfldata/squall.
[2] “Mapreduce/mapreduce relational join in the coursera course introduction to data science.” https://class.coursera.org/datasci001/lecture/ preview.
[3] A. Okcan and M. Riedewald, “Processing thetajoins using mapreduce,” SIGMOD 2011.
[4] “Thetajoins on mapreduce.” https://www.youtube.com/watch?feature= player_detailpage&v=x4R2hbkA5Ik.
[5] X. Zhang, L. Chen, and M. Wang, “Efficient multiway thetajoin processing using mapreduce,” in VLDB, 2012.
[6] “The TPCH benchmark.” http://www.tpc.org/tpch/.
[7] “Squall local mode.” https://github.com/epfldata/squall/wiki/
QuickStart:LocalMode.
[8] S. Chaudhuri and V. Narasayya, “TPCD data generation with skew.”