Meeting Notes 2015.03.10
https://docs.google.com/document/d/1-ZacPBFFL8UweusyPBi53KeyHFYPY5aXc3TZVPgUzL4/edit?usp=sharing
I. Progress this week:
1) Install, run, and test Squall in local mode:
- Run different configs successfully
- Run with the IDE; troubleshooting with Eclipse Luna:
- If you create a new Java project in a different location, you must copy the squall/test/ folder or configure the resource paths to satisfy the config file:
DIP_DATA_ROOT ../test/data/tpch/
DIP_SQL_ROOT ../test/squall/sql_queries/
DIP_SCHEMA_PATH ../test/squall/schemas/tpch.txt
DIP_RESULT_ROOT ../test/results/
- Understand the query plan and topology
2) Understand existing Squall libraries:
- One-bucket partitioning: plan_runner/thetajoin/matrix_mapping/EquiMatrixAssignment.java
- compute() calculates the region dimensionality given the size of the join matrix (sizeS, sizeT) and the number of reducers (r). The output is the number of reducer divisions for each dimension (see the sketch below).
- createRegionMatrix() assigns a partition to each corresponding cell
- getRegionIDs() returns the list of region IDs for a dimension (row or column)
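A rough sketch of the idea behind compute(), based only on the description above (the class name, method name, and cost model are our assumptions, not the actual Squall code): enumerate the factorizations of r and keep the one with the lowest replication cost.

// Sketch: pick the number of reducer divisions per dimension for a
// 2-D join matrix of sides sizeS x sizeT, using exactly r regions.
// Each S tuple is replicated to 'cols' regions and each T tuple to
// 'rows' regions, so we minimize sizeS * cols + sizeT * rows.
public class DimensionalitySketch {
    static int[] computeDimensionality(long sizeS, long sizeT, int r) {
        int bestRows = 1, bestCols = r;
        long bestCost = Long.MAX_VALUE;
        for (int rows = 1; rows <= r; rows++) {
            if (r % rows != 0) continue;   // keep rows x cols = r exact
            int cols = r / rows;
            long cost = sizeS * cols + sizeT * rows;
            if (cost < bestCost) { bestCost = cost; bestRows = rows; bestCols = cols; }
        }
        return new int[] { bestRows, bestCols };
    }

    public static void main(String[] args) {
        // |S| = 4 * |T| with r = 16 reducers -> 8 row bands x 2 column bands
        int[] d = computeDimensionality(4000, 1000, 16);
        System.out.println(d[0] + " x " + d[1]);
    }
}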
3) TPCH dataset:
- 22 queries in DBGen.zip/tpch_2_17_0/dbgen/queries/
4) Fork the Squall repo - https://github.com/khgl/squall.git
- This is our own copy, so we can easily work on it. (Please create a new branch and work on it, in order to keep the master branch clean.)
5) HyperCube partitioning is not a trivial task:
- We should start with just cube partitioning: given 3 relations and a number of reducers.
6) The StormThetaJoin class uses the MatrixAssignment interface. For higher dimensions there would be many changes, because we cannot continue with a simple Java array structure (see the sketch below).
- This will affect ch.epfl.data.plan_runner.operators.
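To make the array point concrete, one common workaround we could use (illustrative, not existing Squall code): keep the n-dimensional region matrix in a flat array and compute cell indices by row-major flattening, so the code no longer depends on Java's fixed-arity array syntax (int[][], int[][][], ...).

public class FlatRegionMatrix {
    // Maps an n-dimensional cell coordinate to an index into a flat
    // array of length dims[0] * dims[1] * ... * dims[n-1] (row-major).
    static int flatten(int[] coords, int[] dims) {
        int index = 0;
        for (int i = 0; i < coords.length; i++) {
            index = index * dims[i] + coords[i];
        }
        return index;
    }
}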
7) We should change our deliverables plan:
+ It is not reasonable to expect to finish both hypercube partitioning and Squall integration in 2 weeks.
8) Tentative plan to implement brute-force partitioning for the hypercube:
- Define an interface equivalent to MatrixAssignment, e.g. CubeAssignment, HyperCubeDimension
- Define a flexible number of dimensions (not using the enum MatrixAssignment.Dimensionality)
- Define a list of region dimensions, i.e. “the number of reducers” for each dimension
- Write the compute function for partitioning (a brute-force sketch follows the algorithm below)
- Tentative algorithm:
Step 1: solve the following equations:
- r_d1 x r_d2 x … x r_dn = r (the number of reducers)
- S_d1 x r_d2 x r_d3 x … x r_dn = S_d2 x r_d1 x r_d3 x … x r_dn = …
Example 1: matrix partition: R1 x R2, and a pre-defined number of reducers r
We have the equations:
r_1 x r_2 = r
|R1| x r_2 = |R2| x r_1
⇒ r_1^2 = r x |R1| / |R2|
Example 2: cube partition: R1 x R2 x R3 and a pre-defined number of reducers: r
We have the equations:
r_1 x r_2 x r_3 = r
|R1| x r_2 x r_3 = |R2| x r_1 x r_3 = |R3| x r_1 x r_2
⇒ r_1^3 = r x |R1|^2 / (|R2| x |R3|)
Step 2: Verify with the computation cost (p_d1 x p_d2 x … x p_dn) and the communication cost (p_d1 + p_d2 + … + p_dn), where p_di is the size of the largest partition (region) in dimension d_i.
Step 3: Of course we have to deal with fractional cases, but we can group the remaining cells into the same partition. For example, with S_d1 = 137 tuples, S_d2 = 391 tuples, …, the equations will not give integer numbers, so we have to do some extra steps to round up and cover the missing cells.
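A brute-force sketch of the whole plan above (our assumptions: relation sizes are known up front, exactly r reducers are used, and we minimize only the communication cost from Step 2; class and method names are illustrative, not existing Squall code). Enumerating integer factorizations of r also sidesteps the fractional issue from Step 3.

import java.util.Arrays;

public class BruteForceHyperCube {
    static long[] sizes;   // |R_1| ... |R_n|
    static int[] best;     // best r_1 ... r_n found so far
    static long bestCost;

    static int[] compute(long[] relationSizes, int r) {
        sizes = relationSizes;
        best = null;
        bestCost = Long.MAX_VALUE;
        search(new int[sizes.length], 0, r);
        return best;
    }

    // Fill divisions[d..] with factors whose product is 'remaining'.
    static void search(int[] divisions, int d, int remaining) {
        if (d == divisions.length - 1) {
            divisions[d] = remaining;   // the last factor is forced
            long c = cost(divisions);
            if (c < bestCost) { bestCost = c; best = divisions.clone(); }
            return;
        }
        for (int f = 1; f <= remaining; f++) {
            if (remaining % f != 0) continue;
            divisions[d] = f;
            search(divisions, d + 1, remaining / f);
        }
    }

    // Communication cost: a tuple of R_i goes to every region sharing
    // its coordinate in dimension i, i.e. prod_{j != i} r_j copies.
    static long cost(int[] divisions) {
        long total = 0;
        for (int i = 0; i < sizes.length; i++) {
            long copies = 1;
            for (int j = 0; j < divisions.length; j++)
                if (j != i) copies *= divisions[j];
            total += sizes[i] * copies;
        }
        return total;
    }

    public static void main(String[] args) {
        // Three equal-size relations and r = 27 reducers -> [3, 3, 3],
        // matching Example 2: r_1^3 = 27 * |R1|^2 / (|R2| x |R3|) = 27.
        System.out.println(Arrays.toString(compute(new long[]{1000, 1000, 1000}, 27)));
    }
}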
II. Discussions in the meeting:
Codebase plan:
+ Multi-way version of the ThetaJoinStaticComp class: change it or add a new class (e.g. MultiJoinComponent)
+ Multi-way version of the StormThetaJoin class: define a new class (e.g. StormMultiJoin)
+ Multi-way version of MatrixAssignment: change it or define a new class (e.g. HyperCubeAssignment)
+ Multi-way version of ThetaJoinStaticMapping: changes or a new class
+ Implement CustomStreamGrouping? (see the sketch below)
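If we do implement CustomStreamGrouping (Storm 0.9's backtype.storm.grouping.CustomStreamGrouping interface), it could route each tuple to all tasks of its hypercube regions. A sketch under heavy assumptions: HyperCubeGrouping and its routing policy are ours (not from Squall), the join bolt is assumed to have exactly r_1 x … x r_n tasks, and a tuple takes a random coordinate on its own dimension while being replicated across all other dimensions.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;

public class HyperCubeGrouping implements CustomStreamGrouping {
    private final int dim;          // dimension of this stream's relation
    private final int[] divisions;  // r_1 ... r_n, reducer divisions per dimension
    private List<Integer> targetTasks;
    private Random rnd;

    public HyperCubeGrouping(int dim, int[] divisions) {
        this.dim = dim;
        this.divisions = divisions;
    }

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
        this.rnd = new Random();
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        // random coordinate on the tuple's own dimension; replicate
        // across every coordinate of all other dimensions
        int own = rnd.nextInt(divisions[dim]);
        List<Integer> tasks = new ArrayList<Integer>();
        collect(tasks, new int[divisions.length], 0, own);
        return tasks;
    }

    // Enumerate every region whose coordinate on 'dim' equals 'own'.
    private void collect(List<Integer> out, int[] coord, int d, int own) {
        if (d == coord.length) {
            out.add(targetTasks.get(flatten(coord)));
            return;
        }
        int from = (d == dim) ? own : 0;
        int to = (d == dim) ? own : divisions[d] - 1;
        for (int c = from; c <= to; c++) {
            coord[d] = c;
            collect(out, coord, d + 1, own);
        }
    }

    // Row-major flattening of a region coordinate into a task index.
    private int flatten(int[] coord) {
        int index = 0;
        for (int i = 0; i < coord.length; i++) {
            index = index * divisions[i] + coord[i];
        }
        return index;
    }
}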
Understanding:
- StormThetaJoin = partitioning (assignment) + local join index (processNonLastTuple, performJoin)
- ThetaTPCH9Plan
- StormThetaJoin: indexes, join predicate, isContentSensitive, acknowledgement, topology
III. Plan for next week:
- Tentative interface and implementation for multi-way versions of MatrixAssignment and StormThetaJoin.
- Set up a local Storm cluster to get familiar with it, in preparation for setting up the experiment environment on Microsoft Azure.