- français
- English
Meeting Notes 2015.03.17
Source: https://docs.google.com/document/d/18ehqmRDrfxgGV6ZCT2LbH_107T6Fqz-kliQwfONTg3k/edit?usp=sharing
I. Progress done this week:
1. HyperCube partitioning:
-
ch.epfl.data.plan_runner.thetajoin.matrix_mapping
-
HyperCubeAssignment.java ✔ (general interface for n-way partitioning)
-
CubeNAssignment.java ✔(brute implementation for n-way partitioning = find best partition as hypercubes)
-
Utilities.java ✔(add some utilities during implementation)
-
CubeAssignmentAdapter.java ✔(old implementation for brute-force 2,3,4-way partitioning)
-
CellIterator.java, CombinationIterator.java, SetCombinationIterator.java ✔ (some iterators part of the implementation of brute-force n-way partitioning)
-
-
Testcases: input ([] sizes, r) --> output (int[] _rd) where sizes are the size each dimension, r is the number of reducers, _rd is the #reducers by each dimension
-
13, 7, 1 → 1, 1
-
4, 4, 4, 8 → 2, 2, 2
-
[4, 4, 4, 4], 16 → 2-2-2-2
-
[8, 4, 10, 7], 1000 → 5-4-10-5
-
2. Squal and HyberCube integration :
-
plan_runner.components.hyper_cube;
-
HyperCubeJoinComponentFactory ✔
-
HyperCubeJoinStaticComponent ✔
-
-
plan_runner.storm_components.hyper_cube;
-
StormHyperCubeJoin ✖
-
-
plan_runner.utilities.MyUtilities;
-
thetaAttachEmitterComponents ✖
-
thetaAttachEmitterComponentsWithInterChanging ✖
-
-
plan_runner.utilities.hypercube_static;
-
HyperCubeJoinStaticMapping ✖
-
-
plan_runner.predicates;
-
Predicate - I am not sure about this class, can we use as its or modify ✖
-
-
plan_runner.visitors;
-
PredicateCreateIndexesVisitor ✖
-
PredicateVisitor ✖
-
3. Set up Squall cluster
-
Set up Squall cluster
-
Run sample Hyracks query with this cluster
-
Explore https://github.com/miguno/wirbelsturm for fast deployment of the cluster
-
Build the code with storm-0.9.3 instead of 0.9.2 incubating
-
Documented the process in the wiki page: https://wiki.epfl.ch/bigdata2015-hypercubejoins/squallclusterdevelopment
II. Discussions in the meeting:
1. Task Assignment
-
Local index task: it means generalizing the existing indexes for multi-way join process (i.e. refactoring), no need to implement new indexing techniques. Detail: reimplement the visitor on the Predicate
2. Hypercube Partitioning:
-
Perform experiment with computation time: vary the number of relations (<20), and the number of machines (<2000)
-
Deal with poorly factorized numbers of machines (e.g. the number with no prime factor or power of 2). Tentative solution: find the nearest number with lots of prime factor
-
Best metric: max #tuples per machine (i.e. consider only computation cost and not communication cost?). Input: relation sizes, #joiners, #dimensions
3. Multi-way Theta join
-
Understanding
-
Component = unit of parallelization
-
Each partition performs a join with a chain of single-relation operators (select, agg, project)
-
joinParams (list of columns)
-
joinPredicate: now for n-relation, e.g. P_1 P_2 … P_{n-1} where P_i = R_i \join R_{i+1}
-
createIndexes()
-
joinPredicate.accept(visitor)
-
visitor creates a list of indexes for each relation, then using these to performJoin and processNonLastTuple, output is a set of tuples
-
Example: R.A = S.A & S.B <= T.C
-
Hash on R.A
-
Hash on S.A and Btree on S.B
-
Btree on T.C
-
Concrete value: R(5) \join S(5) \join T(5, + \infinity)
b. Refactoring problems:
P1) create indexes: visitors, get inner expression, get type, look at invocations, look at sql package.
predicate: ColumnReference, Value Specification, e.g. R.A = S.A + 5
P2) order of joins
P3) visitors on predicate, indexes on different joins
Assumptions: R \join S \join T, only 1 ColumeReference on each side of a join expression?
P4) Update indexes:
-
valuesToApplyOnIndex
-
Need to know key, value
-
PredicateUPdateVisitor
-
valuesToIndex
-
Types of values Index
-
select tuple: join expression will be evaluated in a concrete-value basis.
-
selectTupleToJoin: 8 < T.C (if S comes first) or S.A + 5 < 8 (if T comes first)
-
process: R.A = 8 (1 tuple) → look up S.A = 8 (n tuples) → eval S \join T → lookup T.C
III. Plan for next week:
-
Define tasks for Patrice and Charles: 1 for local indexes, 1 for refactoring codebase with Khayyam, and someone for experiment tasks.
-
General potential changes in the codebase as a result of refactoring
-
Preliminary experiment on the hypercube partitioning.
-
Open Window Azure free one-month trial account and set up storm cluster
-
Identify TCPH queries, which queries involve communication between nodes.