Squall cluster development with Wirbelsturm

 
This wiki page documents how to set up a cluster development environment for Squall. We use Wirbelsturm, a tool based on Vagrant and Puppet, to quickly provision a Storm cluster for development.

1) Install Wirbelsturm:

(For more detailed instructions, see Wirbelsturm’s README page.)
- Install Vagrant: http://www.vagrantup.com/
- Install VirtualBox: https://www.virtualbox.org/
- (Optional) Install GNU parallel: `brew install parallel` on a Mac with Homebrew (use the equivalent package on other systems)
- Clone the repository: https://github.com/miguno/wirbelsturm
- `cd` into the cloned directory and run `./bootstrap` (this only needs to be run once)
 
If all goes well, run `./deploy` to start the cluster. After this step, you will have a Storm cluster running on your local machine. The Storm UI can be accessed at http://localhost:28080/.
You can use `vagrant status` to list the running VMs. The Storm master is named “nimbus1”, and the two slaves are “supervisor1” and “supervisor2”. To ssh into one of these machines, run `vagrant ssh <hostname>`, for example `vagrant ssh nimbus1`.
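
The install and deploy steps above can be sketched as a small script. This is only a sketch: the `run` helper, the `DRY_RUN` variable, and the `setup_wirbelsturm` function are hypothetical names introduced here, and it assumes Vagrant, VirtualBox, and git are already installed.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Clone Wirbelsturm, bootstrap it once, start the cluster, and list the VMs.
# The body runs in a subshell so the caller's working directory is unchanged.
setup_wirbelsturm() (
  run git clone https://github.com/miguno/wirbelsturm &&
  run cd wirbelsturm &&
  run ./bootstrap &&
  run ./deploy &&
  run vagrant status
)
```

Calling `DRY_RUN=1 setup_wirbelsturm` only prints the commands, which is a cheap way to check them before running `setup_wirbelsturm` for real.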
 
Install Ruby 1.9.3 via rvm on Mac OS X Yosemite:
I had trouble installing it on Mac OS X Yosemite. The main problem is that Wirbelsturm’s bootstrap script relies on rvm to install Ruby 1.9.3, and this version can’t be compiled with Apple’s Clang. The solution is to install Ruby manually:
- Download and install GNU gcc. You can use `brew install gcc`.
- Explicitly use this gcc to compile Ruby: `CC=/usr/local/Cellar/gcc48/4.8.4/bin/gcc-4.8 rvm install 1.9.3`, where CC is the path to your gcc installation
- Run `rvm list` to check whether the current version of Ruby is 1.9.3. If not, switch to it with `rvm use 1.9.3`
- Run `gem install bundler` to install Bundler
- Then, in the Wirbelsturm installation folder, run:
     - `bundle install`
     - `./bootstrap --skip-ruby`
 

2) Install a local version of Squall and Storm 0.9.3:

You can follow the instructions at https://github.com/epfldata/squall/wiki/Quick-Start:-Cluster-Mode
The basic steps are:
- Clone the squall repository
- Download storm-0.9.3 and put it in `squall/storm-0.9.3`
- Go to `squall/bin` and run `./install.sh`. It installs the jar files from the `contrib` folder into your local Maven repository (`~/.m2`), which is necessary for the compilation step.
 
Modifications to use Storm 0.9.3 
The current Squall code uses Storm 0.9.2-incubating, but Wirbelsturm comes with 0.9.3.
- In `squall/deploy/project.clj`, change `storm-core "0.9.2-incubating"` to `storm-core "0.9.3"`
- In ResultsGenerator, change `import com.google.common.io.Files` to `import org.apache.storm.guava.io.Files`
- In `squall/bin/lein`, on line 121, change `$CLASSPATH:$LEIN_JAR` to `$LEIN_JAR:$CLASSPATH`. The Storm 0.9.3 dependencies contain a version of jline.jar that conflicts with the version Lein uses, so the Lein jar must come first on the classpath that Lein builds with.
 
- Finally, in `squall/bin`, run `./recompile.sh`. The squall-standalone.jar file should be produced in the `deploy` folder.
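
The first two source edits above can also be applied with sed. Here is a minimal sketch; `sed_in_place` and `patch_for_storm_093` are hypothetical helper names, and the path of the ResultsGenerator source file is passed in as an argument because it is not stated on this page. The change to `squall/bin/lein` is left out since it depends on a specific line of that script.

```shell
# Hypothetical helper: apply a sed script to a file via a temporary copy,
# which avoids the differing `sed -i` syntax on macOS and Linux.
sed_in_place() {
  script="$1"
  file="$2"
  sed -e "$script" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Apply the two source edits needed for Storm 0.9.3.
#   $1: path to project.clj
#   $2: path to the ResultsGenerator source file (location is an assumption)
patch_for_storm_093() {
  sed_in_place 's/storm-core "0\.9\.2-incubating"/storm-core "0.9.3"/' "$1" &&
  sed_in_place 's/import com\.google\.common\.io\.Files/import org.apache.storm.guava.io.Files/' "$2"
}
```

For example, `patch_for_storm_093 squall/deploy/project.clj path/to/ResultsGenerator.java`, where the second path is a placeholder for wherever the file lives in your checkout.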
 

3) Configure the Storm client to talk to the cluster:

- Create `~/.storm/storm.yaml`.
- Add the host name of the master node mentioned above:
`nimbus.host: "nimbus1"`
You can run `vagrant plugin install vagrant-hostsupdater` so that Vagrant automatically adds entries to your local machine’s /etc/hosts. Otherwise, look up the IP address of nimbus1 and use it in place of the host name.
- In `squall/bin/storm_env.sh`, comment out the MASTER variable, set STORM_INSTALL_DIR to your local Storm installation, and set STORMNAME to `storm-0.9.3` or whatever you named this folder. The rest can be ignored.
 
- Copy `contrib/jsqlparser-0.7.0.jar` to `storm/lib/jsqlparser-0.7.0.jar` in your local installation. This is necessary because the SQL query is parsed when the Squall topology is submitted with the Storm client, and Storm’s lib folder is on the client’s classpath.
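
Creating the client configuration can be scripted as well. The sketch below uses a hypothetical function `write_storm_yaml`, which defaults to `~/.storm` and `nimbus1` and refuses to overwrite an existing file; pass an IP address as the second argument if nimbus1 does not resolve on your machine.

```shell
# Write a minimal storm.yaml that points the Storm client at the master node.
#   $1: configuration directory (defaults to ~/.storm)
#   $2: host name or IP address of the master (defaults to nimbus1)
write_storm_yaml() {
  conf_dir="${1:-$HOME/.storm}"
  nimbus="${2:-nimbus1}"
  mkdir -p "$conf_dir" || return 1
  if [ -e "$conf_dir/storm.yaml" ]; then
    # Leave any existing client configuration untouched.
    echo "$conf_dir/storm.yaml already exists; not overwriting" >&2
    return 1
  fi
  printf 'nimbus.host: "%s"\n' "$nimbus" > "$conf_dir/storm.yaml"
}
```

Running `write_storm_yaml` with no arguments creates `~/.storm/storm.yaml` with the `nimbus.host` entry shown above.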
 

4) Deploy the dataset to the cluster:

The Squall installation comes with a ../test/data folder. When a query is run on the cluster, Squall’s DataSource component needs to be able to access this data.
 
- In the Wirbelsturm folder, update the Vagrantfile by adding the following entry:
`c.vm.synced_folder "SQUALL_INSTALLATION_FOLDER/test/data", "/data", create: true`
This instructs Vagrant to create a `/data` folder in the VM that is synced with the specified path on the host machine.
Run `vagrant reload` to reload the VMs.

 

5) Prepare the query configuration: 

 
Let’s say we want to run the Hyracks query. 
Create `0_01G_hyracks`, modeled on ../test/squall/confs/cluster/1G_hyracks, with the following changes:
- Change DIP_DATA_ROOT to /data/tpch/. This points to the synced folder we created previously.
- Change STORAGE_CLUSTER_DIR to /app/squall/storage.
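
The copy-and-edit step can be sketched as follows. Note the assumption: this treats each setting in the configuration file as a single line of the form `KEY value` separated by whitespace, which you should check against your copy of `1G_hyracks`. The function names `set_conf_key` and `make_cluster_conf` are hypothetical.

```shell
# Replace the value of a key in a conf file.
# Assumes each setting is one line of the form "KEY value" (whitespace-separated).
# A key that is not present in the file is not added.
set_conf_key() {
  file="$1"; key="$2"; value="$3"
  awk -v k="$key" -v v="$value" \
    '$1 == k { print k " " v; next } { print }' "$file" > "$file.tmp" &&
    mv "$file.tmp" "$file"
}

# Derive the new configuration from an existing one.
#   $1: source conf (e.g. test/squall/confs/cluster/1G_hyracks)
#   $2: destination conf (e.g. test/squall/confs/cluster/0_01G_hyracks)
make_cluster_conf() {
  cp "$1" "$2" &&
  set_conf_key "$2" DIP_DATA_ROOT /data/tpch/ &&
  set_conf_key "$2" STORAGE_CLUSTER_DIR /app/squall/storage
}
```

From the Squall folder this would be called as `make_cluster_conf test/squall/confs/cluster/1G_hyracks test/squall/confs/cluster/0_01G_hyracks`. The source file is left unchanged.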
 
Additionally, we have to create /app/squall/storage in the VMs. I do this manually, but it should be possible to automate it with the Vagrantfile.
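
Until that is automated in the Vagrantfile, the directory can be created from the host with a short loop over the VMs. This is a sketch under a few assumptions: it is run from the Wirbelsturm folder, `sudo` works without a password inside the VMs, and the resulting directory may still need its ownership or permissions adjusted for the user that runs the Storm workers. The `VAGRANT` variable is a hypothetical override for the vagrant executable.

```shell
# Create the Squall storage directory on the master and both supervisors.
# VAGRANT is a hypothetical override (e.g. VAGRANT=echo prints the commands).
create_storage_dirs() {
  for vm in nimbus1 supervisor1 supervisor2; do
    "${VAGRANT:-vagrant}" ssh "$vm" -c "sudo mkdir -p /app/squall/storage" || return 1
  done
}
```

Running `VAGRANT=echo create_storage_dirs` prints the three commands without touching the VMs.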

 

6) Run Squall Query: 

 
Inside the Squall folder, you can run the Hyracks query with:
 
`storm-0.9.3/bin/storm jar deploy/squall-0.2.0-standalone.jar ch.epfl.data.sql.main.ParserMain test/squall/confs/cluster/0_01G_hyracks`
 
The running topology will be displayed in the Storm UI. The topology will be killed after all data has been processed. The log files can be viewed at /opt/storm/logs in the VMs.