Tutorial: Setting up a Storm cluster on Azure Linux VMs

This guide details the steps to manually set up a Storm cluster on Microsoft Azure.
 
I) HDInsight Storm and why we don't use it:
Microsoft Azure offers the so-called HDInsight service, which enables quick deployment of clusters of big data platforms such as Hadoop, HBase and Storm. However, it has certain quirks:
- The cluster machines run on Windows. There appears to be a preview option for Ubuntu, but it is disabled at the moment.
- It is expensive. Billing is based on “computing hours”, which means costs accrue as long as the cluster is up. For around 2 days with a 1-node cluster doing nothing, it cost ≈ 50 CHF.
- There is no stop or suspend operation on an HDInsight cluster. The only way to stop the cluster is to delete it, which is not reasonable given the time it takes to re-provision the cluster (20–30 minutes). Local data also can't be persisted after “suspending” the cluster in this way.
- It is not flexible (e.g., opening ports, creating the squall tmp folder, submitting jars, …).
 
So creating the Storm cluster by deploying virtual machines on Azure and configuring them ourselves is the better alternative. In particular, the fact that a VM can be shut down and incurs no charges while stopped is great for students on a budget.
 
II) Setting up the cluster infrastructure on Azure
This setup:
- Uses Ubuntu 12.04 LTS (Precise).
- Uses 4 machines: one for Zookeeper and three for the Storm cluster, with one nimbus (master) and two supervisors (slaves). More machines can be added as supervisors to the Storm cluster or as nodes to the Zookeeper cluster.
 
1) Create a storage account: 
In the Azure Portal left panel -> Storage -> New -> Quick Create to create a storage account. For this tutorial, select “West Europe”. The virtual machines that use this storage account have to be in the same region.
For a production environment, storage accounts should be created across different regions for high availability.
 
2) Cloud services
An Azure Cloud Service is similar to a web gateway with load balancing and port forwarding, and it also has a configurable public domain name.
On Azure, every VM has to be configured with a Cloud Service, even though not all the VMs need to serve public requests. For example, we need the Storm UI and Nimbus to be accessible from outside the virtual network, but the Zookeeper and supervisor machines can stay internal.
 
For this tutorial, we will create a cloud service for each machine for the convenience of being able to ssh into each of them. We will do this when we create the VMs, so there is nothing to do here. (For a production environment, instead of opening the ssh port on the VMs’ cloud services, we should configure a VPN or provide access from certain machines only.)
 
3) Virtual Network: 
In order for the VMs to communicate with each other internally, they have to be in the same virtual network.
In the Azure Portal left panel -> Network -> New -> Create a virtual network. Use the default IP range and select the “West Europe” region.
For this tutorial, we don't need to configure a DNS server or a VPN.
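If you prefer the command line, the classic Azure cross-platform CLI can create the network as well. A sketch, assuming the ASM-mode azure CLI of this era; the exact flags may differ between CLI versions:

# create a virtual network named storm-vnet (the name is our choice) in West Europe
azure network vnet create storm-vnet --location "West Europe"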
 
4) Virtual Machines:
In the Azure Portal left panel -> Virtual Machines -> New -> From Gallery -> Select Ubuntu 12.04. Follow the steps to create the machine:
- In the region/network option, select the virtual network created above.
- Specify a username with a password or a certificate (here we use the default name “azureuser”).
- Open the necessary ports as described below.
 
In this tutorial we created 4 machines:
Name: nimbus1
Cloud service DNS: efnimbus1.cloudapp.net
Port Forwarding: 22 -> 22 (for SSH), 28080 -> 8080 (for Storm UI), 6627 -> 6627 (for Nimbus Thrift server)
 
Name: zookeeper1
Cloud service DNS: efzookeeper1.cloudapp.net
Port Forwarding: 22 -> 22 (for SSH)
 
Similarly for the other 2 machines, supervisor1 and supervisor2: we also open port 22 on these machines.
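Endpoints (port forwardings) can also be added after VM creation with the classic Azure CLI. A sketch, assuming ASM mode; the syntax may differ between CLI versions:

# forward public port 28080 on nimbus1's cloud service to VM port 8080 (Storm UI)
azure vm endpoint create nimbus1 28080 8080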
 
III) Storm Installation
 
1) Networking:
As the VMs are in the same virtual network, they can communicate with each other. 
Check the Nimbus1 and Zookeeper1 machines’ IP addresses and add them to the /etc/hosts file of all machines:
 
10.0.0.4 zookeeper1
10.0.0.5 nimbus1
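To push these entries to every machine at once from your workstation, something like the loop below works (a sketch; the efsupervisor1/efsupervisor2 DNS names are placeholders for whatever cloud service names you chose for the supervisor machines):

# append the internal addresses to /etc/hosts on all four machines
for h in efnimbus1 efzookeeper1 efsupervisor1 efsupervisor2; do
  ssh azureuser@$h.cloudapp.net 'sudo tee -a /etc/hosts' <<'EOF'
10.0.0.4 zookeeper1
10.0.0.5 nimbus1
EOF
done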
 
Ideally, we should create a DNS server and specify it when creating the virtual network. 
 
2) Setup ZooKeeper machine
 
ssh to the machine and provide the password:
ssh azureuser@efzookeeper1.cloudapp.net
 
Install Zookeeper: 
In this tutorial we install Zookeeper from Cloudera’s repository. Alternatively, it can be downloaded and installed from the Apache site. See: http://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html
 
- To add the Cloudera repository and install Zookeeper:
wget http://archive.cloudera.com/cdh4/one-click-install/precise/amd64/cdh4-repository_1.0_all.deb
sudo dpkg -i cdh4-repository_1.0_all.deb
sudo apt-get update
sudo apt-get install zookeeper zookeeper-server
(The current zookeeper distribution version is 3.4.5)
 
- Configure Zookeeper in /etc/zookeeper/conf/zoo.cfg. The configuration is simple for a standalone Zookeeper node; we can use the default configuration. Add the two lines below for purging old logs (important for a production system).
# Enable regular purging of old data and transaction logs every 24 hours
autopurge.purgeInterval=24
autopurge.snapRetainCount=5
 
- For a multi server setup, refer to: http://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_zkMulitServerSetup
 
- After the fresh installation, we need to initialize the zookeeper-server
sudo service zookeeper-server init
More Details at: http://www.cloudera.com/content/cloudera/en/documentation/cdh4/v4-5-0/CDH4-Installation-Guide/cdh4ig_topic_21_3.html
 
- Start Zookeeper-server
sudo service zookeeper-server start
 
- To verify:
echo ruok | nc localhost 2181
echo stat | nc localhost 2181
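For a healthy standalone node, ruok answers imok and stat reports Mode: standalone, roughly:

$ echo ruok | nc localhost 2181
imok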
 
It is recommended to run Zookeeper under supervision: http://zookeeper.apache.org/doc/r3.4.6/zookeeperAdmin.html#sc_supervision
The end of this tutorial covers how to run Zookeeper under supervision.
 
3) Setup Storm Machine
 
ssh to the machine: ssh azureuser@efnimbus1.cloudapp.net
 
Install Java from the Oracle repository:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install oracle-java7-installer
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
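Note that the export above only lasts for the current shell session. To make JAVA_HOME persistent across logins (a minimal sketch; the profile.d path is standard on Ubuntu):

echo 'export JAVA_HOME=/usr/lib/jvm/java-7-oracle' | sudo tee /etc/profile.d/java.sh
java -version   # sanity check: should report a 1.7.0_xx Oracle JVM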
 
Create a “storm” group and user to run the Storm services:
sudo groupadd -g 53001 storm
sudo useradd -u 53001 -g 53001 -s /bin/bash storm
 
Install Storm to /opt/storm: 
 
cd /tmp/
wget http://mirror.easyname.ch/apache/storm/apache-storm-0.9.3/apache-storm-0.9.3.tar.gz
tar -xzvf apache-storm-0.9.3.tar.gz
sudo mv apache-storm-0.9.3 /opt/
cd /opt/
sudo ln -sv apache-storm-0.9.3 storm
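To sanity-check the installation, and optionally hand ownership to the storm user created above (our choice, not required by Storm):

sudo chown -R storm:storm /opt/apache-storm-0.9.3
/opt/storm/bin/storm version   # should print 0.9.3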
 
Storm Configuration: 
- Create local working directory for storm
sudo mkdir -p /app/storm
sudo chown -R storm:storm /app/storm
sudo chmod 750 /app/storm
 
- Configure Storm at /opt/storm/conf/storm.yaml. The important configurations are:
     - Specify zookeeper host
     - Specify nimbus host
     - Use the Netty transport for Storm messaging, which is available since Storm 0.9 and is faster than the ZeroMQ transport used in previous versions: http://storm.apache.org/2013/12/08/storm090-released.html
 
storm.zookeeper.servers:
  - "zookeeper1"
nimbus.host: "nimbus1"
storm.local.dir: "/app/storm"
storm.local.hostname: "nimbus1"
...
# Messaging backend for inter-task communication
# "backtype.storm.messaging.netty.Context" -- use Netty
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
 
- The full configuration files are provided in the appendix.
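On the supervisor machines, the same storm.yaml is used; only storm.local.hostname should name the machine itself. For example, on supervisor1:

storm.local.hostname: "supervisor1"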
 
 
Create a writable logs folder for the storm user. By default, Storm writes its logs to the logs folder inside the installation directory; alternatively, set -Dlogfile.name= in storm.yaml:
 
sudo mkdir /opt/storm/logs
sudo chmod 777 /opt/storm/logs
 
Manually test the installation:
sudo su storm
# for nimbus
/opt/storm/bin/storm nimbus
/opt/storm/bin/storm ui
# for supervisor
/opt/storm/bin/storm supervisor
 
It is fine if no exception is thrown.
 
The Storm UI can be accessed on port 28080, which is forwarded to port 8080 on Nimbus1:
http://efnimbus1.cloudapp.net:28080/index.html
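The UI also exposes a REST API (available since Storm 0.9.2), which is handy for a headless check:

# should return a JSON cluster summary
curl -s http://efnimbus1.cloudapp.net:28080/api/v1/cluster/summary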
 
 
4) Running Storm and Zookeeper daemons under supervision:
 
"It is critical that you run each of these daemons under supervision. Storm is a fail-fast system which means the processes will halt whenever an unexpected error is encountered. Storm is designed so that it can safely halt at any point and recover correctly when the process is restarted. This is why Storm keeps no state in-process -- if Nimbus or the Supervisors restart, the running topologies are unaffected."
 
There are many supervision tools. Here we use supervisor, http://supervisord.org/installing.html (note that it is different from Storm's supervisor daemon).
On every VM, run:
 
sudo apt-get install supervisor
 
We will use supervisor to start the necessary daemons: storm nimbus and storm ui on Nimbus1, storm supervisor on Supervisor1 and Supervisor2, and zookeeper-server on Zookeeper1.
 
Prepare the logging folder for the supervisor tool:
# on storm machines:
sudo mkdir -p /var/log/supervisor/storm
sudo chown -R storm:storm /var/log/supervisor/storm
# on zookeeper machine:
sudo mkdir -p /var/log/supervisor/zookeeper
sudo chown -R zookeeper:zookeeper /var/log/supervisor/zookeeper
 
The daemon configurations can be created as *.conf files and put in /etc/supervisor/conf.d/. The supervisor tool starts these daemons when it is started. For example, on the Zookeeper machine, we create:
 
sudo vi /etc/supervisor/conf.d/zookeeper.conf
 
The configuration: 
[program:zookeeper]
command=/usr/bin/zookeeper-server start-foreground
...
stdout_logfile=/var/log/supervisor/zookeeper/zookeeper.out
 
For the full files, refer to the Appendix. Restart the supervisor service:
 
sudo service supervisor stop
sudo service supervisor start
# check running daemon by this command
sudo supervisorctl status
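On Nimbus1, a healthy setup looks roughly like this (pids and uptimes will differ):

$ sudo supervisorctl status
storm-nimbus      RUNNING   pid 1234, uptime 0:10:02
storm-ui          RUNNING   pid 1240, uptime 0:10:02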
 
Again, to verify that everything works and the Storm supervisors register properly, check:
http://efnimbus1.cloudapp.net:28080/index.html
 
5) Submit Storm Application: 
To submit a storm application from a client machine, you must have a local Storm installation on that machine.
Create the file ~/.storm/storm.yaml with the nimbus host configuration:
 
nimbus.host: "efnimbus1.cloudapp.net"
 
Then run the bin/storm jar $PATH_TO_JAR_FILE $TOPOLOGY_MAIN_CLASS command to submit the jar file to Nimbus and start the topology.
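For example, to submit the WordCountTopology from the storm-starter project under the name wordcount-demo (the jar file name below is illustrative; use whatever your storm-starter build produces):

bin/storm jar storm-starter-topologies-0.9.3.jar storm.starter.WordCountTopology wordcount-demo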
 
 
IV) Software Update: 
The update task depends on what update is required. Most of the installations here are performed via the package manager; therefore, updating these packages should be straightforward.
 
For updating Storm to a newer version, simply point the /opt/storm symlink to another Storm installation.
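A sketch of such an upgrade, assuming a hypothetical apache-storm-0.9.4 release already extracted to /opt:

cd /opt
sudo ln -sfn apache-storm-0.9.4 storm              # atomically repoint the symlink
sudo supervisorctl restart storm-nimbus storm-ui   # restart the daemons on the new version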
 
References:
https://storm.apache.org/documentation/Setting-up-a-Storm-cluster.html
http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/
 
 
V) APPENDIX: Configuration Files:
1) Zookeeper configuration file:
/etc/zookeeper/conf/zoo.cfg
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
autopurge.purgeInterval=24
autopurge.snapRetainCount=5
 
2) Storm configuration files: 
$STORM_INSTALLATION/conf/storm.yaml
 
storm.zookeeper.servers:
  - "zookeeper1"
 
nimbus.host: "nimbus1"
storm.local.dir: "/app/storm"
 
drpc.childopts: "-Xmx256m -Djava.net.preferIPv4Stack=true"
logviewer.childopts: "-Xmx128m -Djava.net.preferIPv4Stack=true"
nimbus.childopts: "-Xmx1024m -Djava.net.preferIPv4Stack=true"
ui.childopts: "-Xmx512m -Djava.net.preferIPv4Stack=true"
supervisor.childopts: "-Xmx512m -Djava.net.preferIPv4Stack=true"
worker.childopts: "-Xmx1536m -Djava.net.preferIPv4Stack=true"
 
# Define the number of workers that can run on this machine.
# Each worker is assigned a port to use for communication.
supervisor.slots.ports:
- 6700
- 6701
- 6702
- 6703
 
# Messaging backend for inter-task communication
# "backtype.storm.messaging.netty.Context" -- use Netty
storm.messaging.transport: "backtype.storm.messaging.netty.Context"
 
3) Supervisor configuration files: 
 
/etc/supervisor/conf.d/storm-nimbus.conf
[program:storm-nimbus]
command=/opt/storm/bin/storm nimbus
numprocs=1
numprocs_start=0
priority=999
autostart=true
autorestart=true
startsecs=10
startretries=999
exitcodes=0,2
stopsignal=KILL
stopwaitsecs=10
stopasgroup=false
directory=/
user=storm
redirect_stderr=false
stdout_logfile=/var/log/supervisor/storm/nimbus.out
stdout_logfile_maxbytes=20MB
stdout_logfile_backups=5
stderr_logfile=/var/log/supervisor/storm/nimbus.err
stderr_logfile_maxbytes=20MB
stderr_logfile_backups=10
environment=
 
/etc/supervisor/conf.d/storm-ui.conf
[program:storm-ui]
command=/opt/storm/bin/storm ui
numprocs=1
numprocs_start=0
priority=999
autostart=true
autorestart=true
startsecs=10
startretries=999
exitcodes=0,2
stopsignal=TERM
stopwaitsecs=10
stopasgroup=false
directory=/
user=storm
redirect_stderr=false
stdout_logfile=/var/log/supervisor/storm/ui.out
stdout_logfile_maxbytes=20MB
stdout_logfile_backups=5
stderr_logfile=/var/log/supervisor/storm/ui.err
stderr_logfile_maxbytes=20MB
stderr_logfile_backups=10
environment=
 
/etc/supervisor/conf.d/storm-supervisor.conf
[program:storm-supervisor]
command=/opt/storm/bin/storm supervisor
numprocs=1
numprocs_start=0
priority=999
autostart=true
autorestart=true
startsecs=10
startretries=999
exitcodes=0,2
stopsignal=KILL
stopwaitsecs=10
stopasgroup=true
directory=/
user=storm
redirect_stderr=false
stdout_logfile=/var/log/supervisor/storm/supervisor.out
stdout_logfile_maxbytes=20MB
stdout_logfile_backups=5
stderr_logfile=/var/log/supervisor/storm/supervisor.err
stderr_logfile_maxbytes=20MB
stderr_logfile_backups=10
environment=
 
/etc/supervisor/conf.d/zookeeper.conf
[program:zookeeper]
command=/usr/bin/zookeeper-server start-foreground
numprocs=1
numprocs_start=0
priority=999
autostart=true
autorestart=true
startsecs=10
startretries=999
exitcodes=0,2
stopsignal=INT
stopwaitsecs=10
stopasgroup=true
directory=/
user=zookeeper
redirect_stderr=false
stdout_logfile=/var/log/supervisor/zookeeper/zookeeper.out
stdout_logfile_maxbytes=20MB
stdout_logfile_backups=5
stderr_logfile=/var/log/supervisor/zookeeper/zookeeper.err
stderr_logfile_maxbytes=20MB
stderr_logfile_backups=10