Spark

Installation

Spark can be downloaded from the official website. One important piece of advice is to use Scala 2.10 instead of the latest 2.11, as Spark is currently quite buggy with Scala 2.11.

Spark can be started and stopped simply by running the scripts "$SPARK_HOME/sbin/start-all.sh" and "$SPARK_HOME/sbin/stop-all.sh", respectively.
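After starting the cluster, a quick sanity check (a sketch, assuming the standalone daemons run on the local machine) is to list the running Java processes with jps:

$SPARK_HOME/sbin/start-all.sh
jps    # should list a Master and at least one Worker process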

Accessing data on HDFS

We first need to get the addresses of the Hadoop and Spark masters, which can be found on their respective web UIs.

In my case, these are "localhost:9000" for the Hadoop master and "spark://ubuntu:7077" for the Spark master.

From the Spark shell

By using the Spark shell, we can access data on HDFS like this:

spark-shell --master <spark master address>
val lines = sc.textFile("hdfs://<hadoop master address>/<path>")
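
For example, with the addresses found above (and assuming a text file actually exists at <path>), a quick check that the data is readable could look like this; the count and first calls are just illustrative actions:

spark-shell --master spark://ubuntu:7077
val lines = sc.textFile("hdfs://localhost:9000/<path>")
lines.count()   // number of lines in the file
lines.first()   // first line, to eyeball the content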
From an application

Accessing data from an application is a bit more tedious. One easy way to do this is to use sbt, which can be downloaded here. Then, you need to create a build.sbt file like this in your project folder:

name := "pName"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
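
sbt expects the application sources under src/main/scala by default, so a minimal project might be laid out as follows (the file name Main.scala is just an example, chosen to match the --class Main flag used below):

pName/
├── build.sbt
└── src/
    └── main/
        └── scala/
            └── Main.scala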

You can then build the project, package it into a JAR, and run it with the spark-submit command:

sbt compile
sbt package
spark-submit --class Main --master <spark master address> target/scala-2.10/pname_2.10-1.0.jar
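
The path of the JAR follows from build.sbt: sbt packages it as target/scala-<scala binary version>/<lowercased name>_<scala binary version>-<version>.jar, which gives pname_2.10-1.0.jar here. With the master address found earlier, the command would for example be:

spark-submit --class Main --master spark://ubuntu:7077 target/scala-2.10/pname_2.10-1.0.jar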

One last thing to note: because the master is set in the spark-submit command, there is no need to set it again in the application code. So the application can just start with:

val conf = new SparkConf().setAppName("AFancyAppName")
val sc = new SparkContext(conf)
val lines = sc.textFile("hdfs://<hadoop master address>/<path>")
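
For completeness, here is a minimal sketch of the whole application (the object name Main matches the --class flag above; the count action and the println are just there to show that the RDD is usable):

import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    // The master is provided by spark-submit, so only the app name is set here
    val conf = new SparkConf().setAppName("AFancyAppName")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs://<hadoop master address>/<path>")
    println(lines.count())   // e.g. print the number of lines in the file
    sc.stop()
  }
}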