Spark

Installation

Spark can be downloaded from the official website. One important piece of advice is to use Scala 2.10 instead of the latest 2.11, as Spark is currently quite buggy with Scala 2.11.

Spark can be started and stopped simply by running the scripts "$SPARK_HOME/sbin/start-all.sh" and "$SPARK_HOME/sbin/stop-all.sh", respectively.
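After starting the cluster, a quick sanity check (a sketch, assuming the standalone daemons run on the local machine) is to list the running Java processes with jps:

$SPARK_HOME/sbin/start-all.sh
jps    # should list a Master and at least one Worker process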

Accessing data on HDFS

We first need to get the addresses of the Hadoop and Spark masters, which can be found on their respective web UIs.

In my case, these are "localhost:9000" for the Hadoop master and "spark://ubuntu:7077" for the Spark master.

From the Spark shell

By using the Spark shell, we can access data on HDFS like this:

spark-shell --master <spark master address>
val lines = sc.textFile("hdfs://<hadoop master address>/<path>")
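
For example, with the addresses found above (and assuming a text file actually exists at <path>), a quick check that the data is readable could look like this; the count and first calls are just illustrative actions:

spark-shell --master spark://ubuntu:7077
val lines = sc.textFile("hdfs://localhost:9000/<path>")
lines.count()   // number of lines in the file
lines.first()   // first line, to eyeball the content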
From an application

Accessing data from an application is a bit more tedious. One easy way to do this is to use sbt, which can be downloaded here. Then, you need to create a build.sbt file like this in your project folder:

name := "pName"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
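
sbt expects the application sources under src/main/scala by default, so a minimal project might be laid out as follows (the file name Main.scala is just an example, chosen to match the --class Main flag used below):

pName/
├── build.sbt
└── src/
    └── main/
        └── scala/
            └── Main.scala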

You can then build the project, package it into a JAR, and run it with the spark-submit command:

sbt compile
sbt package
spark-submit --class Main --master <spark master address> target/scala-2.10/pname_2.10-1.0.jar
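
The path of the JAR follows from build.sbt: sbt packages it as target/scala-<scala binary version>/<lowercased name>_<scala binary version>-<version>.jar, which gives pname_2.10-1.0.jar here. With the master address found earlier, the command would for example be:

spark-submit --class Main --master spark://ubuntu:7077 target/scala-2.10/pname_2.10-1.0.jar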

One last thing to note: because the master is set in the spark-submit command, there is no need to set it again in the application code. So the application can just start with:

val conf = new SparkConf().setAppName("AFancyAppName")
val sc = new SparkContext(conf)
val lines = sc.textFile("hdfs://<hadoop master address>/<path>")
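
For completeness, here is a minimal sketch of the whole application (the object name Main matches the --class flag above; the count action and the println are just there to show that the RDD is usable):

import org.apache.spark.{SparkConf, SparkContext}

object Main {
  def main(args: Array[String]): Unit = {
    // The master is provided by spark-submit, so only the app name is set here
    val conf = new SparkConf().setAppName("AFancyAppName")
    val sc = new SparkContext(conf)
    val lines = sc.textFile("hdfs://<hadoop master address>/<path>")
    println(lines.count())   // e.g. print the number of lines in the file
    sc.stop()
  }
}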