
Spark


1 Introduction to Spark

Apache Spark is a framework for large-scale data processing. Spark was created at the University of California, Berkeley's AMPLab and was later donated to the Apache Software Foundation.

Hadoop was the first open-source implementation of MapReduce, created after Google employees published their paper on MapReduce. It was, and still is, a good implementation for batch processing of data. It comes with the Hadoop Distributed File System (HDFS), a distributed file system that is widely used with big data tools. Hadoop can exploit its knowledge of the distributed file system: it executes the Map function on the nodes that store the part of the file to be processed, which reduces file transfers between the nodes.

Spark is sometimes called a successor to Hadoop, but it is more of an addition to Hadoop, because it adds the ability to process streamed data, not only batch data like Hadoop. Spark does its processing on Resilient Distributed Datasets (RDDs), which can be stored not only on HDFS but also in databases, or received from message queues such as Kafka. This makes Spark considerably more flexible than Hadoop, because it can apply arbitrary transformations to RDDs, compared to only file system accesses in Hadoop.
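The following is a minimal sketch in Spark's Java API of how an RDD is created and transformed. The file name test.txt, the application name and the local execution mode are placeholders for illustration only; on the cluster the master URL would be used instead.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddSketch {
    public static void main(String[] args) {
        // Local mode only for illustration; on a cluster the master URL is used instead.
        SparkConf conf = new SparkConf().setAppName("RddSketch").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD can be built from a local or HDFS text file (or received from other sources).
        JavaRDD<String> lines = sc.textFile("test.txt");

        // Transformations are lazy; they only describe the computation on the RDD.
        JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());

        // Actions such as count() trigger the actual (distributed) execution.
        System.out.println("Non-empty lines: " + nonEmpty.count());

        sc.stop();
    }
}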

1.1 Implementation of Apache Spark

The test program runs on bwUniCluster, so an account on bwUniCluster is needed. The latest version of Apache Spark can be downloaded from the official Apache Spark website. There is no Hadoop installation on bwUniCluster, therefore the package "Spark 2.1.1 Pre-built for Apache Hadoop 2.7 and later" should be used.


After logging in to bwUniCluster, change into the root of the Spark folder. First, five nodes are requested with the following command (submitted once per node): one node serves as master, three nodes as slaves, and the last node as a command window.

msub -I -V -l nodes=1:ppn=1 -l walltime=0:02:00:00

Second, use the provided scripts to start the master and the slaves, and note the host name of the master node. The script "start-slave.sh" needs the master's address as a parameter to connect to it. The two commands are shown here:

./sbin/start-master.sh

./sbin/start-slave.sh spark://uc1n2xx.localdomain:7077

If the scripts run successfully, the terminal shows the names of the log files. The logs can be read with any text editor, for example vi. The log file shows that the master started at spark://uc1n251.localdomain on port 7077 on bwUniCluster, and that the three slaves registered successfully with the master.
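Once master and slaves are running, your own Spark program can also connect to this standalone cluster by setting the master URL in its SparkConf. A minimal sketch follows; the hostname uc1n251 is only an example and has to be replaced by the one reported in the master log.

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ClusterConnectSketch {
    public static void main(String[] args) {
        // The master URL must match the one printed in the master log file.
        SparkConf conf = new SparkConf()
                .setAppName("ClusterConnectSketch")
                .setMaster("spark://uc1n251.localdomain:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Small sanity check: distribute a list over the workers and count its elements.
        long n = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5)).count();
        System.out.println("Elements counted on the cluster: " + n);

        sc.stop();
    }
}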

1.2 Run the test

Now Apache Spark is ready to accept jobs from the user. The example 'JavaWordCount' is used for the test; it is included in the Spark download. The program counts how often each word appears in a document (test.txt). The content of the document is:

Hello world! It is a test for Spark on bwUniCluster.
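The JavaWordCount example essentially splits each line into words, maps every word to a pair (word, 1) and sums the counts per word. A simplified sketch of that logic (not the exact bundled source) looks like this:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCountSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCountSketch");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Split every line of the input file (e.g. test.txt) into words.
        JavaRDD<String> words = sc.textFile(args[0])
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // Map each word to (word, 1) and sum the counts per word.
        JavaPairRDD<String, Integer> counts = words
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey((a, b) -> a + b);

        // Print each word with its frequency, e.g. "Hello: 1".
        counts.collect().forEach(t -> System.out.println(t._1() + ": " + t._2()));

        sc.stop();
    }
}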

Finally, submit the task to Spark with the following command.

./bin/run-example --master spark://uc1n2xx.localdomain:7077 JavaWordCount ./test.txt

As expected from the content of test.txt, the output shows an appearance frequency of 1 for each word.