How to use Spark on Grid5000

Welcome to Spark on Grid5000

1 : Install hadoop_g5k

Create the file /home/yourUserName/.bash_profile if it does not already exist.

Add the following lines:

   export PATH
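
The step above can be scripted so that running it twice does not duplicate the line. This is only a minimal sketch: append_once is a hypothetical helper, and the directory in the example call is a placeholder, since the actual PATH value depends on where hadoop_g5k is installed.

```shell
# append_once FILE LINE: create FILE if needed, then add LINE unless it is already there.
append_once() {
    file=$1
    line=$2
    touch "$file"
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example call (the directory is a placeholder, not the real install path):
# append_once "$HOME/.bash_profile" 'export PATH="/path/to/hadoop_g5k/bin:$PATH"'
```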

Initialize cluster

Reserve nodes

Some examples

   oarsub -t allow_classic_ssh -l nodes=10,walltime=2 -r '2015-06-14 19:30:00'
   oarsub -p "cluster='paranoia'" -t allow_classic_ssh -l nodes=8,walltime=12 -r '2015-07-09 21:14:01'
   oarsub -I -p "cluster='paranoia'" -t allow_classic_ssh -l nodes=8,walltime=12
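
The -r option in the examples above takes a start time written as 'YYYY-MM-DD HH:MM:SS'. As a sketch, the helper below (start_in is a hypothetical name) builds such a string relative to the current time. It assumes GNU date, whose -d option accepts relative expressions; other date implementations may behave differently.

```shell
# start_in DELAY: print the time DELAY from now in the format expected by -r,
# for example: start_in '2 hours'
start_in() {
    date -d "+$1" '+%Y-%m-%d %H:%M:%S'
}

# Example (commented out, since oarsub only exists on a Grid5000 frontend):
# oarsub -t allow_classic_ssh -l nodes=10,walltime=2 -r "$(start_in '2 hours')"
```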

Connect to a running reservation

   oarsub -C job_ID

Take nodes directly (interactive job)

   oarsub -I -t allow_classic_ssh -l nodes=6,walltime=2

Cluster initialization

Prerequisite: on the cluster you are using, the compressed Hadoop and Spark archives must be present in one of your directories (here, public).
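
Before bootstrapping, it can help to verify that both archives are really there. A minimal sketch, where check_archives is a hypothetical helper and the file names are the ones used on this page:

```shell
# check_archives DIR FILE...: return non-zero if any FILE is missing from DIR.
check_archives() {
    dir=$1
    shift
    missing=0
    for name in "$@"; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing archive: $dir/$name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the archive names used on this page:
# check_archives "$HOME/public" hadoop-2.6.0.tar.gz spark-1.6.0-bin-hadoop2.6.tgz
```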

   hg5k --create $OAR_NODEFILE --version 2
   hg5k --bootstrap /home/yourUserName/public/hadoop-2.6.0.tar.gz
   hg5k --initialize feeling_lucky --start
   spark_g5k --create YARN --hid 1
   spark_g5k --bootstrap /home/yourUserName/public/spark-1.6.0-bin-hadoop2.6.tgz
   spark_g5k --initialize feeling_lucky --start
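
hg5k --create reads the resources of the reservation from $OAR_NODEFILE. OAR typically writes one line per reserved core into that file, so host names repeat. A small sketch (count_nodes is a hypothetical helper) to check how many distinct nodes the cluster will be built on:

```shell
# count_nodes FILE: print the number of distinct host names listed in FILE.
count_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Example, inside a reservation:
# count_nodes "$OAR_NODEFILE"
```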

Put files on HDFS

   hg5k --putindfs myfile.csv /myfile.csv

Execute a jar file

   spark_g5k --scala_job myprgm.jar
   spark_g5k --scala_job --exec_params executor-memory=1g driver-memory=1g num-executors=2 executor-cores=3 myprgm.jar
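
As a rough sizing check for the parameters above, assuming spark_g5k passes them on to Spark with the meaning of the spark-submit options of the same names: num-executors=2 with executor-cores=3 allows 2 × 3 = 6 tasks to run at once, and the executors together request 2 × 1g of memory, plus 1g for the driver (overheads not counted). The helper name below is hypothetical.

```shell
# total_task_slots NUM_EXECUTORS EXECUTOR_CORES: print how many tasks can run at once.
total_task_slots() {
    echo $(( $1 * $2 ))
}

total_task_slots 2 3   # prints 6
```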

Find files on HDFS

   hg5k --state files

Retrieve the result file named res

   hg5k --getfromdfs res /home/yourUserName/reims

Destroy the cluster properly

   spark_g5k --delete
   hg5k --delete
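
To make sure both deletions are attempted even when the first one fails, the two commands can be wrapped in a function. This is a sketch only: teardown and RUNNER are hypothetical names, and setting RUNNER=echo prints the commands instead of running them.

```shell
# teardown: delete the Spark cluster first, then the Hadoop cluster,
# in the same order as the commands above.
# RUNNER is an optional prefix; RUNNER=echo gives a dry run.
teardown() {
    ${RUNNER:-} spark_g5k --delete || echo "warning: spark_g5k --delete failed" >&2
    ${RUNNER:-} hg5k --delete
}

# In a job script, this could be registered to run on exit:
# trap teardown EXIT
```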


  1. $OAR_NODEFILE: the list of resources of your reservation