How to use Apache Spark on Grid5000

1: Install hadoop_g5k

https://github.com/mliroz/hadoop_g5k/wiki
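
A minimal sketch of the installation, assuming hadoop_g5k is installed as a Python package in user mode from the Grid5000 frontend (the exact package name and pip availability should be checked against the wiki above):

   # assumption: hadoop_g5k is pip-installable; installs into ~/.local,
   # which is why ~/.local/bin is added to the PATH below
   pip install --user hadoop_g5k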

Create the file /home/yourUserName/.bash_profile if it does not already exist.

Add the following lines:

   PATH="/home/yourUserName/.local/bin:$PATH"
   export PATH
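
To make the new PATH effective in the current shell without logging out and back in:

   source /home/yourUserName/.bash_profile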

Initialize cluster

Reserve nodes

https://www.grid5000.fr/mediawiki/index.php/Getting_Started

Some examples:

   # reserve 10 nodes for 2 hours, starting at the given date and time (advance reservation)
   oarsub -t allow_classic_ssh -l nodes=10,walltime=2 -r '2015-06-14 19:30:00'
   # reserve 8 nodes on the 'paranoia' cluster for 12 hours, starting at the given date and time
   oarsub -p "cluster='paranoia'" -t allow_classic_ssh -l nodes=8,walltime=12 -r '2015-07-09 21:14:01'
   # same request as an interactive job (-I), starting as soon as resources are available
   oarsub -I -p "cluster='paranoia'" -t allow_classic_ssh -l nodes=8,walltime=12

Connect to an existing reservation

   oarsub -C job_ID
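
If you do not remember the job_ID, the standard OAR tool oarstat lists your jobs (a sketch; yourUserName is a placeholder):

   # list your submitted and running jobs with their ids
   oarstat -u yourUserName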

Reserve nodes interactively

   oarsub -I -t allow_classic_ssh -l nodes=6,walltime=2

Known issue

Running the hg5k and spark_g5k scripts from an interactive reservation can raise a Python "DistributionNotFound" error. In this case, try adding the keyword "deploy" to your reservation command, for example:

   oarsub -I -t allow_classic_ssh -l nodes=2,walltime=0:30 deploy

Cluster initialization

https://github.com/mliroz/hadoop_g5k/wiki/spark_g5k

Prerequisite: depending on which cluster you are using, you need to have the compressed Hadoop and Spark archives in one of your directories, here public.
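
A sketch of how the archives used below can be fetched into public, assuming the Apache archive URLs are still valid:

   cd /home/yourUserName/public
   # Hadoop 2.6.0 and the matching Spark 1.6.0 build for Hadoop 2.6
   wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
   wget https://archive.apache.org/dist/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz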

   # create the Hadoop cluster on the reserved nodes (Hadoop major version 2)
   hg5k --create $OAR_NODEFILE --version 2
   # install Hadoop on the nodes from the given archive
   hg5k --bootstrap /home/yourUserName/public/hadoop-2.6.0.tar.gz
   # configure the cluster and start the Hadoop daemons
   hg5k --initialize feeling_lucky --start
   # create a Spark cluster running on YARN, attached to the Hadoop cluster with id 1
   spark_g5k --create YARN --hid 1
   # install Spark on the nodes from the given archive
   spark_g5k --bootstrap /home/yourUserName/public/spark-1.6.0-bin-hadoop2.6.tgz
   # configure Spark and start it
   spark_g5k --initialize feeling_lucky --start


Put files on HDFS

   hg5k --putindfs myfile.csv /myfile.csv

Execute a jar file

   spark_g5k --scala_job myprgm.jar
   spark_g5k --scala_job --exec_params executor-memory=1g driver-memory=1g num-executors=2 executor-cores=3 myprgm.jar
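
For reference, the exec_params above map to standard spark-submit options; running the same job by hand on the Spark master node would look roughly like this (a sketch, assuming spark-submit is available there and the jar's manifest declares its main class):

   spark-submit --master yarn-client \
       --executor-memory 1g --driver-memory 1g \
       --num-executors 2 --executor-cores 3 \
       myprgm.jar
   # if the manifest has no main class, add: --class your.main.Class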

List files on HDFS

   hg5k --state files

Get the result file named res

   hg5k --getfromdfs /res .
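
Note that if your job wrote its output with Spark's saveAsTextFile, /res is a directory of part-* files rather than a single file; after retrieving it you can concatenate the parts locally:

   cat res/part-* > res.txt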


Properly destroy the clusters

   spark_g5k --delete
   hg5k --delete
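
To also release the node reservation once the clusters are deleted (assuming you are still inside the job, where OAR sets $OAR_JOB_ID):

   oardel $OAR_JOB_ID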


Accessory: list the resources of your reservation

   uniq $OAR_NODEFILE
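
To simply count how many distinct nodes you got:

   uniq $OAR_NODEFILE | wc -l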