How to use Spark on Grid5000: Difference between revisions
(Page created with "Welcome to the Grid5000 page") | m (Changed access to HDFS data)
(20 intermediate revisions by 3 users not shown)
Current revision as of 1 April 2016, 11:52
How to use Apache Spark on Grid5000
1: Install hadoop_g5k
https://github.com/mliroz/hadoop_g5k/wiki
Create the file /home/yourUserName/.bash_profile if it does not already exist.
Add the following lines:
PATH="/home/yourUserName/.local/bin:$PATH"
export PATH
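A minimal sketch of adding these lines to the profile without duplicating them when the step is repeated. The helper name add_local_bin is hypothetical (it is not part of hadoop_g5k), and the sketch assumes $HOME resolves to /home/yourUserName.

```shell
# Append the PATH lines to a profile file unless they are already there.
# add_local_bin is a hypothetical helper name, not part of hadoop_g5k.
add_local_bin() {
    profile="$1"
    # Create the file if it does not exist yet.
    touch "$profile"
    # grep -qF: quiet, fixed-string match, so "." is taken literally.
    if ! grep -qF '.local/bin' "$profile"; then
        # Single quotes keep $HOME and $PATH unexpanded in the written lines.
        printf '%s\n' 'PATH="$HOME/.local/bin:$PATH"' 'export PATH' >> "$profile"
    fi
}
```

It would be called as add_local_bin "$HOME/.bash_profile"; a second call leaves the file unchanged.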
Initialize cluster
Reserve nodes
https://www.grid5000.fr/mediawiki/index.php/Getting_Started
Some examples
oarsub -t allow_classic_ssh -l nodes=10,walltime=2 -r '2015-06-14 19:30:00'
oarsub -p "cluster='paranoia'" -t allow_classic_ssh -l nodes=8,walltime=12 -r '2015-07-09 21:14:01'
oarsub -I -p "cluster='paranoia'" -t allow_classic_ssh -l nodes=8,walltime=12
Connect to an existing reservation
oarsub -C job_ID
Reserve nodes immediately (interactive mode)
oarsub -I -t allow_classic_ssh -l nodes=6,walltime=2
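The interactive reservation commands above differ only in the node count, the walltime and an optional cluster name. As a rough sketch, a small helper can print the matching command so it can be checked before being run. build_oarsub is a hypothetical name, and it only uses the oarsub options that appear in the examples above.

```shell
# Print (do not run) an interactive oarsub command built from the given values.
# build_oarsub is a hypothetical helper; arguments: nodes, walltime, optional cluster.
build_oarsub() {
    nodes="$1"
    walltime="$2"
    cluster="${3:-}"
    if [ -n "$cluster" ]; then
        # Restrict the reservation to one cluster with -p, as in the examples.
        printf "oarsub -I -p \"cluster='%s'\" -t allow_classic_ssh -l nodes=%s,walltime=%s\n" \
            "$cluster" "$nodes" "$walltime"
    else
        printf 'oarsub -I -t allow_classic_ssh -l nodes=%s,walltime=%s\n' \
            "$nodes" "$walltime"
    fi
}
```

For instance, build_oarsub 8 12 paranoia prints the third example from the list above.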
Known issue
Running the hg5k and spark_g5k scripts in an interactive reservation may raise a Python "DistributionNotFound" error. In that case, try adding the keyword "deploy" to your reservation command, for example:
oarsub -I -t allow_classic_ssh -l nodes=2,walltime=0:30 deploy
Cluster initialization
https://github.com/mliroz/hadoop_g5k/wiki/spark_g5k
Prerequisite: depending on which cluster you are on, you need the compressed archives of Hadoop and Spark in one of your directories (here, public).
hg5k --create $OAR_NODEFILE --version 2
hg5k --bootstrap /home/yourUserName/public/hadoop-2.6.0.tar.gz
hg5k --initialize feeling_lucky --start
spark_g5k --create YARN --hid 1
spark_g5k --bootstrap /home/yourUserName/public/spark-1.6.0-bin-hadoop2.6.tgz
spark_g5k --initialize feeling_lucky --start
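The six commands above have to run in order, so it can help to wrap them in one function that stops at the first failure. This is only a sketch: it assumes hg5k and spark_g5k exit with a non-zero status on error, which is not confirmed here. The names init_cluster, RUN, HADOOP_TGZ and SPARK_TGZ are hypothetical; the default paths are the ones used above.

```shell
# Run the Hadoop and Spark initialization commands in order.
# Set RUN=echo to print the commands instead of executing them (dry run).
# init_cluster, RUN, HADOOP_TGZ and SPARK_TGZ are hypothetical names.
init_cluster() {
    run="${RUN:-}"
    hadoop_tgz="${HADOOP_TGZ:-/home/yourUserName/public/hadoop-2.6.0.tar.gz}"
    spark_tgz="${SPARK_TGZ:-/home/yourUserName/public/spark-1.6.0-bin-hadoop2.6.tgz}"
    # "&&" stops the chain as soon as one command returns a non-zero status.
    $run hg5k --create "$OAR_NODEFILE" --version 2 &&
    $run hg5k --bootstrap "$hadoop_tgz" &&
    $run hg5k --initialize feeling_lucky --start &&
    $run spark_g5k --create YARN --hid 1 &&
    $run spark_g5k --bootstrap "$spark_tgz" &&
    $run spark_g5k --initialize feeling_lucky --start
}
```

Calling RUN=echo init_cluster inside a reservation lists the commands without touching the cluster; calling init_cluster alone executes them.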
Put files on HDFS
hg5k --putindfs myfile.csv /myfile.csv
Execute a jar file
spark_g5k --scala_job myprgm.jar
spark_g5k --scala_job --exec_params executor-memory=1g driver-memory=1g num-executors=2 executor-cores=3 myprgm.jar
Find files on HDFS
hg5k --state files
Get the result file named res
hg5k --getfromdfs /res .
Properly destroy the cluster
spark_g5k --delete
hg5k --delete
Accessory
List the resources of your reservation:
uniq $OAR_NODEFILE
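The node file can list the same host several times, which is why uniq is used above. As a small sketch, the number of distinct nodes can be counted the same way. count_nodes is a hypothetical helper name; sort -u is used instead of uniq so that repeated hosts are merged even if they are not on adjacent lines.

```shell
# Print the number of distinct host names in a node file.
# count_nodes is a hypothetical helper name.
count_nodes() {
    # sort -u removes duplicates; tr strips the padding some wc versions add.
    sort -u "$1" | wc -l | tr -d ' '
}
```

Inside a reservation it would be called as count_nodes "$OAR_NODEFILE".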
