# Accessing the TAL/LABEX EFL GPU server
The server gives access to **8 Nvidia GeForce RTX 2080 GPUs with 8 GB of RAM each** in one node. This server is reserved for external @LipnLab [LABEX EFL](https://
## 1. Connecting to the server
You can connect through the `ssh` protocol. (More help on ssh commands [here](https://
If you are physically connected to the LIPN network (that is, you are physically inside the lab at wonderful [Villetaneuse](https://
```bash
# INSIDE the lab command
$ ssh user_name@ssh.tal.lipn.univ-paris13.fr
```
Otherwise, if you are outside the LIPN, you should connect with the following command:
```bash
# OUTSIDE the lab command
$ ssh -p 60022 user_name@tal.lipn.univ-paris13.fr
```
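If you connect often, you can optionally add an entry to the `~/.ssh/config` file on your local machine so that you do not have to retype the host name and port. This is a minimal sketch only: the alias `tal-labex` is an arbitrary name, and the host and port are simply the values from the OUTSIDE command above.

```bash
# ~/.ssh/config on your LOCAL machine (the alias name is illustrative)
Host tal-labex
    HostName tal.lipn.univ-paris13.fr
    Port 60022
    User user_name
```

With this entry, `ssh tal-labex` is enough to reach the login menu from outside the lab.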
After you have connected to the TAL server through ssh, you need to choose the **Labex GPU** virtual machine **(number 20)** from the login menu:
```bash
###################################################################
# Bienvenue sur le cluster TAL #
[...]
7) Sdmc 14) GPU2 21) CheneTAL
Votre choix : 20
```
Press enter. You should see the following message:
```bash
Warning: Permanently added '
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
[...]
Last login: Thu Nov 25 17:58:00 2021 from 192.168.69.3
user_name@tal-gpu-login:~$
```
**For the moment, you need to manually source the `.bashrc` file of your NFS home every time you connect to the Labex server, in order to activate your *miniconda* GPU environment:**
```bash
user_name@tal-gpu-login:~$ source ~/.bashrc
```
Once you have installed *miniconda* and *Pytorch* (after running the commands in section 3), you will see the base miniconda prompt:
```bash
(base) user_name@tal-gpu-login:~$
```
## 2. Copying data to/from the server
We use the `scp` command to copy data to/from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside:
```bash
# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt user_name@ssh.tal.univ-paris13.fr:
# copying a whole folder
$ scp -r local_folder user_name@ssh.tal.univ-paris13.fr:
```
To copy data back to your local computer:
```bash
# INSIDE the lab commands
# copying a file from your NFS home back to your computer
my_user@my_local_computer:~$ scp user_name@ssh.tal.univ-paris13.fr:~/my_file.txt .
```
And from outside the lab:
```bash
# OUTSIDE the lab commands
# copying files
$ scp -P 60022 my_file.txt user_name@tal.lipn.univ-paris13.fr:
# copying folders recursively
$ scp -P 60022 -r local_folder user_name@tal.lipn.univ-paris13.fr:
```
Any data that you need to copy back from the server to your computer must be copied to your NFS home:
```bash
# OUTSIDE the lab commands
user_name@lipn-tal-labex:~$
user_name@lipn-tal-labex:~$
my_user@my_local_computer:~$
```
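For large folders, `rsync` can be a convenient alternative to `scp`, since it resumes interrupted transfers and only copies changed files. This is a minimal sketch, assuming a hypothetical `~/results` folder on your NFS home; the `-e` option passes the non-standard ssh port used from outside the lab.

```bash
# OUTSIDE the lab: sync a (hypothetical) ~/results folder from the server to the current local folder
$ rsync -avz -e "ssh -p 60022" user_name@tal.lipn.univ-paris13.fr:~/results ./
```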
## 3. Installing *miniconda* and *Pytorch*
Check the [Miniconda documentation](https://
```bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
$ sh Miniconda3-py37_4.10.3-Linux-x86_64.sh
```
The installation script will run. Press the space bar until the end of the license agreement, then write `yes` to proceed.
```bash
Welcome to Miniconda3 py37_4.10.3
[...]
Do you accept the license terms? [yes|no]
[no] >>>
```
Choose the installation path on your NFS home:
```bash
Miniconda3 will now be installed into this location:
[...]
```
Now you will be asked if you want to add the *conda* base environment to your `.bashrc` file. Answer yes.
```bash
Preparing transaction: done
Executing transaction: done
[...]
by running conda init? [yes|no]
[no] >>>
```
Manually source your `.bashrc` file on your NFS home to activate the *miniconda* environment before installing Pytorch:
```bash
user_name@lipn-tal-labex:~$ source ~/.bashrc
(base) user_name@lipn-tal-labex:~$
```
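If you prefer to keep the `base` environment clean, you can optionally create a dedicated conda environment for your GPU experiments before installing anything. This is a sketch only; the environment name `gpu` and the Python version are arbitrary choices, not something imposed by the server.

```bash
# optional: create and activate a dedicated environment (name and version are illustrative)
(base) user_name@lipn-tal-labex:~$ conda create -n gpu python=3.7
(base) user_name@lipn-tal-labex:~$ conda activate gpu
(gpu) user_name@lipn-tal-labex:~$
```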
Install the [Pytorch](https://
```bash
(base) user_name@lipn-tal-labex:~$
[...]
Proceed ([y]/n)? y
```
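The exact install command depends on the CUDA version available on the node (you can check it with `nvidia-smi`, see Section 4). As a hedged example, a typical conda invocation for that PyTorch generation looked like the line below; treat the `cudatoolkit` version as an assumption and adjust it to what the server reports.

```bash
# hypothetical install command: match cudatoolkit to the CUDA version shown by nvidia-smi
(base) user_name@lipn-tal-labex:~$ conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
```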
(Type `y` to proceed.) After a while, you need to test your Pytorch install. To test it, create the following `gpu_test.py` program with your favorite editor:
```python
# Python program to count GPU cards in the server using Pytorch
import torch

available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)
```
Then run it with the *SLURM* command `srun`:
```bash
(base) user_name@lipn-tal-labex:~$ srun python gpu_test.py
<torch.cuda.device object at 0x7f29f0602d10>
<torch.cuda.device object at 0x7f29f0602d90>
<torch.cuda.device object at 0x7f29f0602e90>
<torch.cuda.device object at 0x7f29f0618cd0>
<torch.cuda.device object at 0x7f29f0618d10>
<torch.cuda.device object at 0x7f29f0618d90>
<torch.cuda.device object at 0x7f29f0618dd0>
<torch.cuda.device object at 0x7f29f0618e10>
```
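To double-check that CUDA is actually usable and to see the GPU model names, you can also run a quick one-liner; `torch.cuda.is_available()` and `torch.cuda.get_device_name()` are standard PyTorch calls.

```bash
# prints True if CUDA is usable, then the model name of each visible GPU
$ srun python -c "import torch; print(torch.cuda.is_available()); print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])"
```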
## 4. Using `slurm` to run your code
[Slurm](https:// is the workload manager used to run jobs on the server.
### `srun`
This is the basic command for running jobs in Slurm. This example shows how to check the GPU models you are using and the CUDA version by running the `nvidia-smi` command with `srun`:
```bash
$ srun nvidia-smi
Mon Nov 29 13:27:00 2021
[...]
+-----------------------------------------------------------------------------+
```
**You can use it to run Python code, but as you are working on a shared server, it is better to run your code with `sbatch`.**
### `sinfo` and `scontrol`
The `sinfo` command shows how many nodes are available on the server:
```bash
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
labex*
```
If you want to check the particular configuration of a node, use `scontrol`:
```bash
$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
[...]
```
### `squeue`
If the server is full, Slurm will put your job in a waiting queue. You can check the queue state with `squeue`:
```bash
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
 8795     labex QKVRegLA ghazi.fe  R 2-23:
 8796     labex QKVRegLA ghazi.fe  R 2-23:
 8812     labex MicrofEx gerardo.
```
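To list only your own jobs, you can filter the queue by user with the standard `-u` option:

```bash
# show only the jobs submitted by the current user
$ squeue -u $USER
```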
### `sbatch`
If you simply run your code with `srun`, your job will try to use all the available resources (as in the `gpu_test.py` example from Section 3). The `sbatch` command is therefore useful to configure inputs, outputs and resource requirements for your job. The following example configures the `gpu_test.py` script to use only 3 GPUs and specifies output files for the job.

First, create a `myfirst_gpu_job.sh` file:
```bash
#!/bin/bash
#SBATCH --job-name=MyFirstJob
#SBATCH --qos=qos_gpu-t4
#SBATCH --cpus-per-task=5
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
srun python gpu_test.py
```
These parameters specify a job to be run on 1 node, with 3 GPUs, and a maximum time of 100 hours. Normal output will be sent to `MyFirstJob.out` and errors to `MyFirstJob.err`.
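Note that Slurm usually needs an explicit directive in the script to reserve the GPUs themselves; on most installations this is done with a generic resource (GRES) request. The line below is a sketch under that assumption, so check with the server administrators for the exact GRES name used here.

```bash
# hypothetical directive: request 3 of the node's GPUs (adjust to the server's GRES configuration)
#SBATCH --gres=gpu:3
```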
Then run the script with `sbatch`:
```bash
$ sbatch myfirst_gpu_job.sh
```
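Once the job is submitted, you can follow its progress with `squeue` and watch the output file declared in the `--output` directive as it is written:

```bash
# check the job's state, then follow its standard output
$ squeue -u $USER
$ tail -f MyFirstJob.out
```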
### `scancel`
From time to time you will need to kill a job. Use the `JOBID` number from the `squeue` command:
```bash
$ squeue
JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[...]
 4760      lipn my_first garciafl PD
 4761      lipn SUPER_En
 4675      lipn    GGGS1 xudong.z  R 6-20:
 4715      lipn SUPER_En
 4752      lipn SUPER_En
$ scancel 4759
```
## Troubleshooting
For any questions about this doc, write to [Jorge Garcia Flores](mailto: