# Accessing the TAL/LABEX EFL GPU server
The server gives access to **8 Nvidia GeForce RTX 2080 GPUs with 8 GB of RAM each** in one node. This server is reserved for external @LipnLab [LABEX EFL](https://www.labex-efl.fr/) research partners. You need to [send us an email](mailto:jgflores@lipn.fr) to request a `tal-lipn` account in order to get access to this server.
  
## 1. Connecting to the server
  
You can connect through the `ssh` protocol. (More help on `ssh` commands [here](https://www.ssh.com/academy/ssh/command).)
If you are physically connected to the LIPN network (that is, you are inside the lab at wonderful [Villetaneuse](https://www.mairie-villetaneuse.fr/visite-guidee)), type the following command:
  
<code bash>
# INSIDE the lab command
$ ssh user_name@ssh.tal.lipn.univ-paris13.fr
</code>
Otherwise, if you are outside the LIPN, you should connect with the following command:
  
<code bash>
# OUTSIDE the lab command
$ ssh -p 60022 user_name@tal.lipn.univ-paris13.fr
</code>
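Optionally, to avoid retyping the port and user name when connecting from outside the lab, you can add a host alias to the `~/.ssh/config` file on your own machine. This is just a convenience sketch: the alias `tal-labex` is an arbitrary name chosen here, and `user_name` stands for your `tal-lipn` account.

<code bash>
# on your LOCAL machine: append a shortcut entry to ~/.ssh/config
$ cat >> ~/.ssh/config <<'EOF'
Host tal-labex
    HostName tal.lipn.univ-paris13.fr
    Port 60022
    User user_name
EOF
# afterwards you can connect from outside the lab with:
$ ssh tal-labex
</code>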
  
  
After you connect through `ssh` to the TAL server, you need to choose the **Labex GPU** virtual machine **(number 20)** from the login menu:
  
<code bash>
###################################################################
# Bienvenue sur le cluster TAL                                    #
[...]
7) Sdmc            14) GPU2           21) CheneTAL
Votre choix : 20
</code>
  
Press Enter. You should see the following message:
  
<code bash>
Warning: Permanently added '192.168.69.77' (ECDSA) to the list of known hosts.
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
[...]
Last login: Thu Nov 25 17:58:00 2021 from 192.168.69.3
user_name@tal-gpu-login:~$
</code>
  
**For the moment, you need to manually source the `.bashrc` file of your NFS home every time you connect to the Labex server, in order to activate your *miniconda* GPU environment (see section 3). So, each time you log in, you need to type:**
  
<code bash>
user_name@tal-gpu-login:~$ source .bashrc
</code>
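If you prefer not to type this at every login, one possible workaround (not part of the official setup here, and assuming the login shell reads `~/.bash_profile`) is to let the login shell source `.bashrc` for you:

<code bash>
# optional workaround: source ~/.bashrc automatically at login
# (assumption: the login shell reads ~/.bash_profile; adapt if it uses ~/.profile instead)
user_name@tal-gpu-login:~$ echo '[ -f ~/.bashrc ] && . ~/.bashrc' >> ~/.bash_profile
</code>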
  
Once you have installed *miniconda* and *Pytorch* (after you run the commands in section 3), you will see the base *miniconda* prompt:
  
<code bash>
(base) user_name@tal-gpu-login:~$
</code>
  
  
## 2. Copying data to/from the server
We use the `scp` command to copy data to/from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside:
  
<code bash>
# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt user_name@ssh.tal.lipn.univ-paris13.fr:~/
# copying a whole folder
$ scp -r local_folder user_name@ssh.tal.lipn.univ-paris13.fr:~/remote_folder
</code>
To copy data back to your local computer:
<code bash>
# INSIDE the lab commands
my_user@my_local_computer:~$ scp user_name@ssh.tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
</code>
And from outside the lab:
<code bash>
# OUTSIDE the lab commands
# copying files
$ scp -P 60022 my_file.txt user_name@tal.lipn.univ-paris13.fr:~/
# copying folders recursively
$ scp -P 60022 -r local_folder user_name@tal.lipn.univ-paris13.fr:~/remote_folder
</code>
Any data that you need to copy back from the server to your computer must be copied to your NFS home:
  
<code bash>
# OUTSIDE the lab commands
user_name@lipn-tal-labex:~$ cp any_file.txt /users/username/my_folder/
user_name@lipn-tal-labex:~$ exit
my_user@my_local_computer:~$ scp -P 60022 user_name@tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
</code>
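For large datasets you may prefer `rsync`, which can resume interrupted transfers. This is only a sketch and assumes `rsync` is installed on both your machine and the server:

<code bash>
# OUTSIDE the lab: resumable copy of a big folder through ssh port 60022
# (assumes rsync is available on both ends)
$ rsync -avP -e "ssh -p 60022" local_folder/ user_name@tal.lipn.univ-paris13.fr:~/remote_folder/
</code>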
  
  
## 3. Installing *miniconda* and *Pytorch*
Check the [Miniconda documentation](https://docs.conda.io/en/latest/miniconda.html) to get the link of the latest Linux 64-bit Miniconda installer to use with the `wget` command.
  
<code bash>
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
$ sh Miniconda3-py37_4.10.3-Linux-x86_64.sh
</code>
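If you want to verify the download, you can compare its SHA-256 checksum with the value published on the Miniconda download page:

<code bash>
# print the checksum and compare it with the one listed on the Miniconda page
$ sha256sum Miniconda3-py37_4.10.3-Linux-x86_64.sh
</code>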
  
The installation script will run. Press the space bar to scroll to the end of the license agreement, then type `yes` to proceed.
  
<code bash>
Welcome to Miniconda3 py37_4.10.3
[...]
Do you accept the license terms? [yes|no]
[no] >>> yes
</code>
  
Choose the installation path on your NFS home:
  
<code bash>
Miniconda3 will now be installed into this location:
/home/garciaflores/miniconda3
[...]
[/home/username/miniconda3] >>> /home/username/code/python/miniconda3
</code>
  
Now you will be asked if you want to add the *conda* base environment to your `.bashrc` file. Answer `yes`.
  
<code bash>
Preparing transaction: done
Executing transaction: done
[...]
by running conda init? [yes|no]
[no] >>> yes
</code>
  
Manually source the `.bashrc` file on your NFS home to activate the *miniconda* environment before installing Pytorch:
  
<code bash>
user_name@lipn-tal-labex:~$ cd ~
user_name@lipn-tal-labex:~$ source .bashrc
(base) user_name@lipn-tal-labex:~$
</code>
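To double-check that *conda* is now on your `PATH`, you can for example print its version and list the available environments:

<code bash>
# quick sanity check of the conda installation
(base) user_name@lipn-tal-labex:~$ conda --version
(base) user_name@lipn-tal-labex:~$ conda env list
</code>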
  
  
Install the [Pytorch](https://pytorch.org/) framework in your base *miniconda* environment:
  
<code bash>
(base) user_name@lipn-tal-labex:~$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
[...]
Proceed ([y]/n)? y
</code>
  
Type `y` to proceed. After the installation finishes, you can test your Pytorch install.
  
To test it, create the following `gpu_test.py` program with your favorite editor:
  
<code python>
# Python program to count GPU cards in the server using Pytorch
import torch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)
</code>
  
Then run it with the *SLURM* command `srun`:
  
<code bash>
(base) user_name@lipn-tal-labex:~$ srun python3 gpu_test.py
<torch.cuda.device object at 0x7f29f0602d10>
<torch.cuda.device object at 0x7f29f0602d90>
<torch.cuda.device object at 0x7f29f0602e90>
<torch.cuda.device object at 0x7f29f0618cd0>
<torch.cuda.device object at 0x7f29f0618d10>
<torch.cuda.device object at 0x7f29f0618d90>
<torch.cuda.device object at 0x7f29f0618dd0>
<torch.cuda.device object at 0x7f29f0618e10>
</code>
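If you also want to confirm that CUDA is visible to Pytorch and see the GPU model names, a quick additional check (using standard Pytorch calls, in the same spirit as `gpu_test.py`) is:

<code bash>
# prints: CUDA availability, number of GPUs, and the model name of GPU 0
(base) user_name@lipn-tal-labex:~$ srun python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count(), torch.cuda.get_device_name(0))"
</code>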
  
## 4. Using `slurm` to run your code
[Slurm](https://slurm.schedmd.com/overview.html) is the Linux workload manager we use at LIPN to schedule and queue GPU jobs.
  
### srun
This is the basic command for running jobs in Slurm. This example shows how to check the GPU models you are using and the CUDA version by running the `nvidia-smi` command with `srun`:
<code bash>
$ srun nvidia-smi
Mon Nov 29 13:27:00 2021
[...]
+-----------------------------------------------------------------------------+
</code>
  
**You can use it to run Python code, but as you are working on a shared server, it is better to run your code with `sbatch`.**
  
### sinfo and scontrol
The `sinfo` command shows how many nodes are available in the server:
  
<code bash>
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
labex*       up   infinite      1    mix tal-gpu-labex1
</code>
If you want to check the particular configuration of a node, use `scontrol`:
<code bash>
$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
[...]
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
</code>
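`scontrol` can also display the details of a single job if you pass it a job id taken from `squeue` (described just below); `<JOBID>` here is only a placeholder:

<code bash>
# replace <JOBID> with a job id from the squeue output
$ scontrol show job <JOBID>
</code>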
  
### squeue
  
If the server is full, Slurm will put your job in a waiting queue. You can check the queue state with `squeue`:
  
<code bash>
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8795     labex QKVRegLA ghazi.fe  R 2-23:50:30      1 tal-gpu-labex1
              8796     labex QKVRegLA ghazi.fe  R 2-23:41:19      1 tal-gpu-labex1
              8812     labex MicrofEx gerardo.  R      24:31      1 tal-gpu-labex1
</code>
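On busy days the queue can get long. To list only your own jobs, you can filter by user (replace `user_name` with your account):

<code bash>
# show only your jobs
$ squeue -u user_name
</code>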
  
### sbatch
If you simply run your code with `srun`, your job will try to use all the available resources (as in the `gpu_test.py` example from Section 3). The `sbatch` command is useful to configure inputs, outputs and resource requirements for your job. The following example configures the `gpu_test.py` example to use only 3 GPUs, and specifies output files for the job.
  
First, create a `myfirst_gpu_job.sh` file:
  
<code bash>
#!/usr/bin/env bash
#SBATCH --job-name=MyFirstJob
[...]
#SBATCH --qos=qos_gpu-t4
#SBATCH --cpus-per-task=5
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
srun python gpu_test.py
</code>
  
These parameters specify a job to be run on 1 node with 3 GPUs, for a maximum time of 100 hours. Normal output will be sent to `MyFirstJob.out` and errors to `MyFirstJob.err`.
  
Then you run the script with `sbatch`:
  
<code bash>
$ sbatch myfirst_gpu_job.sh
</code>
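Once the job is submitted, you can check that it is queued or running and follow its output while it is being written; the file name below is the one set in the `#SBATCH --output` directive above:

<code bash>
# check the job state, then follow its output (Ctrl-C stops the follow, not the job)
$ squeue -u user_name
$ tail -f MyFirstJob.out
</code>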
  
### scancel
  
From time to time you will need to kill a job. Use the `JOBID` number from the `squeue` command:
  
<code bash>
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[...]
              4760      lipn my_first garciafl PD       0:00      1 (Priority)
              4761      lipn SUPER_En   leroux PD       0:00      1 (Priority)
              4675      lipn    GGGS1 xudong.z  R 6-20:30:00      1 lipn-rtx1
              4715      lipn SUPER_En   leroux  R 5-00:03:11      1 lipn-rtx2
              4752      lipn SUPER_En   leroux  R 2-21:37:05      1 lipn-rtx2
$ scancel 4759
</code>
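If you need to cancel all of your jobs at once, `scancel` also accepts a user filter. Be careful: this kills every job belonging to the account (replace `user_name` with yours):

<code bash>
# cancel ALL jobs belonging to your account
$ scancel -u user_name
</code>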

## Troubleshooting
  
For any questions about this doc, write to [Jorge Garcia Flores](mailto:jgflores@lipn.fr).