# LIPN Deep Learning GPU servers

[@LipnLab](https://twitter.com/LipnLab) has two servers dedicated to GPU-accelerated computation for Deep Learning research tasks, with a third one currently being deployed. For any question about this documentation, write to [Jorge Garcia Flores](mailto:jgflores@lipn.fr) or to the lipn-gpu list (you need to be granted access first).

## LIPN GPU

This server has **11 Nvidia Quadro RTX 5000 GPUs with 16 GB of RAM each**, distributed over three nodes. Access is reserved for @LipnLab researchers, guests and associated members. You need a [Lipn-intranet](https://sso.lipn.univ-paris13.fr/cas/login?service=https%3A%2F%2Fportail.lipn.univ-paris13.fr%2Fportail%2F) account to get access to this server.

You can find documentation on how to run your code on the [[public:support:labex-efl-gpu:lipn-gpu|lipn-gpu server here]].

## TAL/Labex EFL

The **`lipn-tal-labex`** server gives access to **8 Nvidia GeForce RTX 2080 GPUs with 8 GB of RAM each** in one node. It is reserved for external @LipnLab [LABEX EFL](https://www.labex-efl.fr/) research partners. Write an email to [Jorge Garcia Flores](mailto:jgflores@lipn.fr) to ask for a `tal-lipn` account: for the moment, a standard LIPN intranet account does not give access to this part of the network.

You can read the [tal-labex-gpu documentation here](https://lipn.univ-paris13.fr/wiki/doku.php?id=public:support:labex-efl-gpu:tal-labex-gpu), or follow the step-by-step guide below.

## LIPN-L2TI

We are currently deploying a new server with **8 Nvidia A40 GPU accelerators with 48 GB of RAM each** (documentation soon). In the meantime, here is some [[public:support:hp-a40-gpu|technical documentation about how to install A40 GPU drivers]] on the new HP ProLiant server.

## 1. Connecting to the Labex EFL server

You can connect through the `ssh` protocol. (More help on ssh commands [here](https://www.ssh.com/academy/ssh/command).)

If you are physically connected to the LIPN network (that is, you are physically inside the lab at wonderful [Villetaneuse](https://www.mairie-villetaneuse.fr/visite-guidee)), type the following command:

```bash
# INSIDE the lab command
$ ssh user_name@ssh.tal.lipn.univ-paris13.fr
```
Otherwise, if you are outside the LIPN network, you should connect with the following command:

```bash
# OUTSIDE the lab command
$ ssh -p 60022 user_name@tal.lipn.univ-paris13.fr
```
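
If you connect from outside often, you can avoid retyping the port and host by adding an alias to the `~/.ssh/config` file on your local machine. This is a minimal sketch: the alias name `tal-outside` is arbitrary, and `user_name` stands for your own account.

```bash
# ~/.ssh/config (on your local machine)
Host tal-outside
    HostName tal.lipn.univ-paris13.fr
    Port 60022
    User user_name
```

After that, `ssh tal-outside` (and `scp` with the same alias) will use the right host and port automatically.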

After you connect through ssh to the TAL server, you need to choose the **Labex GPU** virtual machine **(number 20)** from the login menu:

```bash
###################################################################
# Bienvenue sur le cluster TAL                                    #
#                                                                 #
# Cette machine vous propose plusieurs services dédiés au calcul. #
# Tal est un ensemble de machines stand-alone destinées à du      #
# développement ou à des tâches légères.                          #
# La liste des noeuds utilisables est visible ci-dessous.         #
#                                                                 #
# Pour toute remarque, ecrivez a systeme@lipn.univ-paris13.fr     #
#                                                                 #
###################################################################


1) Wiki             8) Inception      15) CPU            22) BNI2
2) Redmine          9) Eswc           16) Citeseer       23) LexEx
3) GPU             10) Wirarika       17) SolR           24) Stanbol
4) Neoveille       11) Kilroy         18) Morfetik       25) Quitter TAL
5) Hybride         12) Bni            19) Ganso
6) Unoporuno       13) Cartographies  20) LabexGPU
7) Sdmc            14) GPU2           21) CheneTAL
Votre choix : 20
```

Press enter. You should see the following message:

```bash
Warning: Permanently added '192.168.69.77' (ECDSA) to the list of known hosts.
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
------------------------------------------------------
       Bienvenue sur le serveur GPU du LABEX

Cette machine utilise le gestionnaire de tâches SLURM.

Pour utiliser les ressources CPU et GPU de la machine,
veuillez utiliser les commandes sbatch et srun.

Pour plus d'informations, visitez le site :

------------------------------------------------------
Last login: Thu Nov 25 17:58:00 2021 from 192.168.69.3
user_name@tal-gpu-login:~$
```

**For the moment, you need to manually source the `.bashrc` file of your NFS home every time you connect to the Labex server, in order to activate your *miniconda* GPU environment (see Section 3). So, each time you log in, you need to type:**

```bash
user_name@tal-gpu-login:~$ source .bashrc
```

Once you have installed *miniconda* and *Pytorch* (after you run the Section 3 commands), you will see the base miniconda prompt:

```bash
(base) user_name@tal-gpu-login:~$
```

## 2. Copying data

You have an NFS home on the server associated with your user account. You will copy your training data and your output models to/from this home, which is always accessible from the GPU server.

We use the `scp` command to copy data to/from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside:

```bash
# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt user_name@ssh.tal.lipn.univ-paris13.fr:~/
# copying a whole folder
$ scp -r local_folder user_name@ssh.tal.lipn.univ-paris13.fr:~/remote_folder
```
To copy data back to your local computer:
```bash
# INSIDE the lab commands
my_user@my_local_computer:~$ scp user_name@ssh.tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
```
And from outside the lab:
```bash
# OUTSIDE the lab commands
# copying files
$ scp -P 60022 my_file.txt user_name@lipnssh.univ-paris13.fr:~/
# copying folders recursively
$ scp -r -P 60022 local_folder user_name@tal.lipn.univ-paris13.fr:~/remote_folder
```
Any data that you need to copy back from the server to your computer must first be copied to your NFS home:

```bash
# OUTSIDE the lab commands
user_name@lipn-tal-labex:~$ cp any_file.txt /users/username/my_folder/
user_name@lipn-tal-labex:~$ exit
my_user@my_local_computer:~$ scp -P 60022 user_name@tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
```
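
For large training sets, `rsync` can resume interrupted transfers, which `scp` cannot. This is a minimal sketch, assuming `rsync` is installed both on your machine and on the server, and that you are connecting from outside the lab:

```bash
# OUTSIDE the lab command: copy a folder, resuming if interrupted
$ rsync -avz -e "ssh -p 60022" local_folder/ user_name@tal.lipn.univ-paris13.fr:~/remote_folder/
```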

## 3. Installing Python libraries

Before running Python code on the GPU servers, you need to install Miniconda and Pytorch.

### Miniconda

Check the [Miniconda documentation](https://docs.conda.io/en/latest/miniconda.html) to get the link of the latest Linux 64-bit miniconda installer to use with the `wget` command.

```bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
$ sh Miniconda3-py37_4.10.3-Linux-x86_64.sh
```

The installation script will run. Press the space bar to scroll through the license agreement, then type `yes` to proceed.

```bash
Welcome to Miniconda3 py37_4.10.3

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
===================================
End User License Agreement - Miniconda
===================================

Copyright 2015-2021, Anaconda, Inc.

All rights reserved under the 3-clause BSD License:
[...]
Do you accept the license terms? [yes|no]
[no] >>> yes
```

Choose the installation path in your NFS home:

```bash
Miniconda3 will now be installed into this location:
/home/garciaflores/miniconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/username/miniconda3] >>> /home/username/code/python/miniconda3
```

Now you will be asked whether you want the installer to initialize the *conda* base environment in your `.bashrc` file. Answer `yes`.

```bash
Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes
```

Manually source the `.bashrc` file of your NFS home to activate the *miniconda* environment before installing Pytorch.

```bash
user_name@lipn-tal-labex:~$ cd ~
user_name@lipn-tal-labex:~$ source .bashrc
(base) user_name@lipn-tal-labex:~$
```
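
If you prefer not to install everything into the `base` environment, you can first create a dedicated conda environment. This is a minimal sketch: the environment name `gpu-env` and the Python version are arbitrary choices.

```bash
# Create and activate a dedicated environment for your GPU experiments
(base) user_name@lipn-tal-labex:~$ conda create -n gpu-env python=3.7
(base) user_name@lipn-tal-labex:~$ conda activate gpu-env
(gpu-env) user_name@lipn-tal-labex:~$
```

The Pytorch installation below works the same way inside such an environment.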

### Pytorch

Install the [Pytorch](https://pytorch.org/) framework in your base miniconda environment:

```bash
(base) user_name@lipn-tal-labex:~$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
[...]
  pillow             pkgs/main/linux-64::pillow-8.4.0-py37h5aabda8_0
  pytorch            pytorch/linux-64::pytorch-1.10.0-py3.7_cpu_0
  pytorch-mutex      pytorch/noarch::pytorch-mutex-1.0-cpu
  torchaudio         pytorch/linux-64::torchaudio-0.10.0-py37_cpu
  torchvision        pytorch/linux-64::torchvision-0.11.1-py37_cpu
  typing_extensions  pkgs/main/noarch::typing_extensions-3.10.0.2-pyh06a4308_0
  zstd               pkgs/main/linux-64::zstd-1.4.9-haebb681_0

Proceed ([y]/n)? y
```

(Type `y` to proceed.) After a while, the installation finishes and you can test your Pytorch install.

To test it, create the following `gpu_test.py` program with your favorite editor:

```python
# Python program to count the GPU cards in the server using Pytorch
import torch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)
```

Then run it with the *SLURM* command `srun`:

```bash
(base) user_name@lipn-tal-labex:~$ srun python3 gpu_test.py
<torch.cuda.device object at 0x7f29f0602d10>
<torch.cuda.device object at 0x7f29f0602d90>
<torch.cuda.device object at 0x7f29f0602e90>
<torch.cuda.device object at 0x7f29f0618cd0>
<torch.cuda.device object at 0x7f29f0618d10>
<torch.cuda.device object at 0x7f29f0618d90>
<torch.cuda.device object at 0x7f29f0618dd0>
<torch.cuda.device object at 0x7f29f0618e10>
```
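
To also see which GPU model each device is, you can print the device names. This is a short variant of the test above; the file name `gpu_names.py` is just a suggestion.

```python
# Print the index and model name of every GPU visible to Pytorch
import torch

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

Run it the same way: `srun python3 gpu_names.py`.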

## 4. Using `slurm` to run your code

[Slurm](https://slurm.schedmd.com/overview.html) is the Linux workload manager we use at LIPN to schedule and queue GPU jobs.

### `srun`
This is the basic command for running jobs in Slurm. This example shows how to check the GPU models you are using and the CUDA version by running the `nvidia-smi` command with `srun`.
```bash
$ srun nvidia-smi
Mon Nov 29 13:27:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 26%   32C    P8    13W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1B:00.0 Off |                  N/A |
| 27%   31C    P8    19W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

[...]

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    4   N/A  N/A     38044      C   /usr/bin/python                  8941MiB |
|    6   N/A  N/A     38185      C   /usr/bin/python                  3889MiB |
+-----------------------------------------------------------------------------+

```

**You can use it to run Python code, but as you are working on a shared server, it is better to run your code with `sbatch`.**
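
If you do run something quick with `srun` directly, it is good practice on a shared server to request only the resources you need. The example below is a sketch using standard Slurm options, not a site-specific policy:

```bash
# Request a single GPU and a 10-minute time limit for the test script
$ srun --gres=gpu:1 --time=00:10:00 python3 gpu_test.py
```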

### `sinfo` and `scontrol`
The `sinfo` command shows which nodes are available on the server.

```bash
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
labex*       up   infinite      1    mix tal-gpu-labex1
```
If you want to check the configuration of a particular node, use `scontrol`:
```bash
$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=12 CPUTot=40 CPULoad=2.19
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8,mps:800
   NodeAddr=192.168.69.101 NodeHostName=tal-gpu-labex1 Version=19.05.5
   OS=Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18)
   RealMemory=128553 AllocMem=0 FreeMem=169091 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=labex
   BootTime=2021-08-02T15:48:18 SlurmdStartTime=2021-08-02T15:48:35
   CfgTRES=cpu=40,mem=128553M,billing=40
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
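
You can also ask `sinfo` for a compact per-node summary of the generic resources (GPUs), CPU usage and memory. The format string below is just one convenient choice:

```bash
# One line per node: node name, GPUs (gres), CPUs (allocated/idle/other/total), memory in MB
$ sinfo -N -o "%N %G %C %m"
```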

### `squeue`

If the server is full, your job will be put in a waiting queue by Slurm. You can check the state of the queue with `squeue`:

```bash
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8795     labex QKVRegLA ghazi.fe  R 2-23:50:30      1 tal-gpu-labex1
              8796     labex QKVRegLA ghazi.fe  R 2-23:41:19      1 tal-gpu-labex1
              8812     labex MicrofEx gerardo.  R      24:31      1 tal-gpu-labex1
```
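
On a busy server the queue can get long; to see only your own jobs, filter by user:

```bash
# Show only the jobs submitted by your account
$ squeue -u $USER
```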

### `sbatch`
If you simply run your code with `srun`, your job will try to use all the available resources (as in the `gpu_test.py` example from Section 3). The `sbatch` command lets you configure inputs, outputs and resource requirements for your job. The following example configures the `gpu_test.py` script to use only 3 GPUs and specifies output files for the job.

First, create a `myfirst_gpu_job.sh` file:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=MyFirstJob
#SBATCH --gres=gpu:3
#SBATCH --qos=qos_gpu-t4
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=1
srun python gpu_test.py
```

These parameters specify a job to be run on 1 node with 3 GPUs and a maximum run time of 100 hours. Standard output will be written to `MyFirstJob.out` and errors to `MyFirstJob.err`.

Then submit the script with `sbatch`:

```bash
$ sbatch myfirst_gpu_job.sh
```
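
If your job needs the *miniconda* environment from Section 3, remember that `.bashrc` is not sourced automatically on this server. Below is a minimal sketch of a batch script that activates conda before launching Python; the path assumes miniconda was installed in `~/code/python/miniconda3`, as in the example above, so adjust it to your own install location.

```bash
#!/usr/bin/env bash
#SBATCH --job-name=MyCondaJob
#SBATCH --gres=gpu:3
#SBATCH --output=./MyCondaJob.out
#SBATCH --error=./MyCondaJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5
# Make conda usable in the non-interactive batch shell
# (path is an assumption based on the install example in Section 3)
source ~/code/python/miniconda3/etc/profile.d/conda.sh
conda activate base
srun python gpu_test.py
```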

### `scancel`

From time to time you will need to kill a job. Use the `JOBID` number shown by `squeue`:

```bash
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              4759      lipn my_first garciafl PD       0:00      3 (Resources)
              4760      lipn my_first garciafl PD       0:00      1 (Priority)
              4761      lipn SUPER_En   leroux PD       0:00      1 (Priority)
              4675      lipn    GGGS1 xudong.z  R 6-20:30:00      1 lipn-rtx1
              4715      lipn SUPER_En   leroux  R 5-00:03:11      1 lipn-rtx2
              4752      lipn SUPER_En   leroux  R 2-21:37:05      1 lipn-rtx2
$ scancel 4759
```
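
To cancel all of your own pending and running jobs at once:

```bash
# Cancel every job belonging to your account
$ scancel -u $USER
```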
- +