# LIPN Deep Learning GPU servers

[@LipnLab](https://twitter.com/LipnLab) has two servers dedicated to GPU-accelerated computation for Deep Learning research tasks, with a third one currently being deployed. For any question about this documentation, write to [Jorge Garcia Flores](mailto:jgflores@lipn.fr) or to the lipn-gpu list (you need to be granted access first).

## LIPN GPU

This server has **11 Nvidia Quadro RTX 5000 GPUs with 16 GB of RAM each**, distributed over three nodes. Access is reserved for @LipnLab researchers, guests and associated members. You need a [Lipn-intranet](https://sso.lipn.univ-paris13.fr/cas/login?service=https%3A%2F%2Fportail.lipn.univ-paris13.fr%2Fportail%2F) account to get access to this server.

You can find documentation on how to run your code on the [[public:support:labex-efl-gpu:lipn-gpu|lipn-gpu server here]].

## TAL/Labex EFL

The **`lipn-tal-labex`** server gives access to **8 Nvidia GeForce RTX 2080 GPUs with 8 GB of RAM each** in one node. It is reserved for external @LipnLab [LABEX EFL](https://www.labex-efl.fr/) research partners. Write an email to [Jorge Garcia Flores](mailto:jgflores@lipn.fr) to ask for a `tal-lipn` account: for the moment, a standard LIPN intranet account does not give access to this part of the network.

You can read the [tal-labex-gpu documentation here](https://lipn.univ-paris13.fr/wiki/doku.php?id=public:support:labex-efl-gpu:tal-labex-gpu), or follow the step-by-step guide below.

## LIPN-L2TI

We are currently deploying a new server with **8 Nvidia A40 GPU accelerators with 48 GB of RAM each** (documentation soon). In the meantime, here is some [[public:support:hp-a40-gpu|technical documentation about how to install A40 GPU drivers]] on the new HP ProLiant server.

## 1. Connecting to the Labex EFL server

You can connect through the `ssh` protocol. (More help on ssh commands [here](https://www.ssh.com/academy/ssh/command).)

If you are physically connected to the LIPN network (that is, you are physically inside the lab at wonderful [Villetaneuse](https://www.mairie-villetaneuse.fr/visite-guidee)), type the following command:

```bash
# INSIDE the lab command
$ ssh user_name@ssh.tal.lipn.univ-paris13.fr
```
Otherwise, if you are outside the LIPN network, you should connect with the following command:

```bash
# OUTSIDE the lab command
$ ssh -p 60022 user_name@tal.lipn.univ-paris13.fr
```
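
If you connect from outside often, you can avoid retyping the port and host by adding an alias to the `~/.ssh/config` file on your local machine. This is a minimal sketch: the alias name `tal-outside` is arbitrary, and `user_name` stands for your own account.

```bash
# ~/.ssh/config (on your local machine)
Host tal-outside
    HostName tal.lipn.univ-paris13.fr
    Port 60022
    User user_name
```

After that, `ssh tal-outside` (and `scp` with the same alias) will use the right host and port automatically.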

After you connect through ssh to the TAL server, you need to choose the **Labex GPU** virtual machine **(number 20)** from the login menu:

```bash
###################################################################
# Bienvenue sur le cluster TAL                                    #
#                                                                 #
# Cette machine vous propose plusieurs services dédiés au calcul. #
# Tal est un ensemble de machines stand-alone destinées à du      #
# développement ou à des tâches légères.                          #
# La liste des noeuds utilisables est visible ci-dessous.         #
#                                                                 #
# Pour toute remarque, ecrivez a systeme@lipn.univ-paris13.fr     #
#                                                                 #
###################################################################


1) Wiki             8) Inception      15) CPU            22) BNI2
2) Redmine          9) Eswc           16) Citeseer       23) LexEx
3) GPU             10) Wirarika       17) SolR           24) Stanbol
4) Neoveille       11) Kilroy         18) Morfetik       25) Quitter TAL
5) Hybride         12) Bni            19) Ganso
6) Unoporuno       13) Cartographies  20) LabexGPU
7) Sdmc            14) GPU2           21) CheneTAL
Votre choix : 20
```

Press enter. You should see the following message:

```bash
Warning: Permanently added '192.168.69.77' (ECDSA) to the list of known hosts.
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
------------------------------------------------------
       Bienvenue sur le serveur GPU du LABEX

Cette machine utilise le gestionnaire de tâches SLURM.

Pour utiliser les ressources CPU et GPU de la machine,
veuillez utiliser les commandes sbatch et srun.

Pour plus d'informations, visitez le site :

------------------------------------------------------
Last login: Thu Nov 25 17:58:00 2021 from 192.168.69.3
user_name@tal-gpu-login:~$
```

**For the moment, you need to manually source the `.bashrc` file of your NFS home every time you connect to the Labex server, in order to activate your *miniconda* GPU environment (see Section 3). So, each time you log in, you need to type:**

```bash
user_name@tal-gpu-login:~$ source .bashrc
```

Once you have installed *miniconda* and *Pytorch* (after you run the Section 3 commands), you will see the base miniconda prompt:

```bash
(base) user_name@tal-gpu-login:~$
```

## 2. Copying data

You have an NFS home on the server associated with your user account. You will copy your training data and your output models to/from this home, which is always accessible from the GPU server.

We use the `scp` command to copy data to/from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside:

```bash
# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt user_name@ssh.tal.lipn.univ-paris13.fr:~/
# copying a whole folder
$ scp -r local_folder user_name@ssh.tal.lipn.univ-paris13.fr:~/remote_folder
```
To copy data back to your local computer:
```bash
# INSIDE the lab commands
my_user@my_local_computer:~$ scp user_name@ssh.tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
```
And from outside the lab:
```bash
# OUTSIDE the lab commands
# copying files
$ scp -P 60022 my_file.txt user_name@lipnssh.univ-paris13.fr:~/
# copying folders recursively
$ scp -r -P 60022 local_folder user_name@tal.lipn.univ-paris13.fr:~/remote_folder
```
Any data that you need to copy back from the server to your computer must first be copied to your NFS home:

```bash
# OUTSIDE the lab commands
user_name@lipn-tal-labex:~$ cp any_file.txt /users/username/my_folder/
user_name@lipn-tal-labex:~$ exit
my_user@my_local_computer:~$ scp -P 60022 user_name@tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
```
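
For large training sets, `rsync` can resume interrupted transfers, which `scp` cannot. This is a minimal sketch, assuming `rsync` is installed both on your machine and on the server, and that you are connecting from outside the lab:

```bash
# OUTSIDE the lab command: copy a folder, resuming if interrupted
$ rsync -avz -e "ssh -p 60022" local_folder/ user_name@tal.lipn.univ-paris13.fr:~/remote_folder/
```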

## 3. Installing Python libraries

Before running Python code on the GPU servers, you need to install Miniconda and Pytorch.

### Miniconda

Check the [Miniconda documentation](https://docs.conda.io/en/latest/miniconda.html) to get the link of the latest Linux 64-bit miniconda installer to use with the `wget` command.

```bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
$ sh Miniconda3-py37_4.10.3-Linux-x86_64.sh
```

The installation script will run. Press the space bar to scroll through the license agreement, then type `yes` to proceed.

```bash
Welcome to Miniconda3 py37_4.10.3

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
===================================
End User License Agreement - Miniconda
===================================

Copyright 2015-2021, Anaconda, Inc.

All rights reserved under the 3-clause BSD License:
[...]
Do you accept the license terms? [yes|no]
[no] >>> yes
```

Choose the installation path in your NFS home:

```bash
Miniconda3 will now be installed into this location:
/home/garciaflores/miniconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/username/miniconda3] >>> /home/username/code/python/miniconda3
```

Now you will be asked whether you want the installer to initialize the *conda* base environment in your `.bashrc` file. Answer `yes`.

```bash
Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes
```

Manually source the `.bashrc` file of your NFS home to activate the *miniconda* environment before installing Pytorch.

```bash
user_name@lipn-tal-labex:~$ cd ~
user_name@lipn-tal-labex:~$ source .bashrc
(base) user_name@lipn-tal-labex:~$
```
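
If you prefer not to install everything into the `base` environment, you can first create a dedicated conda environment. This is a minimal sketch: the environment name `gpu-env` and the Python version are arbitrary choices.

```bash
# Create and activate a dedicated environment for your GPU experiments
(base) user_name@lipn-tal-labex:~$ conda create -n gpu-env python=3.7
(base) user_name@lipn-tal-labex:~$ conda activate gpu-env
(gpu-env) user_name@lipn-tal-labex:~$
```

The Pytorch installation below works the same way inside such an environment.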

### Pytorch

Install the [Pytorch](https://pytorch.org/) framework in your base miniconda environment:

```bash
(base) user_name@lipn-tal-labex:~$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
[...]
  pillow             pkgs/main/linux-64::pillow-8.4.0-py37h5aabda8_0
  pytorch            pytorch/linux-64::pytorch-1.10.0-py3.7_cpu_0
  pytorch-mutex      pytorch/noarch::pytorch-mutex-1.0-cpu
  torchaudio         pytorch/linux-64::torchaudio-0.10.0-py37_cpu
  torchvision        pytorch/linux-64::torchvision-0.11.1-py37_cpu
  typing_extensions  pkgs/main/noarch::typing_extensions-3.10.0.2-pyh06a4308_0
  zstd               pkgs/main/linux-64::zstd-1.4.9-haebb681_0

Proceed ([y]/n)? y
```

(Type `y` to proceed.) After a while, the installation finishes and you can test your Pytorch install.

To test it, create the following `gpu_test.py` program with your favorite editor:

```python
# Python program to count the GPU cards in the server using Pytorch
import torch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)
```

Then run it with the *SLURM* command `srun`:

```bash
(base) user_name@lipn-tal-labex:~$ srun python3 gpu_test.py
<torch.cuda.device object at 0x7f29f0602d10>
<torch.cuda.device object at 0x7f29f0602d90>
<torch.cuda.device object at 0x7f29f0602e90>
<torch.cuda.device object at 0x7f29f0618cd0>
<torch.cuda.device object at 0x7f29f0618d10>
<torch.cuda.device object at 0x7f29f0618d90>
<torch.cuda.device object at 0x7f29f0618dd0>
<torch.cuda.device object at 0x7f29f0618e10>
```
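
To also see which GPU model each device is, you can print the device names. This is a short variant of the test above; the file name `gpu_names.py` is just a suggestion.

```python
# Print the index and model name of every GPU visible to Pytorch
import torch

for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```

Run it the same way: `srun python3 gpu_names.py`.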

## 4. Using `slurm` to run your code

[Slurm](https://slurm.schedmd.com/overview.html) is the Linux workload manager we use at LIPN to schedule and queue GPU jobs.

### `srun`
This is the basic command for running jobs in Slurm. This example shows how to check the GPU models you are using and the CUDA version by running the `nvidia-smi` command with `srun`.
```bash
$ srun nvidia-smi
Mon Nov 29 13:27:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 26%   32C    P8    13W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1B:00.0 Off |                  N/A |
| 27%   31C    P8    19W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

[...]

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    4   N/A  N/A     38044      C   /usr/bin/python                  8941MiB |
|    6   N/A  N/A     38185      C   /usr/bin/python                  3889MiB |
+-----------------------------------------------------------------------------+

```

**You can use it to run Python code, but as you are working on a shared server, it is better to run your code with `sbatch`.**
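
If you do run something quick with `srun` directly, it is good practice on a shared server to request only the resources you need. The example below is a sketch using standard Slurm options, not a site-specific policy:

```bash
# Request a single GPU and a 10-minute time limit for the test script
$ srun --gres=gpu:1 --time=00:10:00 python3 gpu_test.py
```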

### `sinfo` and `scontrol`
The `sinfo` command shows which nodes are available on the server.

```bash
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
labex*       up   infinite      1    mix tal-gpu-labex1
```
If you want to check the configuration of a particular node, use `scontrol`:
```bash
$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=12 CPUTot=40 CPULoad=2.19
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8,mps:800
   NodeAddr=192.168.69.101 NodeHostName=tal-gpu-labex1 Version=19.05.5
   OS=Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18)
   RealMemory=128553 AllocMem=0 FreeMem=169091 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=labex
   BootTime=2021-08-02T15:48:18 SlurmdStartTime=2021-08-02T15:48:35
   CfgTRES=cpu=40,mem=128553M,billing=40
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
```
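
You can also ask `sinfo` for a compact per-node summary of the generic resources (GPUs), CPU usage and memory. The format string below is just one convenient choice:

```bash
# One line per node: node name, GPUs (gres), CPUs (allocated/idle/other/total), memory in MB
$ sinfo -N -o "%N %G %C %m"
```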

### `squeue`

If the server is full, your job will be put in a waiting queue by Slurm. You can check the state of the queue with `squeue`:

```bash
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              8795     labex QKVRegLA ghazi.fe  R 2-23:50:30      1 tal-gpu-labex1
              8796     labex QKVRegLA ghazi.fe  R 2-23:41:19      1 tal-gpu-labex1
              8812     labex MicrofEx gerardo.  R      24:31      1 tal-gpu-labex1
```
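
On a busy server the queue can get long; to see only your own jobs, filter by user:

```bash
# Show only the jobs submitted by your account
$ squeue -u $USER
```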

### `sbatch`
If you simply run your code with `srun`, your job will try to use all the available resources (as in the `gpu_test.py` example from Section 3). The `sbatch` command lets you configure inputs, outputs and resource requirements for your job. The following example configures the `gpu_test.py` script to use only 3 GPUs and specifies output files for the job.

First, create a `myfirst_gpu_job.sh` file:

```bash
#!/usr/bin/env bash
#SBATCH --job-name=MyFirstJob
#SBATCH --gres=gpu:3
#SBATCH --qos=qos_gpu-t4
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=1
srun python gpu_test.py
```

These parameters specify a job to be run on 1 node with 3 GPUs and a maximum run time of 100 hours. Standard output will be written to `MyFirstJob.out` and errors to `MyFirstJob.err`.

Then submit the script with `sbatch`:

```bash
$ sbatch myfirst_gpu_job.sh
```
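
If your job needs the *miniconda* environment from Section 3, remember that `.bashrc` is not sourced automatically on this server. Below is a minimal sketch of a batch script that activates conda before launching Python; the path assumes miniconda was installed in `~/code/python/miniconda3`, as in the example above, so adjust it to your own install location.

```bash
#!/usr/bin/env bash
#SBATCH --job-name=MyCondaJob
#SBATCH --gres=gpu:3
#SBATCH --output=./MyCondaJob.out
#SBATCH --error=./MyCondaJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=5
# Make conda usable in the non-interactive batch shell
# (path is an assumption based on the install example in Section 3)
source ~/code/python/miniconda3/etc/profile.d/conda.sh
conda activate base
srun python gpu_test.py
```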

### `scancel`

From time to time you will need to kill a job. Use the `JOBID` number shown by `squeue`:

```bash
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              4759      lipn my_first garciafl PD       0:00      3 (Resources)
              4760      lipn my_first garciafl PD       0:00      1 (Priority)
              4761      lipn SUPER_En   leroux PD       0:00      1 (Priority)
              4675      lipn    GGGS1 xudong.z  R 6-20:30:00      1 lipn-rtx1
              4715      lipn SUPER_En   leroux  R 5-00:03:11      1 lipn-rtx2
              4752      lipn SUPER_En   leroux  R 2-21:37:05      1 lipn-rtx2
$ scancel 4759
```
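
To cancel all of your own pending and running jobs at once:

```bash
# Cancel every job belonging to your account
$ scancel -u $USER
```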
- +