# LIPN Deep Learning

Any questions about this doc, write to [Jorge Garcia Flores](mailto:jgflores@lipn.fr) or to the lipn-gpu mailing list (you need to be granted access to it first).

## 0. The LABEX EFL server

- **`lipn-tal-labex`**: 8 Nvidia GeForce RTX 2080 GPUs with 8 GB of RAM each, in one node.

The lab also hosts other GPU resources: the LIPN GPU server (11 Nvidia Quadro RTX 5000 GPUs with 16 GB of RAM each, distributed over three nodes; access is reserved to LIPN researchers, guests and associated members and requires a LIPN intranet account) and a LIPN-L2TI server currently being deployed (8 Nvidia A40 accelerators with 48 GB of RAM each; documentation soon).

## 1. Connecting to the Labex EFL server

You can connect through the `ssh` protocol.

If you are connected from inside the LIPN network (that is, you are physically at the lab), use the following command:

```bash
# INSIDE the lab command
$ ssh user_name@ssh.tal.lipn.univ-paris13.fr
```

Otherwise, if you are outside the LIPN, you should connect with the following command:

```bash
# OUTSIDE the lab command
$ ssh -p 60022 user_name@tal.lipn.univ-paris13.fr
```
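
If you connect often from outside the lab, you can optionally store the hostname and port in your local `~/.ssh/config` so a plain `ssh` command works. This is only a convenience sketch, assuming OpenSSH on your own computer; the alias `labex` is an arbitrary local name, not something defined by the server.

```bash
# Run this on your OWN computer, not on the server.
# "labex" is just a local alias; replace user_name with your LIPN login.
cat >> ~/.ssh/config <<'EOF'
Host labex
    HostName tal.lipn.univ-paris13.fr
    Port 60022
    User user_name
EOF
```

After that, `ssh labex` (and `scp`/`rsync` using the same alias) will pick up the right port automatically.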

After you have connected through ssh to the TAL server, you need to choose the **Labex GPU** virtual machine **(number 20)** from the login menu:

```bash
###################################################################
#                  Bienvenue sur le cluster TAL                   #
#                                                                 #
# Cette machine vous propose plusieurs services dédiés au calcul. #
# Tal est un ensemble de machines stand-alone destinées à du     #
# développement ou à des tâches légères.                         #
# La liste des noeuds utilisables est visible ci-dessous.         #
#                                                                 #
# Pour toute remarque, ecrivez a systeme@lipn.univ-paris13.fr     #
#                                                                 #
###################################################################


 1) Wiki          8) Inception
 2) Redmine
 3) GPU          10) Wirarika
 4) Neoveille
 5) Hybride
 6) Unoporuno
 7) Sdmc         14) GPU2         21) CheneTAL
Votre choix : 20
```

Press enter. You should see the following message:

```bash
Warning: Permanently added '[...]' to the list of known hosts.
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
------------------------------------------------------

Cette machine utilise le gestionnaire de tâches SLURM.

Pour utiliser les ressources CPU et GPU de la machine,
veuillez utiliser les commandes sbatch et srun.

Pour plus d'informations [...]

------------------------------------------------------
Last login: Thu Nov 25 17:58:00 2021 from 192.168.69.3
user_name@tal-gpu-login:~$
```

**For the moment, you need to manually source the `.bashrc` file of your NFS home every time you connect to the Labex server, in order to activate your *miniconda* GPU environment:**

```bash
user_name@tal-gpu-login:~$ source ~/.bashrc
```

Once you have installed *miniconda* (see Section 3 below), your prompt will show the `(base)` environment prefix:

```bash
(base) user_name@tal-gpu-login:~$
```


## 2. Copying data

You have an NFS home on the server, associated to your user. You will copy your training data and your output models to/from this home, which is always accessible from the GPU server.

We use the `scp` command to copy data to/from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside:

```bash
# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt user_name@ssh.tal.univ-paris13.fr:
# copying a whole folder
$ scp -r local_folder user_name@ssh.tal.univ-paris13.fr:
```

To copy data back to your local computer:

```bash
# INSIDE the lab commands
my_user@my_local_computer:~$ scp user_name@ssh.tal.univ-paris13.fr:my_file.txt .
```

And from outside the lab:

```bash
# OUTSIDE the lab commands
# copying files
$ scp -P 60022 my_file.txt user_name@lipnssh.univ-paris13.fr:
# copying folders recursively
$ scp -P 60022 -r local_folder user_name@tal.lipn.univ-paris13.fr:
```
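
For large folders, `rsync` over ssh can resume interrupted transfers and only copies files that have changed. This is an optional alternative to `scp`, assuming `rsync` is installed on both your machine and the server:

```bash
# OUTSIDE the lab: use the same ssh port; -a keeps permissions and timestamps,
# -P shows progress and allows resuming partial transfers
$ rsync -aP -e "ssh -p 60022" local_folder/ user_name@tal.lipn.univ-paris13.fr:local_folder/
```
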
Any data that you need to copy back from the server to your computer must first be placed in your NFS home:

```bash
# OUTSIDE the lab commands
# on the server: copy your results into your NFS home
user_name@lipn-tal-labex:~$ cp my_file.txt ~/
# then, from your local computer, fetch them with scp
my_user@my_local_computer:~$ scp -P 60022 user_name@lipnssh.univ-paris13.fr:my_file.txt .
```


## 3. Installing Python libraries

Before running Python code on the GPU server, you need to install Miniconda and Pytorch.

### Miniconda

Check the [Miniconda documentation](https://docs.conda.io/en/latest/miniconda.html), then download and run the installer:

```bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
$ sh Miniconda3-py37_4.10.3-Linux-x86_64.sh
```
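
Optionally, you can check that the download is intact before running it; compare the printed hash against the one published on the Miniconda download page for this exact installer version.

```bash
# Print the SHA-256 hash of the installer and compare it with the official list
$ sha256sum Miniconda3-py37_4.10.3-Linux-x86_64.sh
```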

The installation script will run. Press the space bar to scroll to the end of the license agreement, then type `yes` to proceed.

```bash
Welcome to Miniconda3 py37_4.10.3

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
===================================
End User License Agreement - Miniconda
===================================

Copyright 2015-2021, Anaconda, Inc.

All rights reserved under the 3-clause BSD License:
[...]
Do you accept the license terms? [yes|no]
[no] >>>
```

Choose the installation path on your NFS home:

```bash
Miniconda3 will now be installed into this location:
[...]

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[...] >>>
```

Now you will be asked if you want to add the *conda* base environment to your `.bashrc` file. Answer yes.

```bash
Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>>
```

Manually source the `.bashrc` file of your NFS home to activate the *miniconda* environment before installing Pytorch:

```bash
user_name@lipn-tal-labex:~$ source ~/.bashrc
(base) user_name@lipn-tal-labex:~$
```
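
Optionally, instead of working in the `base` environment, you can create a dedicated *conda* environment for your project. This is only a sketch; the environment name `labex-dl` is an illustrative choice, not something defined on the server.

```bash
# Create and activate a project-specific environment (the name is arbitrary);
# python=3.7 matches the Miniconda py37 installer used above
(base) user_name@lipn-tal-labex:~$ conda create -n labex-dl python=3.7
(base) user_name@lipn-tal-labex:~$ conda activate labex-dl
(labex-dl) user_name@lipn-tal-labex:~$
```

Pytorch (next section) can then be installed inside that environment instead of `base`.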


### Pytorch

Install the [Pytorch](https://pytorch.org) library with *conda*, using the install command suggested on the Pytorch site for your CUDA version:

```bash
(base) user_name@lipn-tal-labex:~$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
[...]
  pillow
  pytorch
  pytorch-mutex
  torchaudio
  torchvision
  typing_extensions
  zstd

Proceed ([y]/n)? y
```

Type `y` to proceed. Once the installation finishes, you need to test your Pytorch install.

To test it, create the following `gpu_test.py` program with your favorite editor:

```python
# Python program to count the GPU cards available in the server
import torch

available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)
```

Then run it with the *SLURM* command `srun`:

```bash
(base) user_name@lipn-tal-labex:~$ srun python gpu_test.py
<torch.cuda.device object at 0x...>
<torch.cuda.device object at 0x...>
<torch.cuda.device object at 0x...>
<torch.cuda.device object at 0x...>
<torch.cuda.device object at 0x...>
<torch.cuda.device object at 0x...>
<torch.cuda.device object at 0x...>
<torch.cuda.device object at 0x...>
```

One `<torch.cuda.device ...>` line is printed per visible GPU (eight on the Labex node).
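
If you want a slightly more informative check, the sketch below also prints each card's model name and whether CUDA is usable at all. It is a hypothetical helper (not part of the original doc) relying only on standard `torch.cuda` calls.

```python
# gpu_info.py -- optional, more verbose variant of gpu_test.py
import torch

# True only if the CUDA driver and at least one GPU are visible to the job
print("CUDA available:", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
    # get_device_name() returns the card's model string
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```

Run it the same way, e.g. `srun python gpu_info.py`.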

## 4. Using `slurm` to run your code

[Slurm](https://slurm.schedmd.com) is the job scheduler used on the server: every CPU/GPU job must be submitted through it.

### `srun`
This is the basic command for running jobs in Slurm. This example shows how to check the GPU models you are using and the CUDA version by running the `nvidia-smi` command with `srun`:
```bash
$ srun nvidia-smi
Mon Nov 29 13:27:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
[...]
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
[...]
+-----------------------------------------------------------------------------+
```

**You can use it to run Python code, but as you are working on a shared server, it is better to run your code with `sbatch`.**
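
For quick interactive tests, a hedged example is to restrict `srun` to a single GPU instead of grabbing the whole node; `--gres=gpu:1` is standard Slurm syntax and the same option used in the `sbatch` script further below.

```bash
# Ask Slurm for one GPU only and run the test script on it
$ srun --gres=gpu:1 python gpu_test.py
```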

### `sinfo` and `scontrol`
The `sinfo` command shows the partitions and nodes available on the server:

```bash
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
labex*       up [...]
```
If you want to check the particular configuration of a node, use `scontrol`:
```bash
$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
[...]
```
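
If you just want a per-node summary of the GPUs declared in Slurm and the current node state, `sinfo` accepts an output format string; `%G` prints the generic resources (gres), so the one-liner below is a convenient sketch.

```bash
# One line per node: node name, declared gres (GPUs) and current state
$ sinfo -N -o "%N %G %t"
```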

### `squeue`

If the server is full, Slurm will put your job in a waiting queue. You can check the state of the queue with `squeue`:

```bash
$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  8795     labex QKVRegLA ghazi.fe [...]
  8796     labex QKVRegLA ghazi.fe [...]
  8812     labex MicrofEx gerardo. [...]
```
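
To see only your own jobs rather than the whole queue, `squeue` takes a standard `-u` filter:

```bash
# Show only the jobs belonging to your user
$ squeue -u user_name
```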

### `sbatch`
If you simply run your code with `srun`, your job will try to use all the available resources (as in the `gpu_test.py` example from Section 3, Pytorch). The `sbatch` command lets you configure the inputs, outputs and resource requirements of your job. The following example configures the `gpu_test.py` script to use only 3 GPUs, and specifies output files for the job.

First, create a `myfirst_gpu_job.sh` file:

```bash
#!/bin/bash
#SBATCH --job-name=MyFirstJob
#SBATCH --gres=gpu:3
#SBATCH --qos=qos_gpu-t4
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=5
#SBATCH --ntasks-per-node=1
srun python gpu_test.py
```

These parameters specify a job run on 1 node, with 3 GPUs, 5 CPUs per task and a maximum time of 100 hours. Normal output will be sent to `MyFirstJob.out` and errors to `MyFirstJob.err`.

Then run the script with `sbatch`:

```bash
$ sbatch myfirst_gpu_job.sh
```
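
After submitting, the job's standard output goes to the file declared with `--output`, so you can follow it live with `tail`; if Slurm accounting is enabled on this cluster, `sacct` also summarizes finished jobs (the job id below is only a placeholder).

```bash
# Follow the job's output file as it is written
$ tail -f MyFirstJob.out
# Summary of a finished job (works only if accounting is enabled)
$ sacct -j 8795 --format=JobID,JobName,Elapsed,State
```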

### `scancel`

From time to time you will need to kill a job. Use the `JOBID` number shown by the `squeue` command:

```bash
$ squeue
 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  4759      lipn my_first garciafl PD [...]
  4760      lipn my_first garciafl PD [...]
  4761      lipn SUPER_En [...]
  4675      lipn GGGS1    xudong.z [...]
  4715      lipn SUPER_En [...]
  4752      lipn SUPER_En [...]
$ scancel 4759
```