Accessing the TAL/LABEX EFL GPU server
The server gives access to one node with 8 Nvidia GeForce RTX 2080 GPUs, each with 8 GB of RAM. This server is reserved for external @LipnLab LABEX EFL research partners. You need to send us an email to request a tal-lipn account in order to get access to this server.
1. Connecting to the server
You can connect through the ssh protocol. (More help on ssh commands here.)
If you are physically connected to the LIPN network (that is, you are physically inside the lab at wonderful Villetaneuse), type the following command:
# INSIDE the lab command
$ ssh user_name@ssh.tal.lipn.univ-paris13.fr
Otherwise, if you are outside the LIPN, you should connect with the following command:
# OUTSIDE the lab command
$ ssh -p 60022 user_name@tal.lipn.univ-paris13.fr
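If you connect often from outside the lab, you may find it convenient to store the port and user name in your ssh client configuration. The entry below is only a sketch: the tal-labex alias is our own invention, and you should replace user_name with your actual account.

# Possible ~/.ssh/config entry for connections from OUTSIDE the lab (tal-labex is a made-up alias)
Host tal-labex
    HostName tal.lipn.univ-paris13.fr
    Port 60022
    User user_name

With this entry in place, ssh tal-labex should behave like the command above.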
After you connect through ssh to the TAL server, you need to choose the Labex GPU virtual machine (number 20) from the login menu:
###################################################################
#                 Bienvenue sur le cluster TAL                    #
#                                                                 #
# Cette machine vous propose plusieurs services dédiés au calcul. #
# Tal est un ensemble de machines stand-alone destinées à du      #
# développement ou à des tâches légères.                          #
# La liste des noeuds utilisables est visible ci-dessous.         #
#                                                                 #
# Pour toute remarque, ecrivez a systeme@lipn.univ-paris13.fr     #
#                                                                 #
###################################################################
 1) Wiki          8) Inception      15) CPU          22) BNI2
 2) Redmine       9) Eswc           16) Citeseer     23) LexEx
 3) GPU          10) Wirarika       17) SolR         24) Stanbol
 4) Neoveille    11) Kilroy         18) Morfetik     25) Quitter TAL
 5) Hybride      12) Bni            19) Ganso
 6) Unoporuno    13) Cartographies  20) LabexGPU
 7) Sdmc         14) GPU2           21) CheneTAL
Votre choix : 20
Press enter. You should see the following message:
Warning: Permanently added '192.168.69.77' (ECDSA) to the list of known hosts.
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
------------------------------------------------------
Bienvenue sur le serveur GPU du LABEX
Cette machine utilise le gestionnaire de tâches SLURM.
Pour utiliser les ressources CPU et GPU de la machine,
veuillez utiliser les commandes sbatch et srun.
Pour plus d'informations, visitez le site :
------------------------------------------------------
Last login: Thu Nov 25 17:58:00 2021 from 192.168.69.3
user_name@tal-gpu-login:~$
For the moment, you need to manually source the .bashrc file of your NFS home every time you connect to the Labex server, in order to activate your miniconda GPU environment (see Section 3). So, each time you log in, you need to type:
user_name@tal-gpu-login:~$ source .bashrc
Once you have installed miniconda and PyTorch (after you run the commands in Section 3), you will see the base miniconda prompt.
(base) user_name@tal-gpu-login:~$
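If you prefer not to type the source command at every login, one possible workaround (untested on this server, and only valid if your login shell reads ~/.profile) is to make the login shell do it for you:

# Optional workaround: source .bashrc automatically at login (assumes ~/.profile is read by the login shell)
user_name@tal-gpu-login:~$ echo 'source ~/.bashrc' >> ~/.profile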
2. Copying data
You have an NFS home on the server associated with your user account. You will copy your training data and your output models to/from this home, which is always accessible from the GPU server.
We use the scp command to copy data to/from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside:
# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt user_name@ssh.tal.lipn.univ-paris13.fr:~/
# copying a whole folder
$ scp -r local_folder user_name@ssh.tal.lipn.univ-paris13.fr:~/remote_folder
To copy data back to your local computer:
# INSIDE the lab commands
my_user@my_local_computer:~$ scp user_name@ssh.tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
And from outside the lab:
# OUTSIDE the lab commands
# copying files
$ scp -P 60022 my_file.txt user_name@tal.lipn.univ-paris13.fr:~/
# copying folders recursively
$ scp -P 60022 -r local_folder user_name@tal.lipn.univ-paris13.fr:~/remote_folder
Any data that you need to copy back from the server to your computer must be copied to your NFS home:
# OUTSIDE the lab commands
user_name@lipn-tal-labex:~$ cp any_file.txt /users/username/my_folder/
user_name@lipn-tal-labex:~$ exit
my_user@my_local_computer:~$ scp -P 60022 user_name@tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
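For large training corpora or model checkpoints, rsync over ssh can be handier than scp, because an interrupted transfer can be resumed. This is only a sketch, assuming rsync is installed both on your machine and on the server:

# OUTSIDE the lab: resumable recursive copy with rsync (sketch, assumes rsync on both ends)
$ rsync -avz --partial -e "ssh -p 60022" local_folder/ user_name@tal.lipn.univ-paris13.fr:~/remote_folder/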
3. Installing Python libraries
Before running Python code on the GPU server, you need to install Miniconda and PyTorch.
Miniconda
Check the Miniconda documentation to get the link to the latest Linux 64-bit Miniconda installer, then download it with the wget command.
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
$ sh Miniconda3-py37_4.10.3-Linux-x86_64.sh
The installation script will run. Press the space bar to scroll to the end of the license agreement, then type yes to proceed.
Welcome to Miniconda3 py37_4.10.3

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>
===================================
End User License Agreement - Miniconda
===================================

Copyright 2015-2021, Anaconda, Inc.

All rights reserved under the 3-clause BSD License:
[...]

Do you accept the license terms? [yes|no]
[no] >>> yes
Choose the installation path in your NFS home:
Miniconda3 will now be installed into this location:
/home/garciaflores/miniconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/username/miniconda3] >>> /home/username/code/python/miniconda3
Now you will be asked whether you want to add the conda base environment to your .bashrc file. Answer yes.
Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes
Manually source the .bashrc file of your NFS home to activate the miniconda environment before installing PyTorch.
user_name@lipn-tal-labex:~$ cd ~
user_name@lipn-tal-labex:~$ source .bashrc
(base) user_name@lipn-tal-labex:~$
PyTorch
Install the PyTorch framework in your base miniconda environment:
(base) user_name@lipn-tal-labex:~$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
[...]
  pillow             pkgs/main/linux-64::pillow-8.4.0-py37h5aabda8_0
  pytorch            pytorch/linux-64::pytorch-1.10.0-py3.7_cpu_0
  pytorch-mutex      pytorch/noarch::pytorch-mutex-1.0-cpu
  torchaudio         pytorch/linux-64::torchaudio-0.10.0-py37_cpu
  torchvision        pytorch/linux-64::torchvision-0.11.1-py37_cpu
  typing_extensions  pkgs/main/noarch::typing_extensions-3.10.0.2-pyh06a4308_0
  zstd               pkgs/main/linux-64::zstd-1.4.9-haebb681_0

Proceed ([y]/n)? y
Type y to proceed. After the installation finishes, you should test your PyTorch install.
To test it, create the following gpu_test.py program with your favorite editor:
# Python program to count GPU cards in the server using PyTorch
import torch

available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)
Then run it with the SLURM command srun:
(base) user_name@lipn-tal-labex:~$ srun python3 gpu_test.py
<torch.cuda.device object at 0x7f29f0602d10>
<torch.cuda.device object at 0x7f29f0602d90>
<torch.cuda.device object at 0x7f29f0602e90>
<torch.cuda.device object at 0x7f29f0618cd0>
<torch.cuda.device object at 0x7f29f0618d10>
<torch.cuda.device object at 0x7f29f0618d90>
<torch.cuda.device object at 0x7f29f0618dd0>
<torch.cuda.device object at 0x7f29f0618e10>
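If you also want to see the GPU model names and confirm that CUDA is visible to PyTorch, a one-liner along these lines should work (same idea as gpu_test.py, just more verbose):

# Sketch: print CUDA availability and the name of each visible GPU
(base) user_name@lipn-tal-labex:~$ srun python3 -c "import torch; print(torch.cuda.is_available()); [print(torch.cuda.get_device_name(i)) for i in range(torch.cuda.device_count())]"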
4. Using `slurm` to run your code
Slurm is the Linux workload manager we use at LIPN to schedule and queue GPU jobs.
$ srun
This is the basic command for running jobs in Slurm. This example shows how to check the GPU models you are using and the CUDA version by running the nvidia-smi command with srun.
$ srun nvidia-smi
Mon Nov 29 13:27:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 26%   32C    P8    13W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1B:00.0 Off |                  N/A |
| 27%   31C    P8    19W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
[...]
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    4   N/A  N/A     38044      C   /usr/bin/python                  8941MiB |
|    6   N/A  N/A     38185      C   /usr/bin/python                  3889MiB |
+-----------------------------------------------------------------------------+
You can use it to run Python code, but since you are working on a shared server, it is better to run your code with sbatch.
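If you really need an interactive session with srun, at least restrict it to the resources you need. For example, the following should request a single GPU and open a shell on the node (a sketch, assuming interactive allocations are allowed on the labex partition):

# Sketch: interactive shell limited to one GPU
$ srun --gres=gpu:1 --pty bash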
$ sinfo / scontrol
The sinfo command shows how many nodes are available on the server.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
labex*       up   infinite      1    mix tal-gpu-labex1
If you want to check the particular configuration of a node, use scontrol:
$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=12 CPUTot=40 CPULoad=2.19
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8,mps:800
   NodeAddr=192.168.69.101 NodeHostName=tal-gpu-labex1 Version=19.05.5
   OS=Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18)
   RealMemory=128553 AllocMem=0 FreeMem=169091 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=labex
   BootTime=2021-08-02T15:48:18 SlurmdStartTime=2021-08-02T15:48:35
   CfgTRES=cpu=40,mem=128553M,billing=40
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
$ squeue
If the server is full, Slurm will put your job on a waiting queue. You can check the queue state with squeue.
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   8795     labex QKVRegLA ghazi.fe  R 2-23:50:30      1 tal-gpu-labex1
   8796     labex QKVRegLA ghazi.fe  R 2-23:41:19      1 tal-gpu-labex1
   8812     labex MicrofEx gerardo.  R      24:31      1 tal-gpu-labex1
$ sbatch
If you simply run your code with srun, your job will try to use all the available resources (as in the gpu_test.py example from Section 3). The sbatch command lets you configure inputs, outputs and resource requirements for your job. The following example configures the gpu_test.py program to use only 3 GPUs, and specifies output files for the job.
First, create a myfirst_gpu_job.sh file:
#!/usr/bin/env bash
#SBATCH --job-name=MyFirstJob
#SBATCH --gres=gpu:3
#SBATCH --qos=qos_gpu-t4
#SBATCH --cpus-per-task=5
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

srun python gpu_test.py
These parameters specify a job to be run on 1 node, with 3 GPUs, within a maximum time of 100 hours. Normal output will be sent to MyFirstJob.out and errors to MyFirstJob.err.
Then you run the script with sbatch:
$ sbatch myfirst_gpu_job.sh
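After submitting, you can check that the job was accepted and follow its output file, for example with (replace user_name with your account):

# Monitor your job and follow its output
$ squeue -u user_name
$ tail -f MyFirstJob.out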
$ scancel
From time to time you will need to kill a job. Use the JOBID number shown by the squeue command:
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   4759      lipn my_first garciafl PD       0:00      3 (Resources)
   4760      lipn my_first garciafl PD       0:00      1 (Priority)
   4761      lipn SUPER_En   leroux PD       0:00      1 (Priority)
   4675      lipn    GGGS1 xudong.z  R 6-20:30:00      1 lipn-rtx1
   4715      lipn SUPER_En   leroux  R 5-00:03:11      1 lipn-rtx2
   4752      lipn SUPER_En   leroux  R 2-21:37:05      1 lipn-rtx2

$ scancel 4759
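If you need to cancel several jobs at once, scancel also accepts a user filter, which kills all of your own jobs:

# Cancel every job belonging to your user
$ scancel -u user_name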
Troubleshooting
For any questions about this doc, write to Jorge Garcia Flores.