The server gives access to 8 Nvidia GeForce RTX 2080 GPUs with 8 GB of RAM each, on a single node. This server is reserved for external @LipnLab LABEX EFL research partners. You need to send us an email to request a tal-lipn
account in order to get access to this server.
You can connect through the ssh
protocol. (More help on ssh commands here.)
If you are physically connected to the LIPN network (that is, you are physically inside the lab at wonderful Villetaneuse), type the following command:
# INSIDE the lab command
$ ssh user_name@ssh.tal.lipn.univ-paris13.fr
Otherwise, if you are outside the LIPN, you should connect with the following command:
# OUTSIDE the lab command
$ ssh -p 60022 user_name@tal.lipn.univ-paris13.fr
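Optionally, to avoid retyping the port and user every time from outside the lab, you can add a host alias to the ssh configuration on your local machine. This is just a convenience sketch, not part of the official procedure: the alias name tal-labex is arbitrary and user_name must be replaced by your own account.

# OPTIONAL, on YOUR local machine: add an alias for the outside-the-lab connection
$ cat >> ~/.ssh/config << 'EOF'
Host tal-labex
    HostName tal.lipn.univ-paris13.fr
    Port 60022
    User user_name
EOF
# afterwards you can simply type:
$ ssh tal-labex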
After connecting through ssh to the TAL server, you need to choose the Labex GPU virtual machine (number 20) from the login menu:
###################################################################
#  Bienvenue sur le cluster TAL                                   #
#                                                                 #
#  Cette machine vous propose plusieurs services dédiés au calcul.#
#  Tal est un ensemble de machines stand-alone destinées à du     #
#  développement ou à des tâches légères.                         #
#  La liste des noeuds utilisables est visible ci-dessous.        #
#                                                                 #
#  Pour toute remarque, ecrivez a systeme@lipn.univ-paris13.fr    #
#                                                                 #
###################################################################
 1) Wiki            8) Inception       15) CPU           22) BNI2
 2) Redmine         9) Eswc            16) Citeseer      23) LexEx
 3) GPU            10) Wirarika        17) SolR          24) Stanbol
 4) Neoveille      11) Kilroy          18) Morfetik      25) Quitter TAL
 5) Hybride        12) Bni             19) Ganso
 6) Unoporuno      13) Cartographies   20) LabexGPU
 7) Sdmc           14) GPU2            21) CheneTAL
Votre choix : 20
Press enter. You should see the following message:
Warning: Permanently added '192.168.69.77' (ECDSA) to the list of known hosts.
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
------------------------------------------------------
 Bienvenue sur le serveur GPU du LABEX

 Cette machine utilise le gestionnaire de tâches SLURM.
 Pour utiliser les ressources CPU et GPU de la machine,
 veuillez utiliser les commandes sbatch et srun.

 Pour plus d'informations, visitez le site :
------------------------------------------------------
Last login: Thu Nov 25 17:58:00 2021 from 192.168.69.3
user_name@tal-gpu-login:~$
For the moment, you need to manually source the .bashrc file of your NFS home every time you connect to the Labex server, in order to activate your miniconda GPU environment (see Section 3). So, each time you log in, you need to type:
user_name@tal-gpu-login:~$ source .bashrc
Once you have installed Miniconda and Pytorch (after running the Section 3 commands), you will see the base miniconda prompt.
(base) user_name@tal-gpu-login:~$
You have an NFS home on the server associated with your user. You will copy your training data and your output models to/from this home, which is always accessible from the GPU server.
We use the scp
command to copy data to/from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside:
# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt user_name@ssh.tal.lipn.univ-paris13.fr:~/
# copying a whole folder
$ scp -r local_folder user_name@ssh.tal.lipn.univ-paris13.fr:~/remote_folder
To copy data back to your local computer:
# INSIDE the lab commands
my_user@my_local_computer:~$ scp user_name@ssh.tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
And from outside the lab:
# OUTSIDE the lab commands
# copying files
$ scp -P 60022 my_file.txt user_name@tal.lipn.univ-paris13.fr:~/
# copying folders recursively
$ scp -P 60022 -r local_folder user_name@tal.lipn.univ-paris13.fr:~/remote_folder
Any data that you need to copy back from the server to your computer must first be copied to your NFS home:
# OUTSIDE the lab commands
user_name@lipn-tal-labex:~$ cp any_file.txt /users/username/my_folder/
user_name@lipn-tal-labex:~$ exit
my_user@my_local_computer:~$ scp -P 60022 user_name@tal.lipn.univ-paris13.fr:~/my_folder/any_file.txt .
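If rsync is installed on both your machine and the server (an assumption, not verified here), it can be a convenient alternative to scp for large folders, since it only transfers files that have changed and can resume interrupted copies. A minimal sketch for the outside-the-lab case:

# OUTSIDE the lab command (sketch, assuming rsync is available on both ends)
$ rsync -avz -e "ssh -p 60022" local_folder/ user_name@tal.lipn.univ-paris13.fr:~/remote_folder/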
Before running python code on the GPU servers, you need to install Miniconda and Pytorch.
Check the Miniconda documentation to get the link to the latest Linux 64-bit Miniconda installer and download it with the wget
command.
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
$ sh Miniconda3-py37_4.10.3-Linux-x86_64.sh
The installation script will run. Press the space bar until you reach the end of the license agreement, then type yes
to proceed.
Welcome to Miniconda3 py37_4.10.3

In order to continue the installation process, please review the license agreement.
Please, press ENTER to continue
>>>
===================================
End User License Agreement - Miniconda
===================================

Copyright 2015-2021, Anaconda, Inc.

All rights reserved under the 3-clause BSD License:
[...]

Do you accept the license terms? [yes|no]
[no] >>> yes
Choose the installation path on your NFS home:
Miniconda3 will now be installed into this location:
/home/garciaflores/miniconda3

  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below

[/home/username/miniconda3] >>> /home/username/code/python/miniconda3
Now you will be asked whether you want to add the conda base environment to your .bashrc
file. Answer yes.
Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes
Manually source the .bashrc
file of your NFS home to activate the miniconda environment before installing Pytorch.
user_name@lipn-tal-labex:~$ cd ~
user_name@lipn-tal-labex:~$ source .bashrc
(base) user_name@lipn-tal-labex:~$
Install the Pytorch framework in your base miniconda environment:
(base) user_name@lipn-tal-labex:~$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
[...]
  pillow             pkgs/main/linux-64::pillow-8.4.0-py37h5aabda8_0
  pytorch            pytorch/linux-64::pytorch-1.10.0-py3.7_cpu_0
  pytorch-mutex      pytorch/noarch::pytorch-mutex-1.0-cpu
  torchaudio         pytorch/linux-64::torchaudio-0.10.0-py37_cpu
  torchvision        pytorch/linux-64::torchvision-0.11.1-py37_cpu
  typing_extensions  pkgs/main/noarch::typing_extensions-3.10.0.2-pyh06a4308_0
  zstd               pkgs/main/linux-64::zstd-1.4.9-haebb681_0

Proceed ([y]/n)? y
(Type y
to proceed.) Once the installation finishes, you should test your Pytorch install.
To test it, create the following gpu_test.py
program with your favorite editor:
import torch

# list every GPU device visible to Pytorch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)
Then run it with the SLURM command srun
(base) user_name@lipn-tal-labex:~$ srun python3 gpu_test.py
<torch.cuda.device object at 0x7f29f0602d10>
<torch.cuda.device object at 0x7f29f0602d90>
<torch.cuda.device object at 0x7f29f0602e90>
<torch.cuda.device object at 0x7f29f0618cd0>
<torch.cuda.device object at 0x7f29f0618d10>
<torch.cuda.device object at 0x7f29f0618d90>
<torch.cuda.device object at 0x7f29f0618dd0>
<torch.cuda.device object at 0x7f29f0618e10>
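As a quicker sanity check (a sketch, not part of the official procedure), you can also ask Pytorch directly whether CUDA is usable and how many GPUs it sees, without writing a file. If the install is correct it should print True followed by the number of visible GPUs; if it prints False, Pytorch was installed without GPU support.

# one-line check of CUDA availability and GPU count (sketch)
(base) user_name@lipn-tal-labex:~$ srun python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"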
Slurm is the Linux workload manager we use at LIPN to schedule and queue GPU jobs.
This is the basic command for running jobs in Slurm. The following example checks which GPU models you are using and which CUDA version is installed by running the nvidia-smi
command with srun
.
$ srun nvidia-smi
Mon Nov 29 13:27:00 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 26%   32C    P8    13W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:1B:00.0 Off |                  N/A |
| 27%   31C    P8    19W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
[...]
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    4   N/A  N/A     38044      C   /usr/bin/python                  8941MiB |
|    6   N/A  N/A     38185      C   /usr/bin/python                  3889MiB |
+-----------------------------------------------------------------------------+
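If you only want to occupy part of the machine for a quick interactive command, srun accepts the same --gres option used later in the sbatch script; the exact limits you are allowed to request depend on the cluster configuration, so treat this as a sketch:

# request a single GPU for an interactive command (sketch)
$ srun --gres=gpu:1 nvidia-smi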
You can use srun to run Python code directly, but since you are working on a shared server, it is better to run your code with sbatch (see below).
The sinfo command shows how many nodes are available on the server.
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
labex*       up   infinite      1    mix tal-gpu-labex1
If you want to check the particular configuration of a node, use scontrol
$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=12 CPUTot=40 CPULoad=2.19
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:8,mps:800
   NodeAddr=192.168.69.101 NodeHostName=tal-gpu-labex1 Version=19.05.5
   OS=Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18)
   RealMemory=128553 AllocMem=0 FreeMem=169091 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=labex
   BootTime=2021-08-02T15:48:18 SlurmdStartTime=2021-08-02T15:48:35
   CfgTRES=cpu=40,mem=128553M,billing=40
   AllocTRES=cpu=12
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
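scontrol can also display the details of a single job (allocated resources, working directory, time limit). The job id below is only a placeholder; any JOBID reported by squeue works:

# inspect a running or pending job (sketch, replace the id with your own)
$ scontrol show job 8795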
If the server is full, Slurm will put your job in a waiting queue. You can check the queue state with squeue.
$ squeue
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    8795     labex QKVRegLA ghazi.fe  R 2-23:50:30      1 tal-gpu-labex1
    8796     labex QKVRegLA ghazi.fe  R 2-23:41:19      1 tal-gpu-labex1
    8812     labex MicrofEx gerardo.  R      24:31      1 tal-gpu-labex1
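On a busy day the queue can be long; to list only your own jobs, you can filter by user:

# show only your jobs
$ squeue -u user_name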
If you simply run your code with srun
, your job will try to use all the available resources (like in the gpu_test.py
example from Section 3 - Pytorch). The sbatch
command is therefore useful to configure inputs, outputs and resource requirements for your job. The following example configures the gpu_test.py
script to use only 3 GPUs, and specifies output files for the job.
First, create a myfirst_gpu_job.sh
file:
#!/usr/bin/env bash
#SBATCH --job-name=MyFirstJob
#SBATCH --gres=gpu:3
#SBATCH --qos=qos_gpu-t4
#SBATCH --cpus-per-task=5
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1

srun python gpu_test.py
These parameters specify a job to be run on 1 node with 3 GPUs and a maximum run time of 100 hours. Normal output will be written to MyFirstJob.out
and errors to MyFirstJob.err.
Then you run the script with sbatch
$ sbatch myfirst_gpu_job.sh
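Once submitted, sbatch prints the JOBID and returns immediately. A simple way to follow the job (a sketch, using the output file declared in the script above) is to watch that file and check the queue:

# follow the job output as it is written
$ tail -f MyFirstJob.out
# check whether the job is running (R) or pending (PD)
$ squeue -u user_name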
From time to time you will need to kill a job. To do so, use the scancel command with the JOBID
number shown by the squeue
command:
$ squeue
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    4759      lipn my_first garciafl PD       0:00      3 (Resources)
    4760      lipn my_first garciafl PD       0:00      1 (Priority)
    4761      lipn SUPER_En   leroux PD       0:00      1 (Priority)
    4675      lipn    GGGS1 xudong.z  R 6-20:30:00      1 lipn-rtx1
    4715      lipn SUPER_En   leroux  R 5-00:03:11      1 lipn-rtx2
    4752      lipn SUPER_En   leroux  R 2-21:37:05      1 lipn-rtx2

$ scancel 4759
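If you need to cancel everything you have submitted (for instance after launching a misconfigured batch of jobs), scancel also accepts a user filter:

# cancel all of your own jobs (use with care)
$ scancel -u user_name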
For any questions about this doc, write to Jorge Garcia Flores.