
Accessing the TAL/LABEX EFL GPU server

  • lipn-tal-labex provides access to 8 Nvidia GeForce RTX 2080 GPUs, with 8 GB of RAM each, in one node. To get access to this server, email Jorge Garcia Flores and ask for a tal-lipn account. (Spoiler alert: for the moment, a standard LIPN intranet account is useless for this server; you really need a tal-lipn account to gain access to this part of the network.)

You can connect using the ssh protocol. (More help on ssh commands here.)

If you are physically connected to the LIPN network (that is, you are physically inside the lab at wonderful Villetaneuse), type the following command:

# INSIDE the lab command
$ ssh

Otherwise, if you are outside the LIPN, you should connect with the following command:

# OUTSIDE the lab command
$ ssh -p 60022
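
The two commands differ only in the port option. As a minimal sketch, that choice can be wrapped in a small Python helper. The function name `build_ssh_command` is hypothetical, the `user@host` target form is an assumption, and `user_name` and `gateway.example.org` are placeholders, not the real account or host name.

```python
def build_ssh_command(user, host, inside_lab):
    """Return the ssh argument list for reaching the TAL server.

    `user` and `host` are placeholders supplied by the caller; port 60022
    is the one used in the OUTSIDE-the-lab command.
    """
    command = ["ssh"]
    if not inside_lab:
        # From outside the LIPN network, ssh must target port 60022.
        command += ["-p", "60022"]
    command.append(f"{user}@{host}")
    return command


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    argv = build_ssh_command("user_name", "gateway.example.org", inside_lab=False)
    print(" ".join(argv))
```

The list form can be passed directly to `subprocess.run` if you want to script the connection.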

After you have connected through ssh to the TAL server, choose the Labex GPU virtual machine (number 20) from the login menu:

# Bienvenue sur le cluster TAL                                    #
#                                                                 #
# Cette machine vous propose plusieurs services dédiés au calcul. #
# Tal est un ensemble de machines stand-alone destinées à du      #
# développement ou à des tâches légères.                          #
# La liste des noeuds utilisables est visible ci-dessous.         #
#                                                                 #
# Pour toute remarque, ecrivez a     #
#                                                                 #
1) Wiki             8) Inception      15) CPU            22) BNI2
2) Redmine          9) Eswc           16) Citeseer       23) LexEx
3) GPU             10) Wirarika       17) SolR           24) Stanbol
4) Neoveille       11) Kilroy         18) Morfetik       25) Quitter TAL
5) Hybride         12) Bni            19) Ganso
6) Unoporuno       13) Cartographies  20) LabexGPU
7) Sdmc            14) GPU2           21) CheneTAL
Votre choix : 20

Press enter. You should see the following message:

Warning: Permanently added '' (ECDSA) to the list of known hosts.
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
       Bienvenue sur le serveur GPU du LABEX
Cette machine utilise le gestionnaire de tâches SLURM. 
Pour utiliser les ressources CPU et GPU de la machine, 
veuillez utiliser les commandes sbatch et srun. 
Pour plus d'informations, visitez le site :
Last login: Thu Nov 25 17:58:00 2021 from

For the moment, you need to manually source the .bashrc file of your NFS home every time you connect to the Labex server, in order to activate your miniconda GPU environment (see section 3). So, each time you log in, type:

user_name@tal-gpu-login:~$ source .bashrc

Once you have installed miniconda and PyTorch (that is, after running the section 3 commands), you will see the base miniconda prompt:

(base) user_name@tal-gpu-login:~$

You have an NFS home on the server associated with your user account. You will copy your training data and your output models to and from this home, which is always accessible from the GPU server.

We use the scp command to copy data to and from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside it:

# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt
# copying a whole folder 
$ scp -r local_folder

To copy data back to your local computer:

# INSIDE the lab commands
my_user@my_local_computer:~$ scp .

And from outside the lab:

# OUTSIDE the lab commands
# copying files 
$ scp -P 60022 my_file.txt
# copying folders recursively
$ scp -P 60022 -r local_folder

Any data that you need to copy back from the server to your computer must be copied to your NFS home:

# OUTSIDE the lab commands
user_name@lipn-tal-labex:~$ cp any_file.txt /users/username/my_folder/
user_name@lipn-tal-labex:~$ exit
my_user@my_local_computer:~$ scp -P 60022 .
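
Note that scp takes the port with an uppercase `-P`, whereas ssh uses a lowercase `-p`. The following sketch builds the scp argument list for both directions. The helper name `build_scp_command` is hypothetical, and the remote target `user_name@gateway.example.org:~/` is a placeholder written in the usual `user@host:path` form, not the real server address.

```python
def build_scp_command(source, destination, inside_lab, recursive=False):
    """Return the scp argument list for copying `source` to `destination`.

    Either argument may be a remote path in the usual user@host:path form.
    """
    command = ["scp"]
    if not inside_lab:
        # scp expects an uppercase -P for the port; a lowercase -p would
        # instead preserve file times and modes.
        command += ["-P", "60022"]
    if recursive:
        # -r copies a whole folder.
        command.append("-r")
    command += [source, destination]
    return command


if __name__ == "__main__":
    remote_home = "user_name@gateway.example.org:~/"  # placeholder target
    # Upload a folder from outside the lab.
    print(" ".join(build_scp_command("local_folder", remote_home,
                                     inside_lab=False, recursive=True)))
    # Download one file into the current directory from inside the lab.
    print(" ".join(build_scp_command(remote_home + "any_file.txt", ".",
                                     inside_lab=True)))
```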

Before running Python code on the GPU servers, you need to install Miniconda and PyTorch.

Check the Miniconda documentation to get the link to the latest Linux 64-bit Miniconda installer to use with the wget command.

$ wget
$ sh

The installation script will run. Press the space bar until you reach the end of the license agreement, then type yes to proceed.

Welcome to Miniconda3 py37_4.10.3
In order to continue the installation process, please review the license
Please, press ENTER to continue
End User License Agreement - Miniconda
Copyright 2015-2021, Anaconda, Inc.
All rights reserved under the 3-clause BSD License:
Do you accept the license terms? [yes|no]
[no] >>> yes

Choose the installation path on your NFS home:

Miniconda3 will now be installed into this location:
  - Press ENTER to confirm the location
  - Press CTRL-C to abort the installation
  - Or specify a different location below
[/home/username/miniconda3] >>> /home/username/code/python/miniconda3

Now you will be asked whether you want to add the conda base environment to your .bashrc file. Answer yes.

Preparing transaction: done
Executing transaction: done
installation finished.
Do you wish the installer to initialize Miniconda3
by running conda init? [yes|no]
[no] >>> yes

Manually source the .bashrc file in your NFS home to activate the miniconda environment before installing PyTorch.

user_name@lipn-tal-labex:~$ cd ~
user_name@lipn-tal-labex:~$ source .bashrc
(base) user_name@lipn-tal-labex:~$

Install the PyTorch framework in your base miniconda environment:

(base) user_name@lipn-tal-labex:~$ conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
  pillow             pkgs/main/linux-64::pillow-8.4.0-py37h5aabda8_0
  pytorch            pytorch/linux-64::pytorch-1.10.0-py3.7_cpu_0
  pytorch-mutex      pytorch/noarch::pytorch-mutex-1.0-cpu
  torchaudio         pytorch/linux-64::torchaudio-0.10.0-py37_cpu
  torchvision        pytorch/linux-64::torchvision-0.11.1-py37_cpu
  typing_extensions  pkgs/main/noarch::typing_extensions-
  zstd               pkgs/main/linux-64::zstd-1.4.9-haebb681_0
Proceed ([y]/n)? y

(Type y to proceed.) Once the installation finishes, you need to test your PyTorch install.

To test it, create the following program with your favorite editor:


# Python program to count GPU cards in the server using PyTorch
import torch

# One torch.cuda.device object per GPU visible to PyTorch
available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)



Then run it with the SLURM command srun:

(base) user_name@lipn-tal-labex:~$ srun python3
<torch.cuda.device object at 0x7f29f0602d10>
<torch.cuda.device object at 0x7f29f0602d90>
<torch.cuda.device object at 0x7f29f0602e90>
<torch.cuda.device object at 0x7f29f0618cd0>
<torch.cuda.device object at 0x7f29f0618d10>
<torch.cuda.device object at 0x7f29f0618d90>
<torch.cuda.device object at 0x7f29f0618dd0>
<torch.cuda.device object at 0x7f29f0618e10>
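
The device objects printed above confirm the GPU count but say little about each card. As a hedged variant of the same test, the sketch below reports each GPU's name and total memory, and returns an empty list when PyTorch is missing or cannot see any CUDA device (for instance with a CPU-only build). The function name `describe_gpus` is hypothetical.

```python
def describe_gpus():
    """Return one description string per CUDA device visible to PyTorch."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return []
    if not torch.cuda.is_available():
        # No GPU is visible to this process, or the installed PyTorch
        # build has no CUDA support.
        return []
    descriptions = []
    for index in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(index)
        total_bytes = torch.cuda.get_device_properties(index).total_memory
        descriptions.append(
            f"GPU {index}: {name}, {total_bytes / 1024**2:.0f} MiB"
        )
    return descriptions


if __name__ == "__main__":
    gpus = describe_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for line in gpus:
        print(line)
```

Like the program above, this script is meant to be launched with srun.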

Slurm is the Linux workload manager we use at LIPN to schedule and queue GPU jobs.

srun is the basic command for running jobs in Slurm. This example shows how to check the GPU models you are using and the CUDA version by running the nvidia-smi command with srun.

$ srun nvidia-smi
Mon Nov 29 13:27:00 2021       
| NVIDIA-SMI 470.57.02    Driver Version: 470.57.02    CUDA Version: 11.4     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA GeForce ...  On   | 00000000:1A:00.0 Off |                  N/A |
| 26%   32C    P8    13W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
|   1  NVIDIA GeForce ...  On   | 00000000:1B:00.0 Off |                  N/A |
| 27%   31C    P8    19W / 215W |      1MiB /  7982MiB |      0%      Default |
|                               |                      |                  N/A |
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|    4   N/A  N/A     38044      C   /usr/bin/python                  8941MiB |
|    6   N/A  N/A     38185      C   /usr/bin/python                  3889MiB |
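
The table above is meant for human eyes. If you want the same information in a script, nvidia-smi can also print CSV through its `--query-gpu` and `--format` options. The sketch below parses that CSV; the function names are hypothetical, and the sample line in the usage example is illustrative rather than captured from the server.

```python
import subprocess

QUERY_FIELDS = "index,name,memory.used,memory.total"


def parse_gpu_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    gpus = []
    for line in text.strip().splitlines():
        index, name, used, total = [field.strip() for field in line.split(",")]
        gpus.append({
            "index": int(index),
            "name": name,
            "used_mib": int(used),
            "total_mib": int(total),
        })
    return gpus


def query_gpus():
    """Run nvidia-smi through srun and return the parsed GPU list."""
    result = subprocess.run(
        ["srun", "nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    # Illustrative sample line; call query_gpus() on the server instead.
    sample = "0, NVIDIA GeForce RTX 2080, 1, 7982\n"
    for gpu in parse_gpu_csv(sample):
        free = gpu["total_mib"] - gpu["used_mib"]
        print(f"GPU {gpu['index']} ({gpu['name']}): {free} MiB free")
```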

You can use srun to run Python code, but since you are working on a shared server, it is better to run your code with sbatch.

The sinfo command shows how many nodes are available in the server.

$ sinfo 
labex*       up   infinite      1    mix tal-gpu-labex1

If you want to check the particular configuration of a node, use scontrol:

$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10 
   CPUAlloc=12 CPUTot=40 CPULoad=2.19
   NodeAddr= NodeHostName=tal-gpu-labex1 Version=19.05.5
   OS=Linux 4.19.0-17-amd64 #1 SMP Debian 4.19.194-3 (2021-07-18) 
   RealMemory=128553 AllocMem=0 FreeMem=169091 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   BootTime=2021-08-02T15:48:18 SlurmdStartTime=2021-08-02T15:48:35
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
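
The scontrol output is a series of `Key=Value` pairs, so it is easy to read from a script. As a rough sketch, the code below extracts those pairs and computes how many CPUs are still unallocated on the node. The function names are hypothetical, and values containing spaces (such as the OS field) are cut at the first space.

```python
import re


def parse_scontrol_node(text):
    """Turn the Key=Value pairs of `scontrol show node` into a dict."""
    # A value ends at the first whitespace; an empty value is allowed.
    return dict(re.findall(r"(\w+)=(\S*)", text))


def free_cpus(node):
    """Number of CPUs on the node that Slurm has not allocated yet."""
    return int(node["CPUTot"]) - int(node["CPUAlloc"])


if __name__ == "__main__":
    # A few lines taken from the scontrol output shown above.
    sample = """NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=12 CPUTot=40 CPULoad=2.19
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A"""
    node = parse_scontrol_node(sample)
    print(node["NodeName"], node["State"], free_cpus(node))
```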

If the server is full, Slurm will put your job in a waiting queue. You can check the queue state with squeue.

$ squeue
              8795     labex QKVRegLA ghazi.fe  R 2-23:50:30      1 tal-gpu-labex1
              8796     labex QKVRegLA ghazi.fe  R 2-23:41:19      1 tal-gpu-labex1
              8812     labex MicrofEx gerardo.  R      24:31      1 tal-gpu-labex1
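
To get a quick view of how busy the queue is, the squeue listing can be parsed and summarized. The sketch below assumes the default eight-column layout (JOBID, PARTITION, NAME, USER, ST, TIME, NODES, NODELIST) and job names without spaces; the function name `parse_squeue` is hypothetical.

```python
from collections import Counter

COLUMNS = ("jobid", "partition", "name", "user", "state",
           "time", "nodes", "nodelist")


def parse_squeue(text):
    """Parse default-format squeue lines into a list of dicts."""
    jobs = []
    for line in text.strip().splitlines():
        fields = line.split(None, len(COLUMNS) - 1)
        if len(fields) != len(COLUMNS) or fields[0] == "JOBID":
            continue  # skip the header line and anything malformed
        jobs.append(dict(zip(COLUMNS, fields)))
    return jobs


if __name__ == "__main__":
    # The listing shown above.
    sample = """
              8795     labex QKVRegLA ghazi.fe  R 2-23:50:30      1 tal-gpu-labex1
              8796     labex QKVRegLA ghazi.fe  R 2-23:41:19      1 tal-gpu-labex1
              8812     labex MicrofEx gerardo.  R      24:31      1 tal-gpu-labex1
    """
    jobs = parse_squeue(sample)
    print(Counter(job["state"] for job in jobs))  # jobs per state
    print(Counter(job["user"] for job in jobs))   # jobs per user
```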

If you simply run your code with srun, your job will try to use all the available resources (as in the example from Section 3 - PyTorch). The sbatch command lets you configure inputs, outputs and resource requirements for your job. The following example configures the job to use only 3 GPUs and specifies output files for it.

First, create a batch script file:

#!/usr/bin/env bash
#SBATCH --job-name=MyFirstJob
#SBATCH --gres=gpu:3
#SBATCH --qos=qos_gpu-t4
#SBATCH --cpus-per-task=5
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
srun python

These parameters specify a job to be run on 1 node with 3 GPUs and a maximum run time of 100 hours. Normal output will be sent to MyFirstJob.out and error output to MyFirstJob.err.
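
If you submit many similar jobs, a batch file like the one above can be generated instead of edited by hand. The following is a minimal sketch, not the lab's official template: the function name `render_sbatch_script` is hypothetical, the `--qos` line is deliberately left out (add it back if the cluster requires it), and `python my_script.py` is a placeholder command.

```python
def render_sbatch_script(job_name, gpus, cpus_per_task, time_limit, command):
    """Return the text of a single-node sbatch script.

    `command` is whatever should follow srun, e.g. "python my_script.py"
    (a placeholder, not a file that exists on the server).
    """
    lines = [
        "#!/usr/bin/env bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gres=gpu:{gpus}",              # number of GPUs requested
        f"#SBATCH --cpus-per-task={cpus_per_task}",
        f"#SBATCH --output=./{job_name}.out",      # normal output
        f"#SBATCH --error=./{job_name}.err",       # error output
        f"#SBATCH --time={time_limit}",            # maximum run time
        "#SBATCH --nodes=1",
        "#SBATCH --ntasks-per-node=1",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_sbatch_script("MyFirstJob", gpus=3, cpus_per_task=5,
                               time_limit="100:00:00",
                               command="python my_script.py"))
```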

Then run the script with sbatch:

$ sbatch
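
On success, sbatch prints a line of the form "Submitted batch job N", where N is the job ID you will later see in squeue. As a sketch, the helper below submits a script from Python and returns that ID. The function names and the script path used in the usage note are hypothetical.

```python
import re
import subprocess


def parse_job_id(sbatch_output):
    """Extract the numeric job ID from sbatch's confirmation message."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


def submit(script_path):
    """Submit a batch script with sbatch and return its job ID."""
    result = subprocess.run(
        ["sbatch", script_path],
        check=True, capture_output=True, text=True,
    )
    return parse_job_id(result.stdout)
```

On the server, `submit("my_first_job.sh")` (a placeholder file name) would return the new job's ID as an integer.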

From time to time you will need to kill a job. To do so, pass the JOBID number shown by the squeue command to scancel:

$ squeue
              4759      lipn my_first garciafl PD       0:00      3 (Resources)
              4760      lipn my_first garciafl PD       0:00      1 (Priority)
              4761      lipn SUPER_En   leroux PD       0:00      1 (Priority)
              4675      lipn    GGGS1 xudong.z  R 6-20:30:00      1 lipn-rtx1
              4715      lipn SUPER_En   leroux  R 5-00:03:11      1 lipn-rtx2
              4752      lipn SUPER_En   leroux  R 2-21:37:05      1 lipn-rtx2
$ scancel 4759
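
When you have several pending jobs to cancel, picking the IDs out of the listing by hand gets tedious. The sketch below selects the pending (PD) jobs of one user from a default-format squeue listing and builds one scancel command per job. The function names are hypothetical; note that the USER column in the listings above appears truncated, so the user name must be given as it is displayed.

```python
def pending_job_ids(squeue_text, user):
    """Return the IDs of `user`'s pending (PD) jobs in a squeue listing."""
    job_ids = []
    for line in squeue_text.strip().splitlines():
        fields = line.split()
        # Assumed columns: JOBID PARTITION NAME USER ST TIME NODES NODELIST
        if len(fields) >= 5 and fields[3] == user and fields[4] == "PD":
            job_ids.append(fields[0])
    return job_ids


def build_scancel_commands(job_ids):
    """Build one scancel argument list per job ID."""
    return [["scancel", job_id] for job_id in job_ids]


if __name__ == "__main__":
    # Part of the listing shown above.
    sample = """
              4759      lipn my_first garciafl PD       0:00      3 (Resources)
              4760      lipn my_first garciafl PD       0:00      1 (Priority)
              4761      lipn SUPER_En   leroux PD       0:00      1 (Priority)
              4675      lipn    GGGS1 xudong.z  R 6-20:30:00      1 lipn-rtx1
    """
    for command in build_scancel_commands(pending_job_ids(sample, "garciafl")):
        print(" ".join(command))
```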
