Differences

Below are the differences between two revisions of this page.

- Previous revision: public:support:labex-efl-gpu:tal-labex-gpu [2021/11/30 15:27] garciaflores (created)
- Current version: public:support:labex-efl-gpu:tal-labex-gpu [2023/01/27 16:57] garciaflores
# Accessing the TAL/LABEX EFL GPU server
The server gives access to **8 Nvidia GeForce RTX 2080 GPUs with 8 GB of RAM each** in one node. This server is reserved for external @LipnLab [LABEX EFL](https://) members.
## 1. Connecting to the server
You can connect through the `ssh` protocol. (More help on ssh commands [here](https://).)
If you are physically connected to the LIPN network (that is, you are inside the lab at wonderful [Villetaneuse](https://)), you can connect with:
```bash
# INSIDE the lab command
$ ssh user_name@ssh.tal.lipn.univ-paris13.fr
```
Otherwise, if you are outside the LIPN, you should connect with the following command:
```bash
# OUTSIDE the lab command
$ ssh -p 60022 user_name@tal.lipn.univ-paris13.fr
```
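To avoid retyping the port and host every time, you can add an alias to your OpenSSH client configuration. This is a sketch, not part of the official server setup: the alias name `tal-labex` is hypothetical, and you must replace `user_name` with your actual login.

```bash
# Append a hypothetical "tal-labex" alias to your OpenSSH client config.
mkdir -p ~/.ssh
cat >> ~/.ssh/config <<'EOF'
Host tal-labex
    HostName tal.lipn.univ-paris13.fr
    Port 60022
    User user_name
EOF
```

Afterwards, `ssh tal-labex` behaves like the outside-the-lab command above, and the same alias works with `scp tal-labex:...`.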
After you have connected through ssh to the TAL server, you need to choose the **Labex GPU** virtual machine **(number 20)** from the login menu:
```bash
###################################################################
#                  Bienvenue sur le cluster TAL                   #
[...]
 7) Sdmc        14) GPU2       21) CheneTAL
Votre choix : 20
```
Press enter. You should see the following message:
```bash
Warning: Permanently added '[...]
Linux tal-gpu-login 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64
[...]
Last login: Thu Nov 25 17:58:00 2021 from 192.168.69.3
user_name@tal-gpu-login:~$
```
**For the moment, you need to manually source the `.bashrc` file of your NFS home every time you connect to the Labex server, in order to activate your *miniconda* GPU environment:**
```bash
user_name@tal-gpu-login:~$ source ~/.bashrc
```
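Until this is fixed server-side, one possible workaround is to have login shells source the file for you. The sketch below assumes your login shell reads `~/.bash_profile` but not `~/.bashrc` (which matches the behavior described above); check with the admins before relying on it.

```bash
# Append a guarded line to ~/.bash_profile so every login shell
# sources ~/.bashrc automatically. The grep keeps this idempotent:
# running it twice will not add the line a second time.
grep -qs 'source ~/.bashrc' ~/.bash_profile || \
    echo '[ -f ~/.bashrc ] && source ~/.bashrc' >> ~/.bash_profile
```

On your next login the conda environment should activate without a manual `source`.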
Once you have installed *miniconda* and *Pytorch* (after you run the section 3 commands), you will see the base miniconda prompt.
```bash
(base) user_name@tal-gpu-login:~$
```
## 2. Copying data to/from the server
We use the `scp` command to copy data to/from the LABEX EFL GPU server. Once again, the command changes depending on whether you are inside the LIPN network or outside:
```bash
# INSIDE the lab commands
# copying one file from your computer to the Labex server
$ scp my_file.txt user_name@ssh.tal.univ-paris13.fr:
# copying a whole folder
$ scp -r local_folder user_name@ssh.tal.univ-paris13.fr:
```
To copy data back to your local computer:
```bash
# INSIDE the lab commands
my_user@my_local_computer:~$ scp user_name@ssh.tal.univ-paris13.fr:my_file.txt .
```
And from outside the lab:
```bash
# OUTSIDE the lab commands
# copying files
$ scp -P 60022 my_file.txt user_name@tal.lipn.univ-paris13.fr:
# copying folders recursively
$ scp -P 60022 -r local_folder user_name@tal.lipn.univ-paris13.fr:
```
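Because the options differ inside and outside the lab, a small wrapper can build the right command for you. This is a hypothetical helper, not part of the server setup: it only *prints* the command (dry run, remove the leading `echo` to execute it) and assumes you export `TAL_INSIDE=1` when you are on the LIPN network.

```bash
# Hypothetical dry-run helper: print the scp command appropriate to
# your location. TAL_INSIDE is an assumed convention, not a real
# server variable.
talscp() {
    local src="$1" dest="$2"
    if [ "${TAL_INSIDE:-0}" = "1" ]; then
        # inside the lab: default port, internal hostname
        echo scp -r "$src" "user_name@ssh.tal.univ-paris13.fr:$dest"
    else
        # outside the lab: port 60022, public hostname
        echo scp -P 60022 -r "$src" "user_name@tal.lipn.univ-paris13.fr:$dest"
    fi
}

talscp my_file.txt '~/'
# → scp -P 60022 -r my_file.txt user_name@tal.lipn.univ-paris13.fr:~/
```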
Any data that you need to copy back from the server to your computer must first be copied to your NFS home:
```bash
# OUTSIDE the lab commands
user_name@lipn-tal-labex:~$ [...]
user_name@lipn-tal-labex:~$ [...]
my_user@my_local_computer:~$ [...]
```
## 3. Installing *miniconda* and *Pytorch*
Check the [Miniconda documentation](https://) for details. Download and run the installer:
```bash
$ wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.10.3-Linux-x86_64.sh
$ sh Miniconda3-py37_4.10.3-Linux-x86_64.sh
```
The installation script will run. Press the space bar until the end of the license agreement, then write `yes` to proceed.
```bash
Welcome to Miniconda3 py37_4.10.3
[...]
Do you accept the license terms? [yes|no]
[no] >>>
```
Choose the installation path on your NFS home:
```bash
Miniconda3 will now be installed into this location:
/[...]
[...]
[/[...]] >>>
```
Now you will be asked whether you want to add the *conda* base environment to your `.bashrc` file. Answer yes.
```bash
Preparing transaction: done
Executing transaction: done
[...]
by running conda init? [yes|no]
[no] >>>
```
Manually source the `.bashrc` file on your NFS home to activate the *miniconda* environment before installing Pytorch:
```bash
user_name@lipn-tal-labex:~$ source ~/.bashrc
(base) user_name@lipn-tal-labex:~$
```
Install the [Pytorch](https://) library with `conda`:
```bash
(base) user_name@lipn-tal-labex:~$ conda install [...]
[...]
Proceed ([y]/n)? y
```
Type `y` to proceed. After a while, you can test your Pytorch install. To test it, create the following `gpu_test.py` program with your favorite editor:
```python
# Python program to count GPU cards in the server using Pytorch
import torch

available_gpus = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
for gpu in available_gpus:
    print(gpu)
```
Then run it with the *SLURM* command `srun`:
```bash
(base) user_name@lipn-tal-labex:~$ srun python gpu_test.py
<torch.cuda.device object at 0x7f29f0602d10>
<torch.cuda.device object at 0x7f29f0602d90>
<torch.cuda.device object at 0x7f29f0602e90>
<torch.cuda.device object at 0x7f29f0618cd0>
<torch.cuda.device object at 0x7f29f0618d10>
<torch.cuda.device object at 0x7f29f0618d90>
<torch.cuda.device object at 0x7f29f0618dd0>
<torch.cuda.device object at 0x7f29f0618e10>
```
## 4. Using `slurm` to run your code
We use [Slurm](https://) to schedule and manage the jobs running on the server.
### `srun`
This is the basic command for running jobs in Slurm. This example shows how to check the GPU models you are using and the CUDA version by running the `nvidia-smi` command with `srun`:
```bash
$ srun nvidia-smi
Mon Nov 29 13:27:00 2021
[...]
+-----------------------------------------------------------------------------+
```
**You can use it to run Python code, but as you are working on a shared server, it is better to run your code with `sbatch`.**
### `sinfo` and `scontrol`
The `sinfo` command shows how many nodes are available on the server.
```bash
$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
labex*    [...]
```
If you want to check the particular configuration of a node, use `scontrol`:
```bash
$ scontrol show node tal-gpu-labex1
NodeName=tal-gpu-labex1 Arch=x86_64 CoresPerSocket=10
[...]
```
### `squeue`
If the server is full, Slurm will put your job in a waiting queue. You can check the queue state with `squeue`:
```bash
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME
   8795     labex QKVRegLA ghazi.fe  R 2-23:[...]
   8796     labex QKVRegLA ghazi.fe  R 2-23:[...]
   8812     labex MicrofEx gerardo.[...]
```
### `sbatch`
If you simply run your code with `srun`, your job will try to use all the available resources (as in the `gpu_test.py` example from Section 3). The `sbatch` command is useful to configure inputs, outputs and resource requirements for your job. The following example configures the `gpu_test.py` example to use only 3 GPUs, and specifies output files for the job.

First, create a `myfirst_gpu_job.sh` file:
```bash
#!/bin/bash
#SBATCH --job-name=MyFirstJob
#SBATCH --gres=gpu:3
#SBATCH --qos=qos_gpu-t4
#SBATCH --cpus-per-task=5
#SBATCH --output=./MyFirstJob.out
#SBATCH --error=./MyFirstJob.err
#SBATCH --time=100:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
srun python gpu_test.py
```
These parameters specify a job to run on 1 node, with 3 GPUs, and a maximum time of 100 hours. Normal output will be sent to `MyFirstJob.out`.

Then you run the script with `sbatch`:
```bash
$ sbatch myfirst_gpu_job.sh
```
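If you prefer not to open an editor, the batch file can also be written directly from the shell with a heredoc. This sketch keeps only a few of the directives shown above; fill in the rest (GPUs, QoS, output files, time limit) before submitting.

```bash
# Sketch: create a minimal batch file with a heredoc (directives
# abbreviated; complete them as in the full example above).
cat > myfirst_gpu_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=MyFirstJob
#SBATCH --nodes=1
srun python gpu_test.py
EOF
chmod +x myfirst_gpu_job.sh
```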
### `scancel`
From time to time you need to kill a job. Use the `JOBID` number from the `squeue` command:
```bash
$ squeue
  JOBID PARTITION     NAME     USER ST       TIME
[...]
   4760      lipn my_first garciafl PD [...]
   4761      lipn SUPER_En [...]
   4675      lipn    GGGS1 xudong.z  R 6-20:[...]
   4715      lipn SUPER_En [...]
   4752      lipn SUPER_En [...]
$ scancel 4759
```
## Troubleshooting

For any question about this doc, write to [Jorge Garcia Flores](mailto:).