Free AI web copilot to create summaries, insights and extended knowledge, download it at here
5610
Abstract
ng.spawn function to spawn args.world_size processes. To keep things organized and customizable, we can use argparse.</p>
<figure id="98db">
<div>
<div>
<iframe class="gist-iframe" src="/gist/ramyamounir/6360562e619399fd997842828c4dc929.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="a01c">It is very important to distinguish the code running on the main process from the code running on the children processes. For example, we have to define the CUDA_VISIBLE_DEVICES on the main process before we spawn the train function on the GPUs. We also define the dist_url for all GPUs to communicate back to the main process. Since we are running locally on a single node, the URL can be localhost and a random port. This URL will be given to all GPUs so that the main process can be reachable from any process.</p>
<figure id="d171">
<div>
<div>
<iframe class="gist-iframe" src="/gist/ramyamounir/52553303aff87ebc7522ab79f625a102.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="479f">The code running on the child process (on the GPU) will have specific initialization variables, such as the local rank. The torch.distributed.init_process_group does all the heavy work; it initializes the communication between processes and waits until it makes sure they can talk to each other. We set the seed on all GPUs to be the same for random initializations of parameters. The rest is to get the dataset loader, model, loss function and pass them to the trainer function.</p><p id="10f8">This code can be easily modified to run this training on multiple nodes, not necessarily on the same cluster. You will need to manually change the “localhost” in args.dist_url to Noda A address and set the args.world_size to the total number of GPUs you intend to use from all nodes. You also need to pass the node rank to the train function, which can be added to the local rank to get the global rank of the GPU, as shown below.</p><div id="f34b"><pre><span class="hljs-built_in">args</span>.<span class="hljs-built_in">rank</span> = <span class="hljs-built_in">args</span>.node_rank + gpu</pre></div><p id="f2be">That’s it; you can now run multiple copies of this script on different nodes and see the training happening on all GPUs simultaneously.</p><h1 id="e7ca">Data Loader</h1><p id="a584">As we mentioned before, the dataset needs to be split into chunks, where the total number of chunks should be equal to args.world_size. The DistributedSampler class can easily do that for us. You simply just need to define your dataset and pass it as an argument to the DistributedSampler class along with other parameters, such as world_size and the global_rank of the current process. The output will be a sampler object that you can pass to the DataLoader class.</p>
<figure id="8f16">
<div>
<div>
<iframe class="gist-iframe" src="/gist/ramyamounir/bdd57c88ea95c949e6deebab4120d700.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><h1 id="2988">Get Model, Loss, and Trainer!</h1><p id="52b1">The rest is as easy as defining your architecture and wrapping it in the nn.parallel.DistributedDataParallel class, defining your loss function, and starting to train! :)</p>
<figure id="eb2b">
<div>
<div>
<iframe class="gist-iframe" src="/gist/ramyamounir/9bd34d518fc2dd1bdda8b83dc25606b8.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><h1 id="782b">Slurm</h1><p id="7bad">Slurm is a job scheduler used on clusters to accept job submission files and schedule them when the requested resources become available. The usual procedure is to create a separate script file with Slurm-specific arguments:</p><div id="0049"><pre><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment">#SBATCH -w "[node_name]"</span>
<span class="hljs-comment">#SBATCH -p [partition]</span>
<span class="hljs-comment">#SBATCH --mem=100GB</span>
srun python train.py</pre></div><p id="2118">And “submit it” with sbatch as follows:</p><div id="e09e"><pre>sbatch <span class="hljs-keyword">script</span>.sh</pre></div><p id="3e7a">While you can follow the above steps and get it to do what you want, there is an easier way by utilizing a library called “<a href="https://github.com/facebookincubator/submitit">Submitit</a>” that was recently open-sourced by Facebook AI Research (FAIR). The idea is to use Submitit to generate and submit the job script for us. We can easily define how many nodes and the number of GPUs on each node. We can even define a function to re-submit the job if it gets preempted for any reason.</p><h1 id="d131">Submitit</h1><p id="e2a7">To generate and submit jobs to Slurm using Submitit, we need to get a submitit.AutoExecutor object. We can use the function submitit.AutoExecutor.update_parameters to provide Slurm-specific parameters. Submitit will take care of spawning the different processes on the GPUs (even if on different nodes).</p>
<figure id="6c47">
Options
<div>
<div>
<iframe class="gist-iframe" src="/gist/ramyamounir/f24de13c00aff2de4fd37693de9651f0.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</div>
</div>
</figure></iframe></div></div></figure><p id="4a46">As seen in the code above, we can define a slurm_trainer class and pass an instance of this class to the executor.submit function. This submit function will spawn the __call__ function of the slurm_trainer instance on multiple GPUs as defined in the executer parameters. The slurm_trainer still calls the train function, which gets the dataset, model, loss function, and starts the training. Note: You won't need to define args.gpu and args.rank in the train function anymore because they are now defined on lines 44 and 45. The provided <a href="https://github.com/ramyamounir/Template">template</a> combines training with Slurm and training locally in one script.</p><h1 id="a57c">Tensorboard</h1><p id="a7cb">The utils directory in the template provides some helper functions for automatically starting Tensorboard writer and server. When running the script on a remote server, I recommend starting an Ngrok server that forwards port 6006 to an Ngrok domain. The Ngrok domain can be accessed from anywhere, even on your smartphone, to check your training progress.</p><h1 id="7653">Using the Template</h1><p id="36cc">The template follows a modular approach where the main components of the code (architecture, loss, scheduler, trainer, etc.) are organized into subdirectories.</p><ul><li>The <a href="https://github.com/ramyamounir/Template/blob/main/train.py">train.py</a> script contains all the arguments (parsed by argparse) and nodes/GPUs initializer (slurm or local). It also contains code for importing the dataset, model, loss function and passing them to the trainer function.</li><li>The lib/trainer/trainer.py script defines the details of the training procedure.</li><li>The lib/dataset/[args.dataset].py imports data and defines the dataset function. Creating a data directory with a soft link to the dataset is recommended, especially for testing on multiple datasets.</li><li>The lib/core/ directory contains definitions for loss, optimizer, scheduler functions.</li><li>The lib/utils/ directory contains helper functions organized by file name. (i.e., helper functions for distributed training are placed in the lib/utils/distributed.py file).</li></ul><p id="2d2e"><b>For single node, single GPU training, try:</b></p><div id="d74f"><pre><span class="hljs-keyword">python</span> train.<span class="hljs-keyword">py</span> -gpus <span class="hljs-number">0</span></pre></div><p id="6dec"><b>For single node, multi GPU training, try:</b></p><div id="615b"><pre><span class="hljs-attribute">python</span> train.py -gpus <span class="hljs-number">0</span>,<span class="hljs-number">1</span>,<span class="hljs-number">2</span></pre></div><p id="98ca"><b>For single node, multi GPU training on SLURM, try:</b></p><div id="4d36"><pre><span class="hljs-keyword">python</span> train.<span class="hljs-keyword">py</span> -slurm -slurm_nnodes <span class="hljs-number">1</span> -slurm_ngpus <span class="hljs-number">4</span>
-slurm_partition general</pre></div><p id="d0d4"><b>For multi node, multi GPU training on SLURM, try:</b></p><div id="5bd1"><pre>python train.py -slurm -slurm_nnodes<span class="hljs-number"> 2 </span>-slurm_ngpus<span class="hljs-number"> 8 </span>
-slurm_partition general</pre></div><p id="c988"><b>Tips:</b></p><ul><li>To get more information about available arguments, run: <code>python train.py -h</code></li><li>To automatically start the Tensorboard server as a different thread, add the argument: <code>-tb</code></li><li>To overwrite model log files and start from scratch, add the argument: <code>-reset</code>, otherwise, it will use the last weights as a checkpoint and continue writing to the same tensorboard log files - if the same model name is used.</li><li>To choose specific node names on SLURM, use the argument: <code>-slurm_nodelist GPU17,GPU18</code> as an example.</li><li>If running on a GPU with Tensor cores, using mixed precision models can speed up your training. Add the argument <code>-fp16</code> to try it out. If it makes training unstable due to the loss of precision, don't use it :)</li><li>The template allows you to switch architectures, datasets, and trainers easily by passing different arguments. For example, different architectures can be added to the lib/arch/[arch-name].py directory and passing the arguments as <code>-arch [arch-name]</code> or <code>-trainer [trainer-name]</code> or <code>-dataset [dataset-name]</code></li><li>The stdout and stderr will be printed in the shared directory. We only print the first GPU output. Make sure to change the shared directory in lib/utils/distributed.py depending on the cluster you are using.</li></ul><h1 id="5058">Conclusion</h1><p id="a454">In this article, we went over how to distribute your training using DDP on multiple GPUs in few easy steps. The main difference between DDP and DP is defining communication parameters, such as world_size, ranks, and URL. We also went over Slurm and how to automate the script generation process using Submitit. Both Slurm-based jobs and locally-trained jobs are combined under one easy-to-use template. Please let me know if you face problems by commenting here or opening an issue on the <a href="https://github.com/ramyamounir/Template">Template</a> repository.</p><p id="f856">Happy Coding :)</p></article></body>