How to use profiler for distributed training?

rahul003 · October 27, 2017, 9:35am

When I try to use MXNet’s profiler while running distributed training, it doesn’t log any events to the JSON file. I set the environment variables MXNET_PROFILER_AUTOSTART=1 and MXNET_PROFILER_MODE=1 to start the profiler. This only creates a 16KB JSON file containing a list of events like the one below. No actual operators appear in the output.

"traceEvents": [
        {
            "ph": "M",
            "args": {
                "name": "cpu/0"
            },
            "pid": 0,
            "name": "process_name"
        },
...
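
One way to confirm that a trace holds only metadata is to count its events by their "ph" field; in the Chrome trace event format, "M" marks a metadata event such as the process name shown above. The following is a minimal sketch, not part of MXNet: the helper name count_phases and the inline sample are made up for illustration, and the sample only mirrors the excerpt above rather than real profiler output.

```python
import json
from collections import Counter


def count_phases(trace):
    """Count events in a Chrome-trace style dict by their "ph" field."""
    # A missing "traceEvents" key is treated as an empty trace.
    return Counter(event.get("ph") for event in trace.get("traceEvents", []))


# Hypothetical sample that mirrors the excerpt above (not real profiler output).
sample = json.loads(
    '{"traceEvents": [{"ph": "M", "args": {"name": "cpu/0"},'
    ' "pid": 0, "name": "process_name"}]}'
)
print(count_phases(sample))
```

To run it on a real trace, load the file with json.load and pass the result in. If only "M" shows up, no operator events were recorded. A truncated file will fail to parse, which is itself a sign that the output is incomplete.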

I’ve also tried starting the profiler from code with mx.profiler.profiler_set_state('run'). That doesn’t work either.

This happens even when the job is launched locally using launch.py with the --launcher local option. This is the command I used, if someone wants to reproduce it:

cd example/image-classification && ../../tools/launch.py -n 1 --launcher local python train_imagenet.py --benchmark 1 --kv-store dist_sync --gpus 0,1,2,3 --network mlp --batch-size 256 --num-epochs 1

I’d like to note that I have compiled with USE_PROFILER=1 and can profile jobs that are run locally without launch.py.

How to use the profiler correctly for distributed training?

anirudh2290 · October 28, 2017, 8:50pm

I was able to create the profiler json output file (8MB in size) with the command that you provided on a single GPU. Are you seeing this issue only when you pass in multiple GPUs as the argument?

rahul003 · November 29, 2017, 10:09pm

An update in case someone else is trying to use the profiler for distributed training.
Using MXNET_PROFILER_AUTOSTART=1 does create a JSON file, but it is corrupted and incomplete. I suspect that the server and the worker both try to write their profile to the same file, and this corrupts it. Setting it in code using ‘mx.profiler…’ works, but it only profiles the worker process on a machine. Profiling the server needs something else.
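
For the worker side, a minimal sketch of the in-code approach could look like the following. It is not a confirmed fix. It assumes that the launcher exports DMLC_ROLE for each process and that mx.profiler.profiler_set_config accepts mode and filename as it did in MXNet releases of this period; the helper names profile_filename, start_profiler, and stop_profiler are made up for illustration. Giving every process its own file name is only meant to rule out the shared-file suspicion above.

```python
import os


def profile_filename(prefix="profile"):
    """Build a per-process file name so that processes do not share one file."""
    # Assumption: the launcher sets DMLC_ROLE (e.g. worker, server, scheduler).
    role = os.environ.get("DMLC_ROLE", "local")
    return "{}_{}_{}.json".format(prefix, role, os.getpid())


def start_profiler(prefix="profile"):
    """Start the MXNet profiler in this process and return the output file name."""
    import mxnet as mx  # imported lazily so profile_filename works without MXNet

    filename = profile_filename(prefix)
    # Assumption: this is the older profiler API (mode and filename arguments).
    mx.profiler.profiler_set_config(mode="all", filename=filename)
    mx.profiler.profiler_set_state("run")
    return filename


def stop_profiler():
    """Stop the profiler so the trace is written out."""
    import mxnet as mx

    mx.profiler.profiler_set_state("stop")
```

Calling start_profiler() near the top of the training script and stop_profiler() after training would cover the worker only; as noted above, the server process is not covered by this.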
