How to use the profiler for distributed training?


#1

When I try to use MXNet's profiler while running distributed training, it doesn't log events to the JSON file. I set the environment variables MXNET_PROFILER_AUTOSTART=1 and MXNET_PROFILER_MODE=1 to start the profiler. This creates only a 16 KB JSON file containing a list of events like the ones below. No actual operators appear in the output.

"traceEvents": [
        {
            "ph": "M",
            "args": {
                "name": "cpu/0"
            },
            "pid": 0,
            "name": "process_name"
        },
...
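A quick way to check whether an output file like this contains anything beyond metadata is to count its events by phase. The sketch below assumes the file follows the Chrome trace layout shown above, where entries with "ph": "M" only name processes; the helper names summarize_trace and has_operator_events are made up for illustration and are not part of MXNet.

```python
import json
from collections import Counter


def summarize_trace(path):
    """Count events in a Chrome-trace-style JSON file by phase ("ph")."""
    try:
        with open(path) as f:
            trace = json.load(f)
    except json.JSONDecodeError as err:
        # A truncated or otherwise incomplete file fails to parse here.
        return {"valid": False, "error": str(err), "phases": Counter()}
    # The trace is usually an object with a "traceEvents" list,
    # but a bare list of events is also accepted.
    events = trace.get("traceEvents", []) if isinstance(trace, dict) else trace
    phases = Counter(ev.get("ph", "?") for ev in events)
    return {"valid": True, "error": None, "phases": phases}


def has_operator_events(summary):
    # "M" entries are metadata (process names); anything else is a real event.
    return any(ph != "M" for ph in summary["phases"])
```

If has_operator_events returns False for a file that parses, the profiler ran but recorded no operators; if the file does not parse at all, it was probably cut off or overwritten while being written.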

I've also tried starting the profiler in code with mx.profiler.profiler_set_state('run'). That doesn't work either.

This happens even when the job is launched locally using launch.py with the local launcher option. This is the command I used, in case someone wants to reproduce it:

cd example/image-classification && ../../tools/launch.py -n 1 --launcher local python train_imagenet.py --benchmark 1 --kv-store dist_sync --gpus 0,1,2,3 --network mlp --batch-size 256 --num-epochs 1

I'd like to note that I compiled with USE_PROFILER=1 and can profile jobs that are run locally without launch.py.

How do I use the profiler correctly for distributed training?


#2

I was able to create the profiler json output file (8MB in size) with the command that you provided on a single GPU. Are you seeing this issue only when you pass in multiple GPUs as the argument?


#3

An update in case someone else is trying to use the profiler for distributed training.
Using MXNET_PROFILER_AUTOSTART=1 does create a JSON file, but it is corrupted and incomplete. I suspect that the server and worker processes both try to write their profiles to the same file, and this corrupts it. Starting the profiler in code using 'mx.profiler…' works, but it profiles only the worker process on a machine. Profiling the server needs something else.
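For the in-code route, one way to rule out two processes writing to the same file is to give each process its own output name. This is only a sketch: profile_filename, start_profiler and stop_profiler are hypothetical helpers, the DMLC_ROLE lookup assumes the launcher exports that variable for each process, and the profiler_set_config / profiler_set_state calls are the older API names, which may differ in your MXNet version. It does not address profiling the server.

```python
import os


def profile_filename(prefix="profile"):
    """Build an output name that is unique per role and per process."""
    # Assumption: the launcher sets DMLC_ROLE ("worker", "server" or
    # "scheduler"); fall back to "local" when it is not set.
    role = os.environ.get("DMLC_ROLE", "local")
    return "{}_{}_{}.json".format(prefix, role, os.getpid())


def start_profiler():
    # Imported lazily so profile_filename() works without MXNet installed.
    import mxnet as mx
    # Older API names; newer releases may use set_config / set_state.
    mx.profiler.profiler_set_config(mode="all", filename=profile_filename())
    mx.profiler.profiler_set_state("run")


def stop_profiler():
    import mxnet as mx
    # Stopping the profiler is what triggers the dump in this older API.
    mx.profiler.profiler_set_state("stop")
```

Calling start_profiler() before the training loop and stop_profiler() after it in train_imagenet.py should leave one file per worker process, such as profile_worker_<pid>.json.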