Just putting this on the forum since it could be of use for some people.
I haven’t spent long using these tools, but I think they offer a little more insight than the MXNet Profiler, if optimising CUDA kernels and GPU performance is your thing!
Anyone used these tools before? If so, I’d be really interested to hear from you. What have you used them for and what metrics do you use the most?
Using NVIDIA’s CUDA Profiling Tools
MXNet’s Profiler is definitely the recommended starting point for profiling MXNet code, but NVIDIA also provides a couple of tools for low level profiling of CUDA code: Visual Profiler and Nsight Compute. You can use these tools to profile all kinds of executables, so they can be used for profiling Python scripts running MXNet.
Visual Profiler is available in the CUDA 9 and CUDA 10 toolkits. You can get a timeline view of CUDA kernel executions, and also analyse the profiling results to get automated recommendations. It seems to be the most useful for profiling end-to-end training, but I found the interface can get slow and unresponsive.
Nsight Compute is available in the CUDA 10 toolkit, but can be used to profile code running CUDA 9. You don’t get a timeline view, but you get many low level statistics about each individual kernel executed and can compare multiple runs (i.e. create a baseline). It doesn’t seem to be that useful for profiling end-to-end model training though.
Start by profiling a small section of code (e.g. a few batches) otherwise the visualizations and analysis will take much longer.
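One way to keep a profiling run short is to cap the number of batches the training loop consumes. The sketch below is framework-agnostic and uses only the Python standard library; `limited_batches`, `run_for_profiling`, `train_step` and the toy `data_loader` are hypothetical names for illustration, not part of MXNet.

```python
from itertools import islice


def limited_batches(data_loader, max_batches):
    """Yield at most max_batches items from any iterable of batches."""
    return islice(data_loader, max_batches)


def run_for_profiling(data_loader, train_step, max_batches=5):
    """Run train_step on only the first few batches; return how many ran."""
    processed = 0
    for batch in limited_batches(data_loader, max_batches):
        train_step(batch)
        processed += 1
    return processed


if __name__ == "__main__":
    # Stand-in for a real data loader: 100 fake "batches".
    data_loader = range(100)
    seen = []
    count = run_for_profiling(data_loader, seen.append, max_batches=3)
    print(count, seen)
```

In a real script you would pass your own data iterator and training step, so the profiler only records a handful of iterations.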
On local machine, download and install CUDA toolkit.
Go to https://developer.nvidia.com/cuda-toolkit
You only need the ‘toolkit’ and not the CUDA drivers, etc.
With CUDA versions, it seems to be possible to use profilers from CUDA 10 toolkit on remote code running CUDA 9.
CUDA 10 toolkit is required for
Start AWS EC2 instance (with GPU) using the DLAMI (CUDA drivers and toolkit pre-installed).
Allow password-based login via SSH.
Seems to be the only method to connect to remote machine with NVIDIA profilers.
Use very strong password, and use a security group with minimal source addresses (i.e. not open to world).
Using Visual Profiler
Run nvvp on local machine
You start the program from the terminal rather than the ‘Applications’ folder.
Select a workspace
Choose default: e.g
Create New Session
- File -> New Session
- Connection: Manage connections…
- Host name: IP address of the AWS EC2 instance. e.g. 18.104.22.1687
- Username: Username on AWS EC2 instance. e.g. ubuntu or ec2_user
- Label: Any string will do e.g. firstname.lastname@example.org
- System Type: SSH, Port number: 22
- Toolkit/Script: Manage
- Toolkit path: Browse…
- Should be path to CUDA toolkit on remote instance
- e.g. /usr/local/cuda-9.2/bin
- Update library path with defaults? Yes.
- File: Should be path to Python (in correct conda environment)
- Working directory:
- Arguments: Should be the script you want to profile, and its arguments.
/home/ubuntu/mxnet/example/gluon/mnist/mnist.py --cuda --batch-size 100 --epochs 1
- Select ‘Profile child processes’
- Next >
- Optionally select what you need to be profiled.
- e.g. Enable CPU thread tracing
nvprof version is different. Want to proceed? Yes.
- Should start job straight away
- “Generating Timeline: Running application to generate timeline.”
- Seems to keep running >10s after script has completed.
Use analysis tab to run through diagnostics and get automated recommendations.
Using NVIDIA Nsight Compute
Couldn’t get remote execution to work with Nsight Compute (as it does with Visual Profiler): it can’t find the remote Python executable.
See in known issues: “Launching applications on remote targets/platforms is not yet supported.”
Download CUDA 10 toolkit on remote machine
You can skip the installation of the drivers, and just install the toolkit.
By default the toolkit will be installed to
Use nv-nsight-cu-cli to collect data
Can be found at
And some useful arguments are:
-f to force overwrite of profiling file
-c to specify number of kernels to collect
-s to specify number of kernels to skip before collecting
/usr/local/cuda-10.0/NsightCompute-1.0/nv-nsight-cu-cli -f -c 10 /home/ubuntu/anaconda3/envs/mxnet_p36/bin/python /home/ubuntu/mxnet/example/gluon/mnist/mnist.py --cuda --batch-size 500 --epochs 1
You should now have a profile.nsight-cuprof-report file in the current working directory.
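If you run the collection step often, it can help to assemble the command in a small script. The sketch below only builds and prints the command using the three flags listed above (-f, -c, -s); `build_nsight_command` is a hypothetical helper, and the paths are the ones from the example command, so adjust them to your own install.

```python
import shlex


def build_nsight_command(cli_path, python_path, script, script_args=(),
                         num_kernels=None, skip_kernels=None, force=True):
    """Return the nv-nsight-cu-cli invocation as a list of arguments."""
    cmd = [cli_path]
    if force:
        cmd.append("-f")                      # overwrite existing report
    if num_kernels is not None:
        cmd += ["-c", str(num_kernels)]       # number of kernels to collect
    if skip_kernels is not None:
        cmd += ["-s", str(skip_kernels)]      # kernels to skip before collecting
    cmd += [python_path, script, *script_args]
    return cmd


if __name__ == "__main__":
    cmd = build_nsight_command(
        "/usr/local/cuda-10.0/NsightCompute-1.0/nv-nsight-cu-cli",
        "/home/ubuntu/anaconda3/envs/mxnet_p36/bin/python",
        "/home/ubuntu/mxnet/example/gluon/mnist/mnist.py",
        ["--cuda", "--batch-size", "500", "--epochs", "1"],
        num_kernels=10,
    )
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the command as a list avoids quoting problems if a path or argument contains spaces.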
Copy data to local machine
Use scp, the Jupyter download feature, or another method to get the file to the local machine.
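For the scp route, the copy can be sketched as below. `build_scp_command` is a hypothetical helper, and the host address and remote path are placeholders (assumptions, not values from a real instance); substitute your instance's IP and the directory where you ran the profiler.

```python
import shlex


def build_scp_command(user, host, remote_path, local_dir="."):
    """Return an scp invocation that copies remote_path into local_dir."""
    return ["scp", f"{user}@{host}:{remote_path}", local_dir]


if __name__ == "__main__":
    cmd = build_scp_command(
        "ubuntu",
        "203.0.113.10",  # placeholder address, replace with your instance
        "/home/ubuntu/profile.nsight-cuprof-report",  # assumed report location
    )
    # Run this on the local machine, or pass cmd to subprocess.run(cmd, check=True).
    print(" ".join(shlex.quote(part) for part in cmd))
```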
- Open ‘NVIDIA Nsight Compute’ from
- File -> Open file…
- Select the file you just transferred to the local machine.
- Compare to GPU ‘Speed of Light’ (SOL).