Use GPUs



That paragraph is very interesting "You can lose significant performance by moving data without care. A typical mistake is as follows: computing the loss for every minibatch on the GPU and reporting it back to the user on the commandline (or logging it in a NumPy array) will trigger a global interpreter lock which stalls all GPUs. It is much better to allocate memory for logging inside the GPU and only move larger logs"

Could it come with more practical recommendations or even code snippets illustrating optimal monitoring? For example:

  • Does a metric.update(label, y_pred) (where metric is an mx.metric) also incurs costly data transfer?
  • Does a"something using" metric.get()) suffer from this data transfer + gil problem?
  • should print statements be avoided at all cost in the training loop?

  • print would definitely be fatal. This wouldn’t only invoke the wrath of the GIL (global interpreter lock) but also that of the console output. Even if you were to write C++ code, output to console is a surefire way of killing performance :slight_smile: .
  • As for the metric, that is safer, but you shouldn’t log the metric to an array on the CPU if you can avoid it. Instead a much better strategy is to log it onto an array on the GPU (if that’s where you are) and then occasionally transfer to CPU.

In general what is fatal are O(n) updates and not O(1) updates. We should probably add more details about this. One of the real use cases where we saw this was in a differential privacy application where a (highly talented) scientist decided to log scalar updates to the CPU after each observation, thus killing performance. Note that this is inherent to interpreted Python code and thus hard to avoid via the framework.