Hello all. I am trying to deploy my MXNet-based application to the web, and I want it to handle as many web requests as possible. Each request runs two Inception v3 models to classify an image. Web servers typically use multiple threads to handle several requests at once, but when I load test my application, I notice that requests take a long time to complete.
What are some good ways to increase performance when deploying an MXNet model to the web? Should I create a pool of Module objects to handle classification? Should I use one Module object for all requests? Which objects should I cache instead of re-creating them on every request?
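
To make the pool option concrete, here is a rough sketch of the idea, not a working MXNet setup. `ModelPool`, `FakeModel`, and `handle_request` are hypothetical names made up for illustration, and `FakeModel` only stands in for a loaded Module object; no real MXNet calls are shown. The assumption is that each model is loaded once at startup and that a request thread borrows one for the duration of a classification.

```python
import queue
from contextlib import contextmanager


class ModelPool:
    """Fixed-size pool of pre-loaded model objects shared by request threads."""

    def __init__(self, factory, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # Load each model once, at startup, instead of once per request.
            self._pool.put(factory())

    @contextmanager
    def borrow(self, timeout=None):
        # Blocks while every model in the pool is in use by another thread.
        model = self._pool.get(timeout=timeout)
        try:
            yield model
        finally:
            # Always hand the model back, even if classification raised.
            self._pool.put(model)


class FakeModel:
    """Hypothetical placeholder for a loaded Module; not a real MXNet object."""

    created = 0  # counts how many models were ever constructed

    def __init__(self):
        FakeModel.created += 1

    def predict(self, image):
        return "label-for-%s" % image


# Assumption: two models in the pool; the right size depends on memory and cores.
pool = ModelPool(FakeModel, size=2)


def handle_request(image):
    """What a web request handler would do: borrow, classify, return."""
    with pool.borrow() as model:
        return model.predict(image)
```

With this layout no model object is ever used by two threads at the same time, and the number of loaded models stays fixed no matter how many requests arrive. Whether this beats a single shared object guarded by a lock is exactly what I am unsure about.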