Reuse memory of mxnet::cpp::NDArray


#1

How can I reuse the memory of an NDArray?
According to the example, when I want to copy OpenCV memory into an NDArray, I have to create a new NDArray and reallocate a new buffer.

Is it possible to ask the NDArray to reuse the float pointer directly, or to copy the data behind the float pointer into the memory of an old NDArray that has already allocated its buffer?


#2

No, it is not possible. What are you trying to do that makes you concerned about the memory usage?


#3

Because reallocating memory every time is unnecessary; this cost could easily be avoided by users if NDArray provided an API like OpenCV's cv::Mat.

Premature optimization is evil, but we do not need to become pessimistic either.


#4

MXNet has an asynchronous execution engine. In a typical training setting, your training is done on GPU and data preprocessing is done on CPU. Because of the asynchronous nature of MXNet, preprocessing of the next batch of data can happen in parallel to graph computation of the current batch and the cost of an extra memcpy during preprocessing would not impact your training performance.

Also keep in mind that every single neural network operator in MXNet results in a memcpy after computation. The initial memcpy from cv to mxnet is going to be very negligible compared to even the smallest convolutional network.


#5

> The initial memcpy from cv to mxnet is going to be very negligible compared to even the smallest convolutional network.

Not for training, but I want to save some extra cost when doing inference. Even if it is negligible for performance, it is not a bad thing to avoid the cost if the API is easy enough to use.

By the way, does the C++ API of MXNet support arbitrary batch sizes when doing inference?
Unlike the memory reallocation/copy issue, I think batch size does have a big impact on performance.

Thanks for your help.


#6

Arbitrary batch-size is supported, but every time batch-size changes, a new allocation for the network happens which slows down inference. You can consider having a few batch-size buckets to avoid memory allocation for each new batch-size.


#7

Any plan to avoid reallocation of the network?

Having a few fixed batch-size buckets is not that practical for my case because:

  1. GPU/CPU memory is limited
  2. it is impossible to predict how many faces/persons/etc. will appear in the frame at runtime

Is this problem hard to solve? Does it require a lot of code changes, or would it have a big impact on the architecture, etc.?


#8

The only way to avoid memory reallocation is by having the network allocate memory for the largest possible batch-size and reuse that same memory when batch-size is smaller.

If you use the Gluon API, calling HybridBlock.hybridize(static_alloc=True) will do exactly that. With the CPP API, AFAIK, there isn't a way to specify this. Perhaps @leleamol, who is working on an update to the CPP API, may be able to point you to a solution.


#9

The mxnet::cpp API supports creating shared executors. You need to load the model and parameters only once, then create shared executors catering to the different batch sizes.

I have written an RNN inference example (the PR is out for review) that demonstrates inference with variable input sizes.
Here is the link https://github.com/apache/incubator-mxnet/pull/13680

Please let me know if it helps.


#10

Thanks, this helps a lot. I will pull it from GitHub and use it after the pull request is merged.


#11

Some questions about the example.

About the constructor

args_map["data0"] = NDArray(Shape(num_words, 1), global_ctx, false);
args_map["data1"] = NDArray(Shape(1), global_ctx, false);
  1. Is num_words analogous to the batch size of a computer vision task?
  2. If 1 is correct, is the maximum batch size the same as num_words?
  3. Is "data1" the batch size at runtime?

According to PredictSentiment

std::vector<float> index_vector(num_words, GetIndexForWord("<eos>"));
int num_words = ConverToIndexVector(input_text, &index_vector);

executor->arg_dict()["data0"].SyncCopyFromCPU(index_vector.data(), index_vector.size());
executor->arg_dict()["data1"] = num_words; 
  1. Why do you initialize index_vector if you are going to clear it in ConverToIndexVector?
  2. index_vector.size() should have the same value as num_words, so why not just use it instead of num_words?

If I want to apply this technique to a computer vision task, what changes do I need to make? Are the following procedures correct?

When constructing

args_map["data0"] = NDArray(Shape(max_batch_size, height, width, channel), global_ctx, false);
args_map["data1"] = NDArray(Shape(1), global_ctx, false);

When predicting

std::vector<float> image_vector;
//predict_images is a vector that contains the images, already converted to the
//format required by the mxnet network
for(auto const &img : predict_images){
    std::copy(std::begin(img), std::end(img), std::back_inserter(image_vector));
}
executor->arg_dict()["data0"].SyncCopyFromCPU(image_vector.data(), image_vector.size());
executor->arg_dict()["data1"] = num_words;

From a technical viewpoint, is it possible to support variable input batch sizes without preallocating for the maximum batch size?

Thanks