FPS for object detection inference on GPU

Hi there,
I re-trained the 'ssd_512_resnet50_v1_custom' model on a custom dataset, and now I want to estimate the inference FPS on a GeForce RTX 2080 Ti.

I am using this code:

import time

import cv2
import numpy as np
import mxnet as mx
import gluoncv as gcv
from gluoncv import model_zoo


def main():

    # Use GPU 1 if it is available, otherwise fall back to the CPU
    try:
        a = mx.nd.zeros((1,), ctx=mx.gpu(1))
        ctx = [mx.gpu(1)]
    except:
        ctx = [mx.cpu()]

    # -------------------------
    # Load model
    # -------------------------
    classes = ['Guitar', 'face']
    net = model_zoo.get_model('ssd_512_resnet50_v1_custom', ctx=ctx, classes=classes, pretrained_base=False)
    net.load_parameters('saved_weights/test_000/ep_30.params')

    # Load the webcam handler
    cap = cv2.VideoCapture("video/video_01.mp4")

    count_frame = 0

    # One slot per frame of the test video
    loading_frame_FPSs = np.zeros(844)
    pre_processing_FPSs = np.zeros(844)
    inference_FPSs = np.zeros(844)
    total_FPSs = np.zeros(844)

    while True:
        print(f"Frame: {count_frame}")

        total_t_frame = 0

        #######
        start_t = time.time()
        #######
        # Load frame from the camera
        ret, frame = cap.read()
        #######
        stop_t = time.time()
        total_t_frame += (stop_t - start_t)
        FPS = 1 / (stop_t - start_t)
        loading_frame_FPSs[count_frame] = FPS
        print(f"\tloading frame time = {stop_t - start_t} -> FPS = {FPS}")
        #######

        if (cv2.waitKey(25) & 0xFF == ord('q')) or (ret == False):
            cv2.destroyAllWindows()
            cap.release()
            break

        #######
        start_t = time.time()
        #######
        # Image pre-processing
        frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
        rgb_nd, frame = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
        #######
        stop_t = time.time()
        total_t_frame += (stop_t - start_t)
        FPS = 1 / (stop_t - start_t)
        pre_processing_FPSs[count_frame] = FPS
        print(f"\timage pre-processing time = {stop_t - start_t} -> FPS = {FPS}")
        #######

        #######
        start_t = time.time()
        #######
        # Run frame through network
        class_IDs, scores, bounding_boxes = net(rgb_nd)
        #######
        stop_t = time.time()
        total_t_frame += (stop_t - start_t)
        FPS = 1 / (stop_t - start_t)
        inference_FPSs[count_frame] = FPS
        print(f"\tinference time = {stop_t - start_t} -> FPS = {FPS}")
        #######

        print(f"\tTotal frame FPS = {1 / total_t_frame}")
        total_FPSs[count_frame] = 1 / total_t_frame

        count_frame += 1

    cv2.destroyAllWindows()
    cap.release()

    print("Average FPS for:")
    print(f"\tloading frame: {np.average(loading_frame_FPSs)}")
    print(f"\tpre-processing frame: {np.average(pre_processing_FPSs)}")
    print(f"\tinference frame: {np.average(inference_FPSs)}")
    print(f"\ttotal process: {np.average(total_FPSs)}")


if __name__ == '__main__':
    main()

So, basically I'm measuring the time required for every step of the pipeline (frame loading, pre-processing, inference), and calculating the FPS for each step as well as for the whole pipeline.

Looking at the output

Average FPS for:
loading frame: 813.3313447171636
pre-processing frame: 10.488629638752457
inference frame: 101.50787170217922
total process: 9.300166489874748

it seems that the bottleneck is mostly given by the pre-processing of the images.
When checking the output of nvidia-smi, I got:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.56       Driver Version: 418.56       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:18:00.0 Off |                  N/A |
| 36%   63C    P0    78W / 250W |     10MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
| 37%   65C    P2    84W / 250W |    715MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce RTX 208...  Off  | 00000000:86:00.0 Off |                  N/A |
| 36%   63C    P0    71W / 250W |     10MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
| 27%   34C    P8    10W / 250W |    165MiB / 10989MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

which I guess is reasonable: since I'm running inference on just one image at a time, I don't expect the GPU usage to be as high as it is during training.

At this point, however, there are a couple of things I’m not sure about:

  1. When reading about SSD models, the average FPS usually quoted is in the range of 25-30. How do I get to those values? Is it all about image pre-processing?

  2. I tried to modify the block

     try:
         a = mx.nd.zeros((1,), ctx=mx.gpu(1))
         ctx = [mx.gpu(1)]
     except:
         ctx = [mx.cpu()]

with simply:

ctx = mx.gpu(1)

but it seems that this way the process is running on CPU (not even those 715 MB are occupied on GPU). Why is that?

Hi,

when you are timing MXNet operations, don't forget that they are asynchronous. For instance, your net() call does not block until the results are available; the synchronization only happens once you start using the class_IDs, scores and bounding_boxes objects.

You can add an explicit synchronization like this:

if isinstance(class_IDs, mx.ndarray.ndarray.NDArray):
    class_IDs.wait_to_read()

I'm sure that will bring the 100 FPS you're seeing for the inference step down to a more reasonable number for a single GPU.
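
By the way, mx.nd.waitall() is also handy for benchmarking: it simply blocks until all computation queued so far has finished.

# blunt alternative for benchmarking: wait for *everything* queued so far
mx.nd.waitall()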

Processing one image at a time is quite slow; the main performance boost from using GPUs comes from processing large batches of images at a time.
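
Just to illustrate the idea, a rough sketch (frame_buffer is a made-up name for a list of pre-processed frames that all have the same shape and already sit on the GPU):

# stack N frames of shape (1, 3, H, W) into one (N, 3, H, W) batch
batch = mx.nd.concat(*frame_buffer, dim=0)
# one forward pass over the whole batch
class_IDs, scores, bounding_boxes = net(batch)
class_IDs.wait_to_read()   # wait until the whole batch has been processed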

I'm not sure about your remark concerning using [mx.gpu(1)] directly; that should work, at least for loading the model onto the GPU.
However, when you load the image into an mx.nd.array, you are not putting it into the GPU context.

This should fix it (place it right before you pass the frame to the net() call):

rgb_nd = rgb_nd.as_in_context(ctx)

Note, you can run nvidia-smi continuously in the background, which makes it easier to monitor GPU usage (with a 1 second resolution):

nvidia-smi -l 1

hth,
Lieven

Hi, thank you very much for your reply!

Yes, after posting I noticed I wasn't properly loading the images into the GPU context, so I fixed that. The results after the change were:

Average FPS for:
loading frame: 1412.2595796839475
pre-processing frame: 24.302389215475003
inference frame: 128.85291454046728
total process: 19.77659590877164

However, I didn't really understand the part about the asynchronous processing… where exactly should I add the explicit synchronization you suggested?
I was looking around to figure it out, but I haven’t been successful so far.

The 3 NDArray objects class_IDs, scores and bounding_boxes that are returned by the call to net() are not yet filled with data at the time the next line of code gets executed.
They are just placeholders that will point to the actual data once it becomes available, i.e. when your neural net finishes its calculations.

So if you stop your clock directly after the net() call, you are just timing how long it takes MXNet to copy the data to a queue and tell the neural net to start processing it.
What you actually want to time is how long it takes for the net to finish all of its calculations.

This waiting happens automatically when you try accessing the data in these 3 objects, for instance by copying it to a numpy array. You can trigger the wait manually by calling the wait_to_read() function on each of the 3 objects. So the example I posted before should come directly after the net() call.
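
Concretely, the timed section would look roughly like this:

start_t = time.time()
class_IDs, scores, bounding_boxes = net(rgb_nd)
# block here, so the clock only stops once the net has actually finished
class_IDs.wait_to_read()
scores.wait_to_read()
bounding_boxes.wait_to_read()
stop_t = time.time()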

hth,

Lieven

Ok, one last (I hope!) thing.

I modified the code as you suggested, and tested it on both my PC’s CPU and on a server’s GPU.

On the GPU, it works as expected, i.e. the inference FPS drops when I call the wait_to_read function, and the other times stay more or less the same:

***************************
	GPU
**************************
WITHOUT wait_to_read
Average FPS for:
loading frame: 1156.7550107488846
pre-processing frame: 20.94492251977489
inference frame: 117.22317563930214
total process: 17.121521299835813

WITH wait_to_read
Average FPS for:
loading frame: 1006.13128393297
pre-processing frame: 20.87063104105339
inference frame: 44.94674691249187
total process: 13.730170811376144

However, I don’t understand the outcome on my CPU. If I manually call the wait_to_read function, the inference FPS drops again, but on the other hand other processes (frame loading and pre-processing) seem to speed up a lot:

***************************
	CPU
**************************
NO wait_to_read
Average FPS for:
loading frame: 1089.0839114867913
pre-processing frame: 12.040131318599638
inference frame: 215.90476966510465
total process: 10.94726385367164

WITH wait_to_read
Average FPS for:
loading frame: 2143.5243927221472
pre-processing frame: 203.16948610709807
inference frame: 2.687850648054327
total process: 2.646779787380752

Again, the code I’m using looks something like this:

while True:
    print(f"Frame: {count_frame}")

    total_t_frame = 0

    #######
    start_t = time.time()
    #######
    # Load frame from the camera
    ret, frame = cap.read()
    #######
    stop_t = time.time()
    total_t_frame += (stop_t - start_t)
    FPS = 1 / (stop_t - start_t)
    loading_frame_FPSs[count_frame] = FPS
    #######

    if (cv2.waitKey(25) & 0xFF == ord('q')) or (ret == False):
        cv2.destroyAllWindows()
        cap.release()
        print("Done!!!")
        break

    #######
    start_t = time.time()
    #######
    # Image pre-processing
    frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
    rgb_nd, frame = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
    #######
    stop_t = time.time()
    total_t_frame += (stop_t - start_t)
    FPS = 1 / (stop_t - start_t)
    pre_processing_FPSs[count_frame] = FPS
    #######

    #######
    start_t = time.time()
    #######
    # Run frame through network
    rgb_nd = rgb_nd.as_in_context(ctx)
    class_IDs, scores, bounding_boxes = net(rgb_nd)
    if isinstance(class_IDs, mx.ndarray.ndarray.NDArray):
        class_IDs.wait_to_read()
    if isinstance(scores, mx.ndarray.ndarray.NDArray):
        scores.wait_to_read()
    if isinstance(bounding_boxes, mx.ndarray.ndarray.NDArray):
        bounding_boxes.wait_to_read()
    #######
    stop_t = time.time()
    total_t_frame += (stop_t - start_t)
    FPS = 1 / (stop_t - start_t)
    inference_FPSs[count_frame] = FPS
    #######

    total_FPSs[count_frame] = 1 / total_t_frame

    # print(count_frame)
    count_frame += 1

So, calling the wait_to_read function should only affect the inference measurement… unless I'm misunderstanding something, which at this point seems likely.
Any idea what I’m doing wrong?

I'm sorry to keep bugging you, I'm just trying to figure this out… if it's too long/complicated to explain, it would be great if you could point me to something I could read about it!
Thanks again!

Hi.

I don't immediately see how waiting on the results of the net could affect the performance of loading a frame from the camera.

Are you calculating the FPS based on only one sample? I suggest running this in a loop for a thousand iterations and averaging the results; that will give you a more representative sample.
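
For instance, something along these lines (just a sketch; it reuses a single pre-processed frame and times only the network):

n_iter = 1000
start_t = time.time()
for _ in range(n_iter):
    class_IDs, scores, bounding_boxes = net(rgb_nd)
    class_IDs.wait_to_read()
    scores.wait_to_read()
    bounding_boxes.wait_to_read()
avg_t = (time.time() - start_t) / n_iter
print(f"average inference time = {avg_t} -> FPS = {1 / avg_t}")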

Lieven

No, I'm using a video file as input, so I'm calculating the FPS for each frame and then the averages over the whole video (those are the outputs I've posted so far).
Again, there's no impact on the other steps when I run it on the server's GPU, but for some reason it changes on my computer's CPU. I'll look into that a bit more; in the meantime I found this page, which kind of clarified things for me! :grinning: