I’m running a GluonCV SSD training script on a SageMaker ml.p3.2xlarge notebook instance (V100 GPU).
The training runs fine when launched directly in the notebook.
The exact same script, run on the same instance but inside the official AWS SageMaker Docker image for MXNet (https://github.com/aws/sagemaker-mxnet-container), fails with this error:
Worker timed out after 120 seconds. This might be caused by - Slow transform. Please increase timeout to allow slower data loading in each worker. - Insufficient shared_memory if `timeout` is large enough. Please consider reduce `num_workers` or increase shared_memory in system.
I have never seen this error in 2 years of use. What does it mean, and why does it happen inside Docker but not outside of it?
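
Since the message mentions shared memory, one thing that could be compared between the two environments is the size of the shared memory mount. Below is a minimal sketch of such a check, not something taken from the training script. It assumes shared memory is mounted at `/dev/shm` (the usual Linux location), and `shared_memory_report` is a hypothetical helper name. Docker gives containers a small `/dev/shm` by default unless `--shm-size` is passed to `docker run`, so the numbers may differ between the notebook and the container.

```python
import os
import shutil


def shared_memory_report(path="/dev/shm"):
    """Return total and free space (in MiB) for the given mount, or None if it is missing.

    The default path /dev/shm is an assumption: it is where Linux usually
    mounts the shared memory used by multi-process data loader workers.
    """
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return {
        "path": path,
        "total_mib": usage.total / 2**20,
        "free_mib": usage.free / 2**20,
    }


if __name__ == "__main__":
    # Run this once in the notebook and once inside the container, then compare.
    print(shared_memory_report())
```

If the container reports a much smaller total than the notebook, that would point to the second cause listed in the error message rather than a slow transform.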