Getting an error when trying to train my own model on ImageNet

mgsmsobuj · January 23, 2020, 1:25am

I am using the "Train Your Own Model on ImageNet" tutorial page to train my model on ImageNet.

I got the following error when I tried to run the training process:
NameError: name 'L' is not defined
I used the same code given on that page. Everything worked fine except for the final training step.

Here is a picture of my code. The error is at line 20: loss = [L(yhat, y) for yhat, y in zip(outputs, label)]. I couldn't figure out what L does in this code.

Error log

TristonC · January 23, 2020, 10:06pm

L is the loss function. In that line it should be loss_fn instead.
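
A minimal sketch of that fix, assuming the tutorial defines its loss object under the name loss_fn (in MXNet Gluon this would typically be something like gluon.loss.SoftmaxCrossEntropyLoss()). To keep the snippet runnable without MXNet, loss_fn below is a hypothetical pure-Python stand-in, and the outputs and label values are made up:

```python
import math

def loss_fn(yhat, y):
    """Stand-in softmax cross-entropy for one sample (hypothetical; the
    tutorial would use a Gluon loss object instead)."""
    m = max(yhat)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in yhat))
    return log_sum_exp - yhat[y]

# Made-up outputs (logits) and labels, one entry per data split.
outputs = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
label = [0, 2]

# The failing line called an undefined name L; calling the name that is
# actually defined (loss_fn) avoids the NameError.
loss = [loss_fn(yhat, y) for yhat, y in zip(outputs, label)]
print(loss)
```

The point is only the naming: whatever name the loss object is bound to earlier in the script is the name the list comprehension has to call.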

mgsmsobuj · January 23, 2020, 10:30pm

Thanks TristonC

I used loss_fn instead of L. Now I am getting the following error:
MXNetError: [17:29:26] src/storage/./pooled_storage_manager.h:151: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x40ba6a) [0x7fc3ef5aaa6a]
[bt] (1) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x40c081) [0x7fc3ef5ab081]
[bt] (2) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3428589) [0x7fc3f25c7589]
[bt] (3) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x342ca9f) [0x7fc3f25cba9f]
[bt] (4) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::NDArray::CheckAndAlloc() const+0x24b) [0x7fc3ef5f9e5b]
[bt] (5) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2d6f8cd) [0x7fc3f1f0e8cd]
[bt] (6) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x213) [0x7fc3f1f0eda3]
[bt] (7) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2cc1689) [0x7fc3f1e60689]
[bt] (8) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2ccafc4) [0x7fc3f1e69fc4]
[bt] (9) /home/murshed/anaconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x2ccf2b3) [0x7fc3f1e6e2b3]

mgsmsobuj · January 23, 2020, 10:34pm

My system has a Core i9 processor, 64 GB of RAM, and a 12 GB Nvidia GPU. Still, I am getting "cudaMalloc failed: out of memory".
Is there any way to train my model on the ImageNet dataset with my PC?

TristonC · January 23, 2020, 10:42pm

Could you try reducing the batch size? Also, training on the ImageNet dataset on your PC for the 120 epochs mentioned in the tutorial will take quite a while (definitely not something to sit and wait for). I would not encourage doing that, especially since you only have one GPU.
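
A rough back-of-the-envelope sketch of why this takes so long. The ImageNet-1k training set has 1,281,167 images; the throughput figure used below is a made-up placeholder, not a measurement from this thread, so substitute whatever your own GPU sustains:

```python
import math

IMAGENET_TRAIN_IMAGES = 1_281_167  # ImageNet-1k (ILSVRC 2012) training set size

def iterations_per_epoch(batch_size):
    """Number of batches needed to see every training image once."""
    return math.ceil(IMAGENET_TRAIN_IMAGES / batch_size)

def estimated_days(num_epochs, images_per_second):
    """Wall-clock estimate in days for a given sustained throughput."""
    seconds = num_epochs * IMAGENET_TRAIN_IMAGES / images_per_second
    return seconds / 86_400

# Batches per epoch at a small batch size.
print(iterations_per_epoch(8))

# ASSUMPTION: 300 images/s is a hypothetical throughput; measure your own.
print(round(estimated_days(120, 300.0), 1))
```

Reducing the batch size lowers the memory needed per step, but it does not reduce the total number of images processed per epoch, so the overall training time stays long on a single GPU.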

mgsmsobuj · January 24, 2020, 12:05am

TristonC, thanks again for your quick reply.

I want to run only 2 epochs for testing. If it works for 2 epochs, I will use AWS to run the training. My current batch size is 8, and that resolves the memory-related issue. However, now I am getting the following error at line 30:
NameError: name 'opt' is not defined

What does opt stand for? Is it optimizer or something else?

TristonC · January 24, 2020, 12:26am

I don't think you really need the 'opt.' there. It must be a copy-paste error from a file that has not really been QA-verified. I believe opt stands for option.
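
If opt is indeed an options object, as suggested above, a standalone script would typically create it from command-line arguments, for example with argparse. The following is a sketch under that assumption; the parse_args helper and the flag names are hypothetical and not taken from the tutorial:

```python
import argparse

def parse_args(argv=None):
    """Build a small options namespace (flag names are hypothetical)."""
    parser = argparse.ArgumentParser(description="Training options sketch")
    parser.add_argument("--batch-size", type=int, default=8)
    parser.add_argument("--num-epochs", type=int, default=2)
    return parser.parse_args(argv)

# In a full script, opt would come from the command line; an empty
# argument list is passed here so the defaults are used.
opt = parse_args([])
print(opt.batch_size, opt.num_epochs)

# Without such a namespace, plain variables do the same job, which is
# why dropping the 'opt.' prefix works when the variables already exist.
batch_size = 8
num_epochs = 2
```

A snippet copied out of such a script keeps its opt. prefixes, which raises a NameError on any page or notebook that never creates opt.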

mgsmsobuj · January 24, 2020, 3:47pm

I removed opt from my code. Now it is working. Thanks a lot, @TristonC.
Is there any way I can contribute my current code to this page, so that anyone using this page can get help?

TristonC · January 24, 2020, 5:56pm

You can try filing an issue on the MXNet GitHub repo.
