Kernel dies when classifying test set and saving result

Has anyone encountered the issue where the notebook kernel just dies when trying to run the last block to classify the test set? This is really frustrating because it also wipes out our trained network, and hours of work are completely gone…
If anyone knows how to resolve it, it’s much appreciated. Thanks!

Yeah, this happens to me too. Specifically, it’s when it’s processing test set images numbered between 222,000 and 223,000.

Might be one bad example or something in the test set that’s killing the kernel.

@ryantheisen @gold_piggy

Here’s the error message:

"terminate called after throwing an instance of ‘cv::Exception’
what(): OpenCV(3.4.2) /home/travis/build/dmlc/mxnet-distro/deps/opencv-3.4.2/modules/imgcodecs/src/loadsave.cpp:737 error: (-215:Assertion failed) !buf.empty()&& buf.isContinuous() in function ‘imdecode_’

Aborted (core dumped)"

Found the file in question: it’s test file “223065.png”. We’ll probably just have to guess a random prediction for this one.

EDIT: More issues at 223066 and 223067. I think the test set got screwed up past 223065; please fix the test set.

EDIT 2: Never mind, it’s not a test set problem. Perhaps an MXNet bug? Other people don’t have issues with the same test set on the Kaggle challenge.
@ryantheisen @gold_piggy @smolix @mli
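
For anyone who wants to double-check whether a given test file is actually unreadable, a quick scan like this works (the directory name is a placeholder for wherever your test images live, and it assumes the opencv-python package is installed):

```python
import os

import cv2

test_dir = 'PATH-TO-TEST-IMAGES'  # placeholder
for fname in sorted(os.listdir(test_dir)):
    img = cv2.imread(os.path.join(test_dir, fname))
    if img is None:  # cv2.imread returns None for files it cannot decode
        print('could not decode:', fname)
```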

Same problem here; I’ve been stuck for nearly two days. I think it is probably because of a memory limitation. When I use AWS, everything seems fine (so far).

I think a possible solution is to break up the saving part so that it appends to submission.csv one batch at a time. I haven’t fully tested this out yet because I’m re-training my model :(
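
Something along these lines, roughly (an untested sketch; `net`, `test_iter`, `ctx`, and the class-name list `classes` are whatever the notebook already defines, and the ids here are just sequential for illustration):

```python
import csv

with open('submission.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'label'])
    idx = 1
    for X, _ in test_iter:
        y_hat = net(X.as_in_context(ctx))  # predict one batch
        for p in y_hat.argmax(axis=1).astype('int32').asnumpy():
            writer.writerow([idx, classes[int(p)]])  # write immediately
            idx += 1
```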

That doesn’t work; I tried it, and it consistently crashes after 223065. I also tried starting from 224000, and it still crashes.

It’s not a memory limitation. I ran htop on the system it was running on, and at the time of the crash there were about 50 GB of RAM free and 6 GB free on the GPU.

Seems to be an issue in MXNet Gluon, similar to this:

Hey, were you running on GPU? Did it run through the last cell successfully before?

I did not run into any issues after running through the whole notebook…

Yes, running on GPU. The crash occurs before the data is even loaded onto the GPU, however. It’s from the test_iter loading the image and applying the transforms.
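
If anyone wants to confirm that on their own machine, iterating over the dataset object directly (no network, no GPU) reproduces it. Since the abort happens in C++, Python can’t catch it, but the last index printed tells you which file the process died on; `test_ds` here means the Gluon dataset from the notebook:

```python
for i in range(len(test_ds)):
    print(i, flush=True)  # the last printed index is the offending file
    test_ds[i]            # loads and transforms one image; no net, no GPU
```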

@gold_piggy I was running on GPU and mine crashed at 227328. I tried to save the result from each batch one at a time, but as @jesbu1 pointed out, it did not work.

Not sure what you mean by “run through the last cell successfully before”. If you mean with the original code, it runs on the tiny demo dataset, so there isn’t any issue.

Can you email me code producing the error so I can see if I get the same on my machine?

Found a fix:

Install PyTorch from here: pytorch.org

Replace transform_test with

import torch
import torchvision
import torchvision.transforms as transforms

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.4914, 0.4822, 0.4465],
                         [0.2023, 0.1994, 0.2010])])

Replace test_ds with
test_ds = torchvision.datasets.ImageFolder('PATH-TO-FILES', transform=transform_test)

Replace test_iter with
test_iter = torch.utils.data.DataLoader(test_ds, batch_size=batch_size, shuffle=False)

In the last block, make the first two lines of the for loop through test_iter this:

for X, _ in test_iter:
    X = nd.array(X.numpy())

This is a Gluon bug where it doesn’t catch an OpenCV error thrown in C/C++ on some of the garbage test images (Kaggle pads the test set with 290,000 garbage images since there are only 10,000 real test images); for some reason, however, the PyTorch DataLoader handles them fine.
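
For reference, the whole workaround put together looks roughly like this (a sketch, assuming `net`, `ctx`, and `batch_size` come from the notebook, and the path is a placeholder):

```python
import torch
import torchvision
import torchvision.transforms as transforms
from mxnet import nd

transform_test = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.4914, 0.4822, 0.4465],
                         [0.2023, 0.1994, 0.2010])])

# torchvision decodes the images instead of MXNet/OpenCV
test_ds = torchvision.datasets.ImageFolder('PATH-TO-FILES',  # placeholder
                                           transform=transform_test)
test_iter = torch.utils.data.DataLoader(test_ds, batch_size=batch_size,
                                        shuffle=False)

preds = []
for X, _ in test_iter:
    X = nd.array(X.numpy())            # hand the decoded batch back to MXNet
    y_hat = net(X.as_in_context(ctx))
    preds.extend(y_hat.argmax(axis=1).astype('int32').asnumpy())
```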


I was able to run the last cell without any errors (using MXNet), so YMMV.