Invalid pointer error

laurenyu · October 31, 2018, 5:31pm

Hello,

I’ve been getting an invalid pointer error during prediction. I’ve tried various combinations of malloc libraries (with and without them) and toggled the related environment variables (USE_GPERFTOOLS, USE_JEMALLOC, LD_PRELOAD), without success.

The model is the basic MNIST example from https://mxnet.incubator.apache.org/tutorials/python/mnist.html

The input processing/prediction code:

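from io import StringIO

import mxnet as mx
import numpy as np

# string_like holds the CSV request payload; dtype is the numpy dtype to parse it as
stream = StringIO(string_like)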
np_array = np.genfromtxt(stream, dtype=dtype, delimiter=',')
ndarray = mx.nd.array(np_array)

[data_shape] = self._model.data_shapes

model_batch_size = data_shape[1][0]
pad_rows = max(0, model_batch_size - ndarray.shape[0])

if pad_rows:
    num_pad_values = pad_rows
    for dimension in ndarray.shape[1:]:
        num_pad_values *= dimension
    padding_shape = tuple([pad_rows] + list(ndarray.shape[1:]))
    padding = mx.ndarray.zeros(shape=padding_shape)
    ndarray = mx.ndarray.concat(ndarray, padding, dim=0)

model_input = mx.io.NDArrayIter(ndarray, batch_size=model_batch_size,
                                last_batch_handle='pad')

if pad_rows:
    def _getpad():
        return pad_rows

    model_input.getpad = _getpad

model.predict(model_input)

The stacktrace:

algo-1-1MUHA_1  | *** Error in `/usr/bin/python': free(): invalid pointer: 0x00007fe22d965560 ***
algo-1-1MUHA_1  | ======= Backtrace: =========
algo-1-1MUHA_1  | /lib/x86_64-linux-gnu/libc.so.6(+0x777e5)[0x7fe2461ed7e5]
algo-1-1MUHA_1  | /lib/x86_64-linux-gnu/libc.so.6(+0x8037a)[0x7fe2461f637a]
algo-1-1MUHA_1  | /lib/x86_64-linux-gnu/libc.so.6(cfree+0x4c)[0x7fe2461fa53c]
algo-1-1MUHA_1  | /usr/local/lib/python3.5/dist-packages/mxnet/libmxnet.so(MXExecutorReshape+0x18c4)[0x7fe2389b29b4]
algo-1-1MUHA_1  | /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call_unix64+0x4c)[0x7fe241870e20]
algo-1-1MUHA_1  | /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(ffi_call+0x2eb)[0x7fe24187088b]
algo-1-1MUHA_1  | /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(_ctypes_callproc+0x49a)[0x7fe24186b01a]
algo-1-1MUHA_1  | /usr/lib/python3.5/lib-dynload/_ctypes.cpython-35m-x86_64-linux-gnu.so(+0x9fcb)[0x7fe24185efcb]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalCodeEx+0x88a)[0x5416ea]
algo-1-1MUHA_1  | /usr/bin/python[0x4ebe37]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x252b)[0x53920b]
algo-1-1MUHA_1  | /usr/bin/python[0x540199]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x50b2)[0x53bd92]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4b04)[0x53b7e4]
algo-1-1MUHA_1  | /usr/bin/python[0x540199]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x50b2)[0x53bd92]
algo-1-1MUHA_1  | /usr/bin/python[0x540199]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x50b2)[0x53bd92]
algo-1-1MUHA_1  | /usr/bin/python[0x5406df]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x50b2)[0x53bd92]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalCodeEx+0x13b)[0x540f9b]
algo-1-1MUHA_1  | /usr/bin/python[0x4ebe37]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x252b)[0x53920b]
algo-1-1MUHA_1  | /usr/bin/python[0x5406df]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x54f0)[0x53c1d0]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4b04)[0x53b7e4]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalCodeEx+0x13b)[0x540f9b]
algo-1-1MUHA_1  | /usr/bin/python[0x4ebe37]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x252b)[0x53920b]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4b04)[0x53b7e4]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4b04)[0x53b7e4]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4b04)[0x53b7e4]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalCodeEx+0x13b)[0x540f9b]
algo-1-1MUHA_1  | /usr/bin/python[0x4ebd23]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python[0x4fb9ce]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python[0x574b36]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4ec6)[0x53bba6]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4b04)[0x53b7e4]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4b04)[0x53b7e4]
algo-1-1MUHA_1  | /usr/bin/python[0x5406df]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x54f0)[0x53c1d0]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x4b04)[0x53b7e4]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalCodeEx+0x88a)[0x5416ea]
algo-1-1MUHA_1  | /usr/bin/python[0x4ebd23]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python[0x4fb9ce]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python[0x61ef5f]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalFrameEx+0x252b)[0x53920b]
algo-1-1MUHA_1  | /usr/bin/python(PyEval_EvalCodeEx+0x13b)[0x540f9b]
algo-1-1MUHA_1  | /usr/bin/python[0x4ebe37]
algo-1-1MUHA_1  | /usr/local/lib/python3.5/dist-packages/gevent/_greenlet.cpython-35m-x86_64-linux-gnu.so(+0x18acf)[0x7fe242a69acf]
algo-1-1MUHA_1  | /usr/local/lib/python3.5/dist-packages/gevent/__hub_local.cpython-35m-x86_64-linux-gnu.so(+0x781e)[0x7fe24352a81e]
algo-1-1MUHA_1  | /usr/bin/python(PyObject_Call+0x47)[0x5c1797]
algo-1-1MUHA_1  | ======= Memory map: ========
algo-1-1MUHA_1  | 00400000-007a9000 r-xp 00000000 08:01 2393040                            /usr/bin/python3.5
algo-1-1MUHA_1  | 009a9000-009ab000 r--p 003a9000 08:01 2393040                            /usr/bin/python3.5
algo-1-1MUHA_1  | 009ab000-00a42000 rw-p 003ab000 08:01 2393040                            /usr/bin/python3.5
algo-1-1MUHA_1  | 00a42000-00a73000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 013a7000-01900000 rw-p 00000000 00:00 0                                  [heap]
algo-1-1MUHA_1  | 01900000-02a4a000 rw-p 00000000 00:00 0                                  [heap]
algo-1-1MUHA_1  | 7fe218000000-7fe218021000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe218021000-7fe21c000000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe21c000000-7fe21c06c000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe21c06c000-7fe220000000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe220000000-7fe220021000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe220021000-7fe224000000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe224000000-7fe224021000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe224021000-7fe228000000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe228000000-7fe228021000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe228021000-7fe22c000000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22d074000-7fe22d134000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22d134000-7fe22d135000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22d135000-7fe22d935000 rwxp 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22d935000-7fe22ddf5000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22ddf5000-7fe22ddf9000 r-xp 00000000 08:01 2398965                    /usr/lib/python3.5/lib-dynload/termios.cpython-35m-x86_64-linux-gnu.so
algo-1-1MUHA_1  | 7fe22ddf9000-7fe22dff8000 ---p 00004000 08:01 2398965                    /usr/lib/python3.5/lib-dynload/termios.cpython-35m-x86_64-linux-gnu.so
algo-1-1MUHA_1  | 7fe22dff8000-7fe22dff9000 r--p 00003000 08:01 2398965                    /usr/lib/python3.5/lib-dynload/termios.cpython-35m-x86_64-linux-gnu.so
algo-1-1MUHA_1  | 7fe22dff9000-7fe22dffb000 rw-p 00004000 08:01 2398965                    /usr/lib/python3.5/lib-dynload/termios.cpython-35m-x86_64-linux-gnu.so
algo-1-1MUHA_1  | 7fe22dffb000-7fe22e07b000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22e07b000-7fe22e07c000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22e07c000-7fe22e87c000 rwxp 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22e87c000-7fe22e87d000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22e87d000-7fe22f07d000 rwxp 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22f07d000-7fe22f07e000 ---p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22f07e000-7fe22f87e000 rwxp 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22f8b6000-7fe22f9f6000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22f9f6000-7fe22f9f9000 r-xp 00000000 08:01 2398948                    /usr/lib/python3.5/lib-dynload/_multiprocessing.cpython-35m-x86_64-linux-gnu.so
algo-1-1MUHA_1  | 7fe22f9f9000-7fe22fbf8000 ---p 00003000 08:01 2398948                    /usr/lib/python3.5/lib-dynload/_multiprocessing.cpython-35m-x86_64-linux-gnu.so
algo-1-1MUHA_1  | 7fe22fbf8000-7fe22fbf9000 r--p 00002000 08:01 2398948                    /usr/lib/python3.5/lib-dynload/_multiprocessing.cpython-35m-x86_64-linux-gnu.so
algo-1-1MUHA_1  | 7fe22fbf9000-7fe22fbfa000 rw-p 00003000 08:01 2398948                    /usr/lib/python3.5/lib-dynload/_multiprocessing.cpython-35m-x86_64-linux-gnu.so
algo-1-1MUHA_1  | 7fe22fbfa000-7fe22fe7a000 rw-p 00000000 00:00 0
algo-1-1MUHA_1  | 7fe22fe7a000-7fe22fe7e000 r-xp 00000000 08:01 2506752                    /lib/x86_64-linux-gnu/libuuid.so.1.3.0
algo-1-1MUHA_1  | 7fe22fe7e000-7fe23007d000 ---p 00004000 08:01 2506752                    /lib/x86_64-linux-gnu/libuuid.so.1.3.0
algo-1-1MUHA_1  | 7fe23007d000-7fe23007e000 r--p 00003000 08:01 2506752                    /lib/x86_64-linux-gnu/libuuid.so.1.3.0

Any suggestions for how to proceed?

NRauschmayr · October 31, 2018, 6:00pm

I tried to reproduce your problem, but in my case it did not throw an error. Could you provide the full code?

laurenyu · October 31, 2018, 6:45pm

Training script:

import argparse
import gzip
import json
import logging
import os
import struct

import mxnet as mx
import numpy as np


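def load_data(path):
    # Labels file: 8-byte header (magic number, item count), then one byte per label.
    with gzip.open(find_file(path, "labels.gz")) as flbl:
        struct.unpack(">II", flbl.read(8))
        # np.frombuffer replaces np.fromstring, which is deprecated for binary data.
        labels = np.frombuffer(flbl.read(), dtype=np.int8)
    # Images file: 16-byte header (magic number, item count, rows, cols), then raw pixels.
    with gzip.open(find_file(path, "images.gz")) as fimg:
        _, _, rows, cols = struct.unpack(">IIII", fimg.read(16))
        images = np.frombuffer(fimg.read(), dtype=np.uint8).reshape(len(labels), rows, cols)
        # Add a channel axis and scale pixel values to [0, 1].
        images = images.reshape(images.shape[0], 1, 28, 28).astype(np.float32) / 255
    return labels, images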


def find_file(root_path, file_name):
    for root, dirs, files in os.walk(root_path):
        if file_name in files:
            return os.path.join(root, file_name)


def build_graph():
    data = mx.sym.var('data')
    data = mx.sym.flatten(data=data)
    fc1 = mx.sym.FullyConnected(data=data, num_hidden=128)
    act1 = mx.sym.Activation(data=fc1, act_type="relu")
    fc2 = mx.sym.FullyConnected(data=act1, num_hidden=64)
    act2 = mx.sym.Activation(data=fc2, act_type="relu")
    fc3 = mx.sym.FullyConnected(data=act2, num_hidden=10)
    return mx.sym.SoftmaxOutput(data=fc3, name='softmax')


def get_train_context(num_gpus):
    if num_gpus:
        return [mx.gpu(i) for i in range(num_gpus)]
    else:
        return mx.cpu()

def train(batch_size, epochs, learning_rate, num_gpus, training_channel, testing_channel,
          hosts, current_host, model_dir):
    (train_labels, train_images) = load_data(training_channel)
    (test_labels, test_images) = load_data(testing_channel)

    # Data parallel training - shard the data so each host
    # only trains on a subset of the total data.
    shard_size = len(train_images) // len(hosts)
    for i, host in enumerate(hosts):
        if host == current_host:
            start = shard_size * i
            end = start + shard_size
            break

    train_iter = mx.io.NDArrayIter(train_images[start:end], train_labels[start:end], batch_size,
                                   shuffle=True)
    val_iter = mx.io.NDArrayIter(test_images, test_labels, batch_size)

    logging.getLogger().setLevel(logging.DEBUG)

    kvstore = 'local' if len(hosts) == 1 else 'dist_sync'

    mlp_model = mx.mod.Module(symbol=build_graph(),
                              context=get_train_context(num_gpus))
    mlp_model.fit(train_iter,
                  eval_data=val_iter,
                  kvstore=kvstore,
                  optimizer='sgd',
                  optimizer_params={'learning_rate': learning_rate},
                  eval_metric='acc',
                  batch_end_callback=mx.callback.Speedometer(batch_size, 100),
                  num_epoch=epochs)

    if len(hosts) == 1 or current_host == hosts[0]:
        save(model_dir, mlp_model)


def save(model_dir, model):
    model.symbol.save(os.path.join(model_dir, 'model-symbol.json'))
    model.save_params(os.path.join(model_dir, 'model-0000.params'))

    signature = [{'name': data_desc.name, 'shape': [dim for dim in data_desc.shape]}
                 for data_desc in model.data_shapes]
    with open(os.path.join(model_dir, 'model-shapes.json'), 'w') as f:
        json.dump(signature, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('--batch-size', type=int, default=100)
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--learning-rate', type=float, default=0.1)

    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])

    parser.add_argument('--current-host', type=str, default=os.environ['SM_CURRENT_HOST'])
    # type=list would split a command-line string into characters; parse it as JSON instead.
    parser.add_argument('--hosts', type=json.loads, default=json.loads(os.environ['SM_HOSTS']))

    args = parser.parse_args()

    num_gpus = int(os.environ['SM_NUM_GPUS'])

    train(args.batch_size, args.epochs, args.learning_rate, num_gpus, args.train, args.test,
          args.hosts, args.current_host, args.model_dir)

I’m running this in a Docker container - you can see the Dockerfile here: https://github.com/aws/sagemaker-mxnet-container/blob/sagemaker-containers-migration/docker/1.3.0/final/Dockerfile.cpu. It’s installing mxnet 1.3.0.post0.

For the prediction part, I’m using SageMaker’s Batch Transform functionality - trying to do something similar to this example notebook: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/mxnet_mnist/mxnet_mnist_with_batch_transform.ipynb

The input data is in s3://sagemaker-sample-data-us-west-2/batch-transform/mnist (should be public)

simonco · October 31, 2018, 8:19pm

@laurenyu Do you get this error with a standard pip install of MXNet?

Did you build MXNet from source to debug or were you already compiling from source?

laurenyu · October 31, 2018, 8:33pm

I’m installing MXNet from PyPI (using the 1.3.0.post0 version) in a Docker container - the Dockerfile can be found here: https://github.com/aws/sagemaker-mxnet-container/blob/sagemaker-containers-migration/docker/1.3.0/final/Dockerfile.cpu

NRauschmayr · November 1, 2018, 3:59am

Thanks, Lauren, for sharing your code. After reproducing the problem, I found that model.predict failed because the input data had the wrong shape.

The following code reads one image from a file, and the resulting ndarray has the shape (784,1):

np_array = np.genfromtxt(stream, dtype=dtype, delimiter=',')
ndarray = mx.nd.array(np_array)

The following code segment is meant to ensure that the data has the right shape:

pad_rows = max(0, model_batch_size - ndarray.shape[0])
if pad_rows:
    num_pad_values = pad_rows
    for dimension in ndarray.shape[1:]:
        num_pad_values *= dimension
    padding_shape = tuple([pad_rows] + list(ndarray.shape[1:]))
    padding = mx.ndarray.zeros(shape=padding_shape)
    ndarray = mx.ndarray.concat(ndarray, padding, dim=0)

However, pad_rows always ends up with the wrong value (mostly 0) because of max(0, model_batch_size - ndarray.shape[0]). In this example, one image was loaded, so ndarray.shape[0] was 784 instead of 1. The condition if pad_rows was therefore not met, and the ndarray kept the shape (784,1). That caused model.predict to fail, because it expected the data to have the shape [batch_size, 1, 28, 28].
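To make the shape issue concrete, here is a minimal NumPy-only sketch (no MXNet needed to run it). The all-zero CSV row and the batch size of 100 are assumptions for illustration: the row stands in for one serialized MNIST image, and 100 is the default batch size in the training script. It shows the leading dimension coming out as 784, the padding count collapsing to 0, and one possible fix: reshaping to (N, 1, 28, 28) before computing the padding.

```python
from io import StringIO

import numpy as np

# Hypothetical stand-in for one MNIST image serialized as a single CSV row.
csv_row = ",".join(["0"] * 784)

flat = np.genfromtxt(StringIO(csv_row), dtype=np.float32, delimiter=",")
model_batch_size = 100  # assumed; the training script's default batch size

# The leading dimension counts pixels, not images, so the padding logic
# sees "784 rows" and concludes that nothing needs to be padded.
pad_rows_buggy = max(0, model_batch_size - flat.shape[0])

# Reshape to (N, 1, 28, 28) first so that shape[0] counts images.
images = flat.reshape(-1, 1, 28, 28)
pad_rows = max(0, model_batch_size - images.shape[0])

# Pad with zeros along the batch axis, mirroring the original padding code.
padding = np.zeros((pad_rows,) + images.shape[1:], dtype=images.dtype)
batch = np.concatenate([images, padding], axis=0)

print(flat.shape[0], pad_rows_buggy)        # 784 0
print(images.shape, pad_rows, batch.shape)  # (1, 1, 28, 28) 99 (100, 1, 28, 28)
```

In the original prediction code, the same reshape could presumably be applied to np_array before it is passed to mx.nd.array, so that the rest of the padding logic works unchanged.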

laurenyu · November 1, 2018, 4:18am

ah, makes sense! Thanks for your help!
