How to make and use a dataset for an LSTM

Jose_Lucas · July 17, 2019, 7:02pm

I have some data points and a label for each point. I want to build an LSTM to learn the dynamics of this sequence of points given the labels, but I have no idea how to create a dataset that I can use with an LSTM network in MXNet Gluon. My first thought was to make one CSV with the data and another with the labels, but I don’t think this is the proper way to prepare my data for an LSTM application. I’m a machine learning beginner and I’m completely lost.

NRauschmayr · July 19, 2019, 9:31pm

If you have text, the first step is to convert the words into a vector representation. If you have time series data, you need to normalize it. Here is some example code for a time series dataset:

import numpy as np
import mxnet as mx
from mxnet import gluon

def create_normalized_data(data, sequence_length):

    # number of time series values; data has shape (n_samples, 1)
    n_samples = data.shape[0]

    # allocate inputs (n_windows, sequence_length - 1, 1) and labels (n_windows, 1)
    x = np.zeros((n_samples - sequence_length, sequence_length - 1, 1), dtype=np.float32)
    y = np.zeros((n_samples - sequence_length, 1), dtype=np.float32)

    # create normalized sequences
    for i in range(0, n_samples - sequence_length):

        # get window of sequence_length consecutive values
        window = data[i : i + sequence_length, :]

        # normalize each value relative to the first value of the window
        normalized = [((float(p) / float(window[0])) - 1) for p in window[:]]

        # first sequence_length - 1 values are the input, the last one is the label
        x[i, :, 0] = normalized[:-1]
        y[i, 0] = normalized[-1]

    return x, y

num_epochs=5
batch_size=128
ctx = mx.cpu()

# split train/test: first 85% for training; `data` is a NumPy array of shape (n_samples, 1)
n = int(data.shape[0] * 0.85)

# how many past time steps to consider
sequence_length = 20

# get normalized train and test dataset 
x_train, y_train = create_normalized_data(data[:n,:], sequence_length)
x_test, y_test   = create_normalized_data(data[n:,:], sequence_length)

# create Dataloader
dataset = gluon.data.ArrayDataset(x_train, y_train)
train_dataloader = gluon.data.DataLoader(dataset, batch_size=batch_size, last_batch="rollover", shuffle=True)

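First you need to define a sequence length, e.g. 10, so that the first training item is x1 to x10 and its label is x11, the next training item is x2 to x11 with label x12, and so on. Note that in the code above the window of sequence_length values already includes the label: the first sequence_length - 1 values are the input and the last value is the label.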
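The windowing described above can be sketched in plain Python to make the indexing easier to follow. This is only an illustration, not the code from the post: the helper name `make_windows` and the toy `series` are made up for this example, and, as in `create_normalized_data`, the window of `sequence_length` values includes the label.

```python
def make_windows(series, sequence_length):
    """Split a list of floats into (inputs, label) pairs.

    Each window holds `sequence_length` consecutive values, rescaled
    relative to the first value of the window. The last value of the
    window is the label, the values before it are the inputs.
    """
    inputs, labels = [], []
    # same loop bounds as create_normalized_data
    for i in range(len(series) - sequence_length):
        window = series[i:i + sequence_length]
        normalized = [p / window[0] - 1 for p in window]
        inputs.append(normalized[:-1])
        labels.append(normalized[-1])
    return inputs, labels


# toy series, purely illustrative
series = [10.0, 11.0, 12.0, 15.0, 18.0]
x, y = make_windows(series, sequence_length=3)

# two windows, each with two input values and one label
print(len(x), len(x[0]), len(y))
```

With `sequence_length=3` the first window is `[10.0, 11.0, 12.0]`: the first two normalized values form the input and the third is the label. The next window starts one step later.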

A simple LSTM model may look like the following:


# Create Vanilla LSTM: 1 LSTM layer plus output layer
class VanillaLSTM(gluon.nn.HybridBlock):
    
    def __init__(self, **kwargs):
        
        super(VanillaLSTM, self).__init__(**kwargs)
        
        with self.name_scope():
            
            # NTC = data in the format of batch, time, channel
            self.lstm = gluon.rnn.LSTM(100, layout="NTC")
            
            # prediction layer
            self.dense = gluon.nn.Dense(1)
            
    # forward takes only the input; without explicit states the LSTM starts from zeros
    def hybrid_forward(self, F, x, **kwargs):

        # forward through LSTM
        x = self.lstm(x)

        # create prediction (Dense flattens the LSTM output by default)
        x = self.dense(x)

        # return prediction
        return x

And the corresponding training loop:

# Create model
model= VanillaLSTM()

# imperative -> symbolic
model.hybridize()

# initialize
model.collect_params().initialize(mx.init.Xavier(), ctx=ctx)


# Loss
l2loss = gluon.loss.L2Loss()

# Trainer
optimizer = gluon.Trainer(model.collect_params(), 'adam', {'learning_rate': 1e-4})

def train(model, train_dataloader):
    
    for epoch in range(num_epochs):
    
        losses = 0
        
        # Iterate over training data
        for idx, (batch, label) in enumerate(train_dataloader):

            # Move data to the target context (CPU here)
            batch  = batch.as_in_context(ctx)
            label = label.as_in_context(ctx)

            with mx.autograd.record():
                
                # Forward pass
                predicted = model(batch)

                # Compute loss
                loss = l2loss(predicted, label)
        
            # store loss
            losses += mx.nd.mean(loss).asscalar()
            
            # Backward pass
            loss.backward()

            # Optimize
            optimizer.step(batch_size)

        print('epoch [{}/{}], loss:{:.7f}'.format(epoch + 1, num_epochs, losses / (idx + 1)))
        
train(model, train_dataloader)

JoseLucas · July 23, 2019, 12:51pm

You are really an angel, thank you very much. I have just one question: I’m trying to load a dataset, but none of the approaches I have tried works with your example code. I tried ‘open’ and csvIter, and the one that almost worked was ‘pd.read_csv’, but I have to use ‘.values’ to access the values of the data. Now the problem is here:
----> x_train, y_train = create_normalized_data(data.values[:n,:], sequence_length)
with the following error message:
TypeError: only size-1 arrays can be converted to Python scalars

But I’m testing with a simple time series dataset found on the internet, which has the following format:
Month,Sales
1-01,266.0
1-02,145.9
1-03,183.1
I know it’s probably a really basic question, but I spent the whole day trying solutions and there is always an error.

NRauschmayr · July 23, 2019, 2:10pm

I assume there is a problem with the shape of your data array. Can you check the shape with print(data.shape)? You can read time series data the following way:

import pandas as pd
df = pd.read_csv('data.csv')
data = df[['Sales']].values

I have some example code for time series forecasting here: https://github.com/NRauschmayr/NYC_Workshop/blob/master/11-Forecasting/Forecasting.ipynb
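To illustrate the shape problem without pandas, here is a small sketch using only the standard library and the sample rows from the question. The variable names `raw`, `all_columns` and `sales_only` are made up for this illustration. Keeping both columns gives two values per row, one of them non-numeric, which is why converting a row to a single float fails. Keeping only Sales gives one float per row, which corresponds to the (n_samples, 1) shape that create_normalized_data expects.

```python
import csv
import io

# sample rows copied from the question; a real file would be read with open(...)
raw = "Month,Sales\n1-01,266.0\n1-02,145.9\n1-03,183.1\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# every column kept: two values per row, and "Month" is not a number
all_columns = [[row["Month"], row["Sales"]] for row in rows]

# only "Sales" kept: one float per row, i.e. shape (n_samples, 1)
sales_only = [[float(row["Sales"])] for row in rows]

# print (rows, columns) for both variants
print(len(all_columns), len(all_columns[0]))
print(len(sales_only), len(sales_only[0]))
```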
