How to make and use a dataset for an LSTM

I have some data points and a label for each point. I want to build an LSTM to learn the dynamics of this sequence of points given the labels, but I have no idea how to build a dataset for an LSTM network in MXNet Gluon. My first thought was to create one CSV with the data and another with the labels, but I don't think this is the proper way to use my data for an LSTM application. I'm a machine learning beginner and I'm completely lost.

If you have text, the first step is to convert the words into a vector representation. If you have time-series data, you need to normalize it. Here is some example code for a time-series dataset:

import numpy as np
import mxnet as mx
from mxnet import gluon

def create_normalized_data(data, sequence_length):
    
    # number of time series values
    n_samples = data.shape[0]  
    
    # create empty arrays: x is (n_samples - sequence_length, sequence_length - 1, 1), y is (n_samples - sequence_length, 1)
    x = np.zeros((n_samples - sequence_length, sequence_length - 1, 1), dtype=np.float32)
    y = np.zeros((n_samples - sequence_length, 1), dtype=np.float32)
    
    # create normalized sequences
    for i in range(0, n_samples - sequence_length): 
        
        # get window
        window = data[i : i + sequence_length,:] 
        
        # normalize: relative change from the first value of the window (expects a single column)
        normalized = [((float(p) / float(window[0])) - 1) for p in window[:]]    
        
        # assign 
        x[i,:,0]   = normalized[:-1]
        y[i,0]     = normalized[-1]
        
    return x, y

num_epochs=5
batch_size=128
ctx = mx.cpu()

# split train/test
n = int(data.shape[0] * 0.85)

# how many past time steps to consider
sequence_length = 20

# get normalized train and test dataset 
x_train, y_train = create_normalized_data(data[:n,:], sequence_length)
x_test, y_test   = create_normalized_data(data[n:,:], sequence_length)

# create Dataloader
dataset = gluon.data.ArrayDataset(x_train, y_train)
train_dataloader = gluon.data.DataLoader(dataset, batch_size=batch_size, last_batch="rollover", shuffle=True)
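
The normalization in create_normalized_data expresses every value as its relative change from the first value of the window, so a prediction has to be mapped back to the original scale before it can be compared with the raw data. Here is a minimal sketch of both directions in plain Python. The helpers normalize_window and denormalize are hypothetical names, not part of the code above, and the zero check is an added assumption since the original divides without one.

```python
def normalize_window(window):
    """Express each value as its relative change from the first value."""
    first = float(window[0])
    if first == 0.0:
        # Assumption: the original code has no such guard and would divide by zero.
        raise ValueError("first value of the window must be non-zero")
    return [float(p) / first - 1.0 for p in window]


def denormalize(value, first):
    """Invert the normalization: map a relative change back to the original scale."""
    return first * (value + 1.0)


window = [200.0, 210.0, 190.0]
normalized = normalize_window(window)
restored = [denormalize(v, window[0]) for v in normalized]
print(normalized)
print(restored)
```

To rescale a model prediction for a given window, the same denormalize call would be used with that window's first raw value.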

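First you need to define a sequence length. In the code above, each window holds sequence_length consecutive values: all but the last form the input and the last one is the label. For example, with a sequence length of 11, the first training item is x1 to x10 with label x11, the next training item is x2 to x11 with label x12, and so on.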
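
To make the windowing concrete, here is a minimal sketch in plain Python. It follows the convention of create_normalized_data above, where each window of sequence_length values is split into sequence_length - 1 inputs and one label, and it uses the same loop bounds. The function make_windows and the toy series are hypothetical, used only for illustration.

```python
def make_windows(series, sequence_length):
    """Split a series into (inputs, label) pairs using a sliding window.

    Each window holds `sequence_length` consecutive values: all but the
    last are the inputs, the last one is the label.
    """
    pairs = []
    # Same loop bounds as create_normalized_data above.
    for i in range(len(series) - sequence_length):
        window = series[i:i + sequence_length]
        pairs.append((window[:-1], window[-1]))
    return pairs


series = [10, 11, 12, 13, 14, 15]
pairs = make_windows(series, sequence_length=4)
for inputs, label in pairs:
    print(inputs, "->", label)
```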

A simple LSTM model may look like the following:


# Create Vanilla LSTM: 1 LSTM layer plus output layer
class VanillaLSTM(gluon.nn.HybridBlock):
    
    def __init__(self, **kwargs):
        
        super(VanillaLSTM, self).__init__(**kwargs)
        
        with self.name_scope():
            
            # NTC = data in the format of batch, time, channel
            self.lstm = gluon.rnn.LSTM(100, layout="NTC")
            
            # prediction layer
            self.dense = gluon.nn.Dense(1)
            
    # forward takes only the input; the LSTM starts from a zero state when none is passed
    def hybrid_forward(self, F, x, **kwargs):
        
        # forward through LSTM
        x = self.lstm(x)
        
        # create prediction
        x = self.dense(x)
        
        # return prediction
        return x

And the corresponding training loop:

# Create model
model= VanillaLSTM()

# imperative -> symbolic
model.hybridize()

# initialize
model.collect_params().initialize(mx.init.Xavier(), ctx=ctx)


# Loss
l2loss = gluon.loss.L2Loss()

# Trainer
optimizer = gluon.Trainer(model.collect_params(), 'adam', {'learning_rate': 1e-4})

def train(model, train_dataloader):
    
    for epoch in range(num_epochs):
    
        losses = 0
        
        # Iterate over training data
        for idx, (batch, label) in enumerate(train_dataloader):

            # Move data to the chosen context (CPU here)
            batch  = batch.as_in_context(ctx)
            label = label.as_in_context(ctx)

            with mx.autograd.record():
                
                # Forward pass
                predicted = model(batch)

                # Compute loss
                loss = l2loss(predicted, label)
        
            # store loss
            losses += mx.nd.mean(loss).asscalar()
            
            # Backward pass
            loss.backward()

            # Optimize
            optimizer.step(batch_size)

        print('epoch [{}/{}], loss:{:.7f}'.format(epoch + 1, num_epochs, losses / (idx + 1)))
        
train(model, train_dataloader)

You are really an angel, thank you very much. I have just one question: I'm trying to load a dataset, but none of the ways I have tried works with your example code. I tried 'open', CSVIter, and 'pd.read_csv'. The last one almost worked, but I have to use '.values' to access the values of the data. Now the problem is here:
----> x_train, y_train = create_normalized_data(data.values[:n,:], sequence_length)
with the following error message:
TypeError: only size-1 arrays can be converted to Python scalars

but I'm using a simple time series dataset found on the internet for testing, in the following format:
Month,Sales
1-01,266.0
1-02,145.9
1-03,183.1
I know it's probably a really basic question, but I spent all day trying solutions and there is always an error.

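I assume there is a problem with the shape of your data array. Can you check the shape with print(data.shape)? If data.values contains both the Month and Sales columns, each row has two entries, so float(window[0]) fails with exactly this error. You can read the time-series data the following way: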

import pandas as pd
df = pd.read_csv('data.csv')
data = df[['Sales']].values
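
To see why selecting only the Sales column matters, here is a small sketch that uses the standard library csv module instead of pandas, with the sample rows from the question inlined as a string. It only illustrates the expected layout and is not a replacement for the pandas snippet. The variable names are made up for this example.

```python
import csv
import io

# The three sample rows from the question, inlined so the sketch runs without a file.
raw = "Month,Sales\n1-01,266.0\n1-02,145.9\n1-03,183.1\n"

reader = csv.DictReader(io.StringIO(raw))

# Keep only the numeric column; each row becomes a one-element list,
# which corresponds to an array of shape (n_samples, 1).
rows = [[float(record["Sales"])] for record in reader]

n_samples = len(rows)
n_columns = len(rows[0])
print(n_samples, n_columns)
```

Wrapping rows with np.asarray(rows, dtype=np.float32) would give an array of shape (3, 1), which is the single-column layout create_normalized_data expects.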

I have some example code for time-series forecasting: https://github.com/NRauschmayr/NYC_Workshop/blob/master/11-Forecasting/Forecasting.ipynb