PyTorch Binary Classification - Same Network Structure, 'Simpler' Data, But Worse Performance?

April 05, 2024

To get to grips with PyTorch (and deep learning in general) I started by working through some basic classification examples. One such example was classifying a non-linear dataset c

Solution 1:

TL;DR

Your input data is not normalized.

Normalize it: x_data = (x_data - x_data.mean()) / x_data.std()
Increase the learning rate: optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

You'll get convergence in only 1000 iterations.

More details

The key difference between the two examples you have is that the data x in the first example is centered around (0, 0) and has very low variance. On the other hand, the data in the second example is centered around 92 and has relatively large variance.

This initial bias in the data is not taken into account when you randomly initialize the weights, which is done on the assumption that the inputs are roughly normally distributed around zero. It is almost impossible for the optimization process to compensate for this gross deviation, so the model gets stuck in a sub-optimal solution.
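One plausible way to see this effect, assuming a sigmoid hidden layer like the one in the model shown in this answer, is to compare how steep the sigmoid is on raw versus normalized inputs. The data below (one feature centered near 92) and the helper name mean_sigmoid_slope are made up for illustration; this is a sketch of the mechanism, not the original experiment.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(1, 16)  # uses PyTorch's default random initialization

# Hypothetical inputs: one feature centered near 92 (illustrative values only)
x_raw = 92.0 + 10.0 * torch.randn(500, 1)
x_norm = (x_raw - x_raw.mean()) / x_raw.std()

def mean_sigmoid_slope(x):
    # The derivative of sigmoid(z) is s * (1 - s); it peaks at 0.25 when z = 0
    with torch.no_grad():
        s = torch.sigmoid(layer(x))
    return (s * (1 - s)).mean().item()

slope_raw = mean_sigmoid_slope(x_raw)
slope_norm = mean_sigmoid_slope(x_norm)
print(f"mean sigmoid slope, raw inputs:        {slope_raw:.4f}")
print(f"mean sigmoid slope, normalized inputs: {slope_norm:.4f}")
```

A small slope means the hidden units are saturated, so very little gradient flows back to the weights of that layer.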

Once you normalize the inputs, by subtracting the mean and dividing by the std, the optimization process becomes stable again and rapidly converges to a good solution.
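A minimal sketch of the normalize-then-train recipe, under stated assumptions: the data is synthetic (a single HR-like feature near 92, labeled 1 when it is at least 91), and the small nn.Sequential network stands in for the model from the question, which is not reproduced here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical data: one feature near 92, label 1 when the value is >= 91
x_data = 92.0 + 10.0 * torch.randn(500, 1)
y_data = (x_data >= 91.0).float()

# Normalize with statistics computed once on the training data
mean, std = x_data.mean(), x_data.std()
x_norm = (x_data - mean) / std

# Stand-in network: one sigmoid hidden layer, sigmoid output
model = nn.Sequential(nn.Linear(1, 4), nn.Sigmoid(), nn.Linear(4, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

losses = []
for _ in range(1000):
    optimizer.zero_grad()
    loss = criterion(model(x_norm), y_data)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

Keep mean and std around: any new input has to be normalized with the same training statistics before it is passed to the model.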

For more details about input normalization and weight initialization, you can read section 2.2 in He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (ICCV 2015).

What if I cannot normalize the data?

If, for some reason, you cannot compute the mean and std of the data in advance, you can still use nn.BatchNorm1d to estimate them and normalize the data as part of the training process. For example:


import torch
import torch.nn as nn

class Model(nn.Module):
    def __init__(self, input_size, H1, output_size):
        super().__init__()
        self.bn = nn.BatchNorm1d(input_size)  # adding batchnorm on the raw inputs
        self.linear = nn.Linear(input_size, H1)
        self.linear2 = nn.Linear(H1, output_size)

    def forward(self, x):
        x = torch.sigmoid(self.linear(self.bn(x)))  # batchnorm the input x
        x = torch.sigmoid(self.linear2(x))  # output is a probability in (0, 1)
        return x

This modification, without any change to the input data, yields similar convergence after only 1000 epochs.
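To illustrate what the batchnorm layer does to un-normalized inputs, here is a small check on a made-up batch centered near 92 (the values are an assumption for illustration). In training mode the layer standardizes each feature with the statistics of the current batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(1)

# Hypothetical un-normalized batch: one feature centered near 92
x = 92.0 + 10.0 * torch.randn(256, 1)

bn.train()  # training mode: normalize with the current batch's mean and variance
out = bn(x)

out_mean = out.mean().item()
out_std = out.std(unbiased=False).item()
print(f"input mean {x.mean().item():.2f}, output mean {out_mean:.4f}, output std {out_std:.4f}")
```

Remember to call model.eval() before making predictions, so that the layer uses the running statistics it accumulated during training rather than the statistics of whatever batch it is given.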

A minor comment

For numerical stability, it is better to use nn.BCEWithLogitsLoss instead of nn.BCELoss. To this end, you need to remove the torch.sigmoid from the forward() output; the sigmoid will be computed inside the loss. See, for example, this thread regarding the related sigmoid + cross entropy loss for binary predictions.
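A short sketch of the difference, using made-up logits and targets. For moderate values the two formulations agree; the last lines print both losses for an extreme logit (200 is an arbitrary illustrative value) so the two can be compared directly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(8, 1)                     # raw model outputs, no sigmoid applied
targets = torch.randint(0, 2, (8, 1)).float()  # binary labels as floats

# Option A: sigmoid in the model, then BCELoss
loss_bce = nn.BCELoss()(torch.sigmoid(logits), targets)
# Option B: raw logits passed straight to BCEWithLogitsLoss
loss_logits = nn.BCEWithLogitsLoss()(logits, targets)
print(torch.allclose(loss_bce, loss_logits, atol=1e-6))

# A very confident wrong prediction: logit 200 for a target of 0
big_logit = torch.tensor([[200.0]])
zero_target = torch.zeros(1, 1)
print(nn.BCEWithLogitsLoss()(big_logit, zero_target).item())
print(nn.BCELoss()(torch.sigmoid(big_logit), zero_target).item())
```

With this change the model returns logits, so apply torch.sigmoid to its output at prediction time whenever you need probabilities.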

Solution 2:

Let's start by understanding how neural networks work. Neural networks learn patterns from data, hence the need for large datasets. In the second example, the pattern you want the network to find is: if HR < 91, label = 0. This condition can be represented by the formula sigmoid((HR - 91) * 1). If you plug various values into the formula, you can see that all values < 91 give an output below 0.5 (label 0) and the others give an output of 0.5 or more (label 1). I inferred this formula myself; it could be anything, as long as it gives the correct values.
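Plugging a few values into that formula can be sketched as follows. The HR values are arbitrary examples, and the 0.5 cut-off used to turn the sigmoid output into a label is an assumption, since the formula itself only produces a value between 0 and 1.

```python
import torch

# A few arbitrary example values around the boundary of 91
hr = torch.tensor([80.0, 90.0, 91.0, 92.0, 100.0])

# The formula from the text: w = 1 and b = -91 in sigmoid(w * HR + b)
prob = torch.sigmoid((hr - 91.0) * 1.0)

# Assumed decision rule: label 1 when the output is at least 0.5
label = (prob >= 0.5).long()

for h, p, l in zip(hr.tolist(), prob.tolist(), label.tolist()):
    print(f"HR={h:5.1f}  sigmoid={p:.4f}  label={l}")
```

Values far from 91 give outputs close to 0 or 1, while values near 91 give outputs close to 0.5, so a larger weight would make the decision sharper.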

Basically, we apply the formula wx + b, where x is our input data, and we learn the values of w and b. Initially the values are all random, so moving b from, say, 1030131190 (a random value) to around 98 is fast: the loss is large, so the learning rate lets the values jump quickly. But once you reach 98, the loss is smaller and each update moves the values less, so it takes more time to get closer to 91; hence the slow decrease in loss. As the values get closer, the steps become even smaller.

This can be confirmed from the loss values: they decrease constantly. Initially the decrease is large, but then it becomes smaller. Your network is still learning, just slowly.

This is why deep learning often uses a stepped learning rate: you start with a larger learning rate and decrease it as the number of epochs increases, so that training as a whole converges faster.
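In PyTorch, one way to get a stepped learning rate is torch.optim.lr_scheduler.StepLR. The sketch below uses a placeholder linear model and random data; the starting rate of 0.01, the step size of 100 epochs, and the decay factor of 0.5 are arbitrary choices for illustration, not values from the question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(1, 1)                        # placeholder model
x, y = torch.randn(16, 1), torch.randn(16, 1)  # placeholder data

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Multiply the learning rate by 0.5 every 100 epochs (illustrative values)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

lrs = []
for epoch in range(300):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the schedule once per epoch, after optimizer.step()
    lrs.append(scheduler.get_last_lr()[0])

print(lrs[0], lrs[99], lrs[199], lrs[299])
```

The step size and decay factor usually need some tuning: decaying too early can slow training down instead of speeding it up.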
