Understanding the point of the sigmoid function in logistic regression

The sigmoid function, f(x) = 1/(1+e^(-x)), returns values in [0.5, 1) for all x >= 0, and values in (0, 0.5) for all x < 0. (I believe.)
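A quick sketch to check that claim (using NumPy; the boundary value at x = 0 is exact):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# sigmoid(0) is exactly 0.5, and sigmoid is monotonically increasing,
# so x >= 0 corresponds exactly to sigmoid(x) >= 0.5
print(sigmoid(0))    # 0.5
print(sigmoid(5))    # close to 1, but never reaches it
print(sigmoid(-5))   # close to 0, but never reaches it
```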

In mission 241, we’re asked to apply the sigmoid function to each value x in an array, then set x to 1 if sigmoid(x) >= 0.5, and to 0 if sigmoid(x) < 0.5. But seeing as the sigmoid function returns a number greater than or equal to 0.5 when fed a number greater than or equal to zero, and a number less than 0.5 when fed a number less than zero, why not cut out the middleman and skip applying the sigmoid function entirely? Instead, why not just set each value to 1 if it is greater than or equal to zero, and to 0 if it is less than zero?

For the purpose of mission 241, I tested this approach and it worked. See the following code:

import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.datasets import make_classification

class_data = make_classification(n_samples=100, n_features=4, random_state=1)
class_features = pd.DataFrame(class_data[0])
class_labels = pd.Series(class_data[1])
class_features['bias'] = 1

def log_train(class_features, class_labels):
    lr = SGDClassifier()
    lr.fit(class_features, class_labels)  # the model must be fit before coef_ exists
    return lr.coef_

def sigmoid_skip(linear_combination):
    # Threshold the raw values directly: non-negative -> 1, negative -> 0
    for i in range(len(linear_combination)):
        if linear_combination[i] >= 0:
            linear_combination[i] = 1
        else:
            linear_combination[i] = 0
    return linear_combination

def sigmoid(linear_combination):
    # Map each raw value into (0, 1)
    for i in range(len(linear_combination)):
        linear_combination[i] = 1/(1+np.exp(-linear_combination[i]))
    return linear_combination

def log_feedforward(class_features, log_train_weights):
    linear_combination = np.dot(class_features, np.transpose(log_train_weights))
    # To try the sigmoid version instead, comment OUT the two lines below...
    log_predictions = sigmoid_skip(linear_combination)
    return log_predictions

    # ...and uncomment these lines:

    # log_predictions = sigmoid(linear_combination)
    # for i in range(len(log_predictions)):
    #     if log_predictions[i] >= 0.5:
    #         log_predictions[i] = 1
    #     else:
    #         log_predictions[i] = 0
    # return log_predictions

# Uncomment this code when you're ready to test your functions.
log_train_weights = log_train(class_features, class_labels)
log_predictions = log_feedforward(class_features, log_train_weights)

This code uses the approach that skips applying the sigmoid function and directly converts each value to zero or one. The code gives the same result if one comments out the lines marked “comment out” and uncomments the lines marked “uncomment”.

I’m sure the sigmoid function is important, but I am a little confused as to why. I know one reason is that it converts numbers to a value between zero and 1, but I imagine that can’t be the only reason, seeing as one can simply code something along the lines of

if x >= 0:
    return 1
else:
    return 0

The above code must also run faster than applying the sigmoid function to each value and then applying another function to check whether each result is greater than or equal to 0.5.
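For what it’s worth, here is a small sketch confirming the two approaches agree whenever the threshold is exactly 0.5 (vectorized with NumPy rather than a Python loop):

```python
import numpy as np

x = np.array([-2.0, -0.1, 0.0, 0.3, 1.5])

# Route 1: apply sigmoid, then threshold at 0.5
via_sigmoid = (1 / (1 + np.exp(-x)) >= 0.5).astype(int)

# Route 2: skip sigmoid, threshold the raw values at 0
direct = (x >= 0).astype(int)

print(np.array_equal(via_sigmoid, direct))  # True
```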

So, what exactly are the reasons why the sigmoid function is so important, and how important are such reasons to know for the purposes of being a good data scientist?

That’s an excellent question and observation.

You are right: you can remove the middleman, as you put it. But only in this scenario.

As you point out, sigmoid outputs values between 0 and 1. And that’s very important, because it means we can interpret those output values as probabilities, since probability also ranges between 0 and 1.

So, when you say something like sigmoid(x) > 0.5, that 0.5 acts as a threshold. It is basically the equivalent of asking whether the probability associated with x is more than 0.5.

The reason that’s important is classification tasks. Applying the sigmoid gives us a set of probability values that can tell us the probability of something. For example, let’s say you have 10 images: 8 of cats and 2 of dogs. We train a model to help us classify each image.

But the model returns values we can’t really make sense of. For simplicity, let’s assume those values are something like -1.5, 0.3, 1.2, and so on. How do we look at those and say, “OK, this is an image of a cat”?

We can’t do that easily. So we apply the sigmoid to those values, and in return we get a value between 0 and 1 for each one. We can now interpret those new values as the probability that an image contains a cat or not. For example, the sigmoid output might be 0.85, so we can say the probability that the image contains a cat is 0.85.
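Using the raw scores from the example above, a quick sketch of that conversion might look like this:

```python
import numpy as np

raw_scores = np.array([-1.5, 0.3, 1.2])   # hypothetical model outputs
probs = 1 / (1 + np.exp(-raw_scores))     # sigmoid maps each into (0, 1)
print(probs)                              # roughly [0.18, 0.57, 0.77]
```

The raw scores were hard to interpret, but after the sigmoid we can read each result as “about an 18% / 57% / 77% chance this image is a cat.”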

But what if it was 0.55?

Now, depending on the problem we are solving and our model, 0.55 could mean a cat or not. We are programming our computer to figure that out. How does it know at what probability value the classification is correct?

That’s when we can define a threshold. We can say that 0.5 is our threshold. Anything below that is a dog. Anything above that is a cat.

That’s what sigmoid(x) > 0.5 is doing. But that 0.5 could be a different value. Since 8 of our 10 images are cats, a 0.5 threshold may not make our classification very good.

If we say sigmoid(x) > 0.75, would your approach of removing the middleman still hold? Well, no, because 0.5 is no longer our desired threshold. We can’t just set each value to 1 (or, in our simplified example, “cat”) if it’s greater than 0, because there might be a value greater than 0 whose sigmoid output is still less than 0.75. We don’t want that.
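To make that concrete, here is a sketch. Inverting the sigmoid shows that a 0.75 probability threshold corresponds to a cutoff of ln(0.75/0.25) = ln 3 ≈ 1.1 on the raw values, not 0, so comparing against zero no longer gives the right answer:

```python
import numpy as np

threshold = 0.75
# sigmoid(x) >= t  is equivalent to  x >= log(t / (1 - t))
raw_cutoff = np.log(threshold / (1 - threshold))
print(raw_cutoff)  # ~1.0986, not 0

x = np.array([0.5, 1.0, 1.5])        # all greater than 0
probs = 1 / (1 + np.exp(-x))
print(probs >= threshold)            # [False False  True]
```

All three raw values are positive, so the “skip the sigmoid, compare to 0” shortcut would call all of them cats, yet only one clears the 0.75 threshold.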

That’s why sigmoid is important. It acts as a proxy for probability values, which are simpler for us to utilize and interpret.


This shortcut only exists in the last layer because the threshold is only applied after the last sigmoid. For all intermediate layers, the unthresholded sigmoid output is passed to the next layer.
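A toy sketch of that point (the shapes and random weights here are made up purely for illustration): the hidden layer’s sigmoid outputs are kept as real values in (0, 1) and fed forward, and only the final layer’s output gets thresholded.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # 4 samples, 3 features
W1 = rng.normal(size=(3, 5))   # hidden-layer weights (hypothetical)
W2 = rng.normal(size=(5, 1))   # output-layer weights (hypothetical)

hidden = sigmoid(X @ W1)       # NOT thresholded: real values in (0, 1)
output = sigmoid(hidden @ W2)  # only this final output may be thresholded
predictions = (output >= 0.5).astype(int)
print(predictions.ravel())
```

If we thresholded `hidden` to 0s and 1s, the next layer would lose all the graded information the sigmoid preserves.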
By not thresholding directly, it also allows you to graft layers across different networks.


Wow, thank you for the very clear and understandable response. I genuinely understand now. Thank you!