
Bipolar Sigmoid Function in Neural Networks Explained

By Vishalini Devarajan

When you first learn neural networks, you quickly meet the sigmoid activation: a smooth curve that squashes inputs into (0, 1). It feels ideal: continuous, differentiable, and bounded. But there’s also a bipolar sigmoid that outputs values between -1 and 1, and that small change matters a lot during training.

Which one you pick affects how gradients flow backward through the network. Activations centered around zero (like the bipolar sigmoid) help gradients remain balanced and speed up convergence, while strictly positive outputs (like the standard sigmoid) can bias activations and slow or stall learning. Choosing the right activation can therefore mean the difference between a model that converges slowly or not at all and one that learns quickly.

In this article, we will walk through everything you need to understand about the bipolar sigmoid function, from its mathematical definition and core properties to how it compares with the standard sigmoid and ReLU, and when it makes sense to use it in your own machine learning projects.

Table of contents


  1. Quick TL;DR
  2. Understanding Activation Functions First
  3. The Standard Sigmoid vs. The Bipolar Sigmoid
  4. The Relationship Between Bipolar Sigmoid and Tanh
  5. The Zero-Centered Output: Why It Matters
  6. Continuous and Non-Linear: The S-Curve Advantage
  7. The Vanishing Gradient Problem
  8. Computational Cost
  9. Bipolar Sigmoid vs. Binary Sigmoid vs. ReLU
    • Comparison with ReLU
  10. When to Actually Use the Bipolar Sigmoid
  11. Wrapping Up
  12. FAQs
    • What is the bipolar sigmoid function?
    • How is it different from the standard sigmoid?
    • Why is zero-centering important?
    • What is the main drawback of bipolar sigmoid?
    • When should I use bipolar sigmoid?

Quick TL;DR

  • Bipolar sigmoid maps values to the range -1 to 1.
  • It is a rescaled tanh: the standard formula equals tanh(x/2).
  • Zero-centered outputs help training converge faster.
  • It can reduce gradient bias compared with standard sigmoid.
  • It still suffers from vanishing gradients in deep networks.
  • ReLU is usually better for very deep models, but bipolar sigmoid remains useful in RNNs and LSTMs.

What Is the Bipolar Sigmoid Function?

The bipolar sigmoid function is an S-shaped activation function used in neural networks that transforms any input value into an output ranging from -1 to 1. Unlike the standard sigmoid function, which outputs values between 0 and 1, the bipolar sigmoid is zero-centered, helping to produce more balanced gradients during backpropagation and often improving learning efficiency. It is the hyperbolic tangent function (tanh) with its input halved, f(x) = tanh(x/2), and is commonly used in deep learning applications where centered activations are beneficial.

Understanding Activation Functions First

Before diving into the bipolar sigmoid specifically, it helps to understand what activation functions do and why they matter so much in neural networks.

  1. Once a neuron in a neural network receives and aggregates input values from other neurons, it does not pass that raw sum directly to the next layer. Instead, it passes the sum through an activation function, which transforms it into the neuron’s actual output.
  2. In general, the output can be written as a non-linear function of the weighted sum of the inputs: output = f(sum of w_i * x_i).
  3. Without activation functions, no matter how many layers a neural network had, it would behave like a single linear equation and could not learn complex patterns.
  4. Activation functions introduce non-linearity into the network, which is what allows deep learning models to recognize images, understand language, and solve problems that no simple formula could handle.
  5. The choice of activation function, therefore, shapes how well the network can learn and how quickly it can converge during training.
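
To make the point about linearity concrete, here is a minimal sketch using scalar "layers". The weights and biases are hypothetical values chosen only for illustration; the idea is that two stacked linear layers reduce to one linear layer, while inserting a non-linear activation breaks that equivalence.

```python
import math

# Hypothetical weights and biases, chosen only for illustration.
W1, B1 = 2.0, -1.0
W2, B2 = -0.5, 3.0

def two_linear_layers(x):
    """Two stacked layers with no activation in between."""
    return W2 * (W1 * x + B1) + B2

def collapsed_layer(x):
    """The single linear layer that the stack above is equivalent to."""
    return (W2 * W1) * x + (W2 * B1 + B2)

def two_layers_with_tanh(x):
    """The same stack, but with a tanh activation after the first layer."""
    return W2 * math.tanh(W1 * x + B1) + B2

for x in (-2.0, 0.0, 0.5, 3.0):
    # Without an activation, the stack and the single layer always agree.
    assert math.isclose(two_linear_layers(x), collapsed_layer(x))
    print(x, two_linear_layers(x), two_layers_with_tanh(x))
```

With the activation in place, the output is no longer a straight line in x, so no single linear layer can reproduce it.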

The Standard Sigmoid vs. The Bipolar Sigmoid

  • The standard binary sigmoid function, the one most beginners encounter first, maps any input to a value strictly between 0 and 1. It is defined as f(x) = 1 / (1 + e^-x).
  • For any large positive input, the output approaches 1. For any large negative input, the output approaches 0. This makes it intuitive for representing probabilities.
  • The sigmoid function can be scaled and shifted to have any range of output values, depending on the problem. When the range is from -1 to 1, it is called a bipolar sigmoid.

The bipolar sigmoid is defined as:

f(x) = (1 - e^-x) / (1 + e^-x)

  • This formula looks very similar to the standard sigmoid, but the key difference is the output range. Instead of producing values between 0 and 1, the bipolar sigmoid produces values between -1 and 1. 
  • For large positive inputs, the function approaches +1. For large negative inputs, it approaches -1. At an input of zero, the output is exactly zero.

This might seem like a small difference, but as we will see in the next few sections, it has a significant impact on how well neural networks learn.
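
The two definitions can be sketched directly in Python. This is a minimal illustration using only the standard library; the function names `binary_sigmoid` and `bipolar_sigmoid` are our own, not taken from any framework.

```python
import math

def binary_sigmoid(x):
    """Standard sigmoid: 1 / (1 + e^-x), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    """Bipolar sigmoid: (1 - e^-x) / (1 + e^-x), output in (-1, 1)."""
    e = math.exp(-x)
    return (1.0 - e) / (1.0 + e)

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:6.1f}  binary={binary_sigmoid(x):.4f}  bipolar={bipolar_sigmoid(x):+.4f}")
```

Note that this direct form is only a sketch: `math.exp(-x)` raises `OverflowError` for very large negative inputs, so production code typically relies on a library implementation instead.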


The Relationship Between Bipolar Sigmoid and Tanh

One thing that surprises many beginners is that the bipolar sigmoid function is not just similar to the hyperbolic tangent; it is the same curve with the input rescaled. The tanh function is commonly written as:

tanh(x) = (e^x - e^-x) / (e^x + e^-x)

  • If you multiply the numerator and denominator of the bipolar sigmoid formula by e^(x/2), you arrive at tanh(x/2). Equivalently, tanh(x) is the bipolar sigmoid evaluated at 2x. Both produce an S-shaped curve that is zero-centered, meaning it outputs both positive and negative values, which helps neural networks converge faster during training.
  • In practice, when you use the tanh activation function in a framework like PyTorch or TensorFlow, you are using a bipolar sigmoid whose input is scaled by a factor of 2. The factor only changes how steep the curve is, and a network can absorb it into its weights, so the two names are often used interchangeably.
  • The term “bipolar sigmoid” emphasizes its relationship to the standard sigmoid and its bipolar output range. The term “tanh” emphasizes its mathematical origins as a hyperbolic function. Both descriptions are correct.
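
The exact relationship is easy to check numerically. Multiplying the numerator and denominator of (1 - e^-x) / (1 + e^-x) by e^(x/2) gives tanh(x/2), and the same function can also be written as 2 * sigmoid(x) - 1. The sketch below verifies these identities at a few arbitrary sample points; the helper names are our own.

```python
import math

def binary_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bipolar_sigmoid(x):
    e = math.exp(-x)
    return (1.0 - e) / (1.0 + e)

for x in (-4.0, -1.0, 0.0, 0.5, 3.0):
    b = bipolar_sigmoid(x)
    # Bipolar sigmoid is tanh with the input halved.
    assert math.isclose(b, math.tanh(x / 2.0), abs_tol=1e-12)
    # It is also the standard sigmoid stretched to (-1, 1).
    assert math.isclose(b, 2.0 * binary_sigmoid(x) - 1.0, abs_tol=1e-12)
    # Conversely, tanh(x) is the bipolar sigmoid evaluated at 2x.
    assert math.isclose(math.tanh(x), bipolar_sigmoid(2.0 * x), abs_tol=1e-12)

print("all identities hold at the sampled points")
```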

The Zero-Centered Output: Why It Matters

The most important property of the bipolar sigmoid is that its output is centered around zero. This is the defining advantage that separates it from the standard binary sigmoid, and it has real consequences for how quickly and effectively a neural network trains.

  • Because tanh outputs are symmetric around zero, gradient descent often converges faster: the weights in the subsequent layer are not all pushed in the same direction at each step, which avoids the zig-zag path in optimization.
  • To understand why this matters, consider what happens with the standard sigmoid. Because it only outputs values between 0 and 1, its outputs are always positive.
  • When these positive activations are fed into the next layer, the gradients with respect to all of the incoming weights of a given neuron are forced to share one sign: they are either all positive or all negative.
  • This creates the zig-zag effect during optimization, where the network has to keep correcting itself in alternating directions rather than moving smoothly toward the minimum.
  • When activations stay balanced around zero, your network converges faster. Weight updates do not get biased in one direction, and gradients flow more evenly through layers. Tanh’s output range between -1 and 1 also keeps activations bounded, so they cannot grow without limit from layer to layer.
  • The gradient of tanh(x) at zero is 1, four times the standard sigmoid’s 0.25. The bipolar sigmoid as defined above, tanh(x/2), has a slope of 0.5 at zero, twice the sigmoid’s. Either way, the zero-centered function produces larger gradient values near zero and therefore larger updates to the weights of the network.
  • If you want stronger gradients near zero and bigger learning steps, tanh is the better choice of the two. Its symmetry around zero is a second, separate reason it tends to converge faster.
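
The sign argument above can be illustrated with a single neuron. For a neuron computing z = sum(w_i * a_i), the gradient of the loss with respect to each weight is delta * a_i, where delta is the error signal arriving at the neuron. The pre-activations and the value of delta below are hypothetical numbers chosen only to show the effect.

```python
import math

def binary_sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weight_gradients(upstream_delta, activations):
    """dL/dw_i = delta * a_i for a neuron computing z = sum(w_i * a_i)."""
    return [upstream_delta * a for a in activations]

# Hypothetical pre-activations of the previous layer and error signal.
pre_activations = [-2.0, -0.5, 0.3, 1.7]
delta = -0.8

sigmoid_inputs = [binary_sigmoid(z) for z in pre_activations]  # all positive
tanh_inputs = [math.tanh(z) for z in pre_activations]          # mixed signs

print(weight_gradients(delta, sigmoid_inputs))  # every entry shares the sign of delta
print(weight_gradients(delta, tanh_inputs))     # entries have mixed signs
```

With sigmoid inputs, every weight of the neuron must move in the same direction on this step; with tanh inputs, individual weights can move in different directions.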

Continuous and Non-Linear: The S-Curve Advantage

  1. Smoothness and differentiability

The bipolar sigmoid is smooth and differentiable everywhere. That means no abrupt jumps or discontinuities, so gradients are well-defined for all inputs, a useful property during optimization because backpropagation relies on those derivatives.

  2. Non-saturating region around zero

Its S-shaped curve transitions gradually from -1 to +1, and near zero the slope is non‑zero and relatively large. Most neuron pre-activations during early training lie near this region, so the bipolar sigmoid provides a healthy learning signal where it matters most.

  3. Zero-centering benefits

Outputs are centered around zero, unlike the binary sigmoid. Zero-centered activations lead to more balanced weight updates (less zig-zagging) and help optimizers converge faster because the mean of signals passing through layers is closer to zero.

  4. Ability to represent negative activations

By producing negative as well as positive values, the bipolar sigmoid lets neurons express inhibitory effects directly. That increases representational flexibility compared with strictly positive activations, helping the network capture symmetric or sign-dependent patterns.
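
Because the function is smooth, its derivative has a simple closed form. For f(x) = (1 - e^-x) / (1 + e^-x), the derivative is f'(x) = 0.5 * (1 - f(x)^2). The sketch below compares that closed form against a numerical central difference; the helper names and the step size `h` are our own choices.

```python
import math

def bipolar_sigmoid(x):
    e = math.exp(-x)
    return (1.0 - e) / (1.0 + e)

def bipolar_sigmoid_grad(x):
    """Closed-form derivative: f'(x) = 0.5 * (1 - f(x)^2)."""
    f = bipolar_sigmoid(x)
    return 0.5 * (1.0 - f * f)

def central_difference(fn, x, h=1e-5):
    """Numerical slope estimate used only to sanity-check the closed form."""
    return (fn(x + h) - fn(x - h)) / (2.0 * h)

for x in (-6.0, -1.0, 0.0, 1.0, 6.0):
    analytic = bipolar_sigmoid_grad(x)
    numeric = central_difference(bipolar_sigmoid, x)
    assert abs(analytic - numeric) < 1e-8
    print(f"x={x:+.1f}  slope={analytic:.6f}")
```

The slope is largest at zero and shrinks toward zero as the input moves to either extreme, which is the flat part of the S-curve.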

The Vanishing Gradient Problem

  1. Why do vanishing gradients happen

Saturating activations like the bipolar sigmoid (and the standard sigmoid and tanh) have derivatives that shrink toward zero for large positive or large negative inputs. At the extremes of their S-shaped curve, the function becomes nearly flat, so its slope, the derivative, is close to zero.

  2. How backpropagation amplifies the problem

During backpropagation, gradients are propagated by multiplying derivatives layer by layer. If each layer’s activation derivative is small, the product across many layers becomes exponentially smaller. This is the core mechanism that turns small local derivatives into a globally vanishing gradient.

  3. Practical effect on deep networks

When gradients vanish, early layers receive almost no learning signal. For example, if an activation’s derivative is about 0.25, after 5 layers the gradient contribution is scaled by 0.25^5, roughly 0.001, producing an extremely small update for the first-layer weights. The network stops improving, not because it’s reached a good solution, but because the training signal has effectively disappeared.

  4. When this matters and what to do

This drawback makes bipolar sigmoid (and sigmoid/tanh) poor choices for very deep feedforward or convolutional networks. Use non-saturating activations like ReLU or its variants (Leaky ReLU, ELU) for deep architectures, or apply techniques such as careful initialization, batch normalization, or residual connections to mitigate vanishing gradients when you must use saturating activations.
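
The arithmetic behind the shrinking gradient can be sketched in a few lines. This is a simplified model that assumes every layer contributes the same activation derivative and ignores the weight matrices, which also scale the gradient in a real network.

```python
import math

def gradient_scale(per_layer_derivative, num_layers):
    """Product of identical activation derivatives across num_layers layers."""
    return per_layer_derivative ** num_layers

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Using 0.25 as the per-layer derivative, as in the example above.
for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  scale={gradient_scale(0.25, depth):.3e}")

# A saturated unit is far worse than one near zero.
print(tanh_grad(0.0), tanh_grad(3.0))
```

Even a handful of layers is enough to shrink the signal by orders of magnitude, and saturated units make each factor in the product much smaller still.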

Computational Cost

The bipolar sigmoid also carries a computational cost that is worth being aware of, particularly if you are building models that need to run quickly or on limited hardware.

  • Calculating the exponential function e^x is inherently more expensive than performing a simple arithmetic operation.
  • Every neuron using the bipolar sigmoid must compute this exponential for every forward pass, and then compute its derivative for every backward pass. In a large network with millions of neurons and thousands of training steps, this adds up significantly.
  • Tanh is computationally more expensive than ReLU. The Rectified Linear Unit, which simply returns the input if it is positive and zero if it is negative, requires only a comparison and a pass-through. 
  • No exponential computation is needed at all. This is one of the major practical reasons why ReLU replaced tanh as the default activation function for hidden layers in most modern deep learning architectures.
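
A rough way to see the difference in work per call is to time both functions. This is only a sketch: in pure Python the interpreter overhead dominates, so the measured gap says little about vectorized frameworks or hardware accelerators, and the numbers will vary between machines. The helper `time_activation` is our own.

```python
import math
import timeit

def relu(x):
    """ReLU: a comparison and a pass-through, no exponential."""
    return x if x > 0.0 else 0.0

def bipolar_sigmoid(x):
    """Bipolar sigmoid: requires one exponential per call."""
    e = math.exp(-x)
    return (1.0 - e) / (1.0 + e)

def time_activation(fn, inputs, repeats=5):
    """Best wall-clock time in seconds over several runs of fn on all inputs."""
    return min(timeit.repeat(lambda: [fn(x) for x in inputs], number=1, repeat=repeats))

if __name__ == "__main__":
    inputs = [i / 1000.0 - 5.0 for i in range(10_000)]
    print("relu   :", time_activation(relu, inputs))
    print("bipolar:", time_activation(bipolar_sigmoid, inputs))
```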

Bipolar Sigmoid vs. Binary Sigmoid vs. ReLU

  1. Comparison with binary sigmoid and tanh

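The binary sigmoid outputs values in (0, 1) and is not zero-centered, which can cause zig-zagging during optimization. Tanh is zero-centered and has a stronger gradient near zero, so it is preferred over the binary sigmoid for hidden and candidate-state activations. In LSTMs and GRUs the two work together: the gates use the binary sigmoid because they need values in (0, 1), while the candidate values use tanh.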

Use the binary sigmoid when you need a probability at the output layer; for hidden layers, the bipolar sigmoid (zero-centered) is a better choice than the binary sigmoid because it reduces optimization zig-zagging.

2. Comparison with ReLU

ReLU and its variants solve vanishing gradients for positive inputs and avoid saturation, giving more stable gradient flow and faster training in deep networks. For most large feedforward and convolutional networks, ReLU or Leaky ReLU are the standards because they handle deep architectures more effectively. 

The bipolar sigmoid (tanh) sits between the binary sigmoid and ReLU: better than the binary sigmoid for hidden layers due to being zero-centered, but less efficient than ReLU for very deep networks.

When to Actually Use the Bipolar Sigmoid

Given its limitations, knowing when the bipolar sigmoid is genuinely the right choice matters.

  • The tanh function is commonly used in the hidden layers of recurrent neural networks, LSTMs, and GRUs for natural language processing or time series tasks. Its range allows positive and negative activations, which is ideal for learning sequential dependencies. In these architectures, the bounded range of -1 to 1 is not just acceptable; it is useful. 
  • The LSTM architecture in particular was designed with tanh in mind, and changing the activation in these contexts can disrupt the carefully balanced gating mechanisms the architecture depends on.
  • In many scenarios, the tanh function is used in the hidden layers of neural networks. When data has both positive and negative values that need equal representation, tanh often performs better than sigmoid due to its centered range.
  • The bipolar sigmoid is also a reasonable choice for smaller networks where the vanishing gradient problem is less severe and for problems where symmetry in the output is meaningful.
  • If your input data is already normalized around zero, using an activation function centered at zero creates a natural alignment between the data distribution and the network’s internal representations.
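
To show where tanh sits inside a recurrent cell, here is a minimal sketch of one step of a scalar LSTM cell using the standard LSTM equations. The parameter values and the `params` layout are hypothetical and chosen only for illustration; real implementations use weight matrices and vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell.

    params maps each of "i", "f", "o", "g" to a tuple (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))     # input gate, in (0, 1)
    f = sigmoid(pre("f"))     # forget gate, in (0, 1)
    o = sigmoid(pre("o"))     # output gate, in (0, 1)
    g = math.tanh(pre("g"))   # candidate value, in (-1, 1)

    c = f * c_prev + i * g    # new cell state
    h = o * math.tanh(c)      # new hidden state, in (-1, 1)
    return h, c

# Hypothetical parameters, chosen only for illustration.
params = {"i": (0.5, 0.1, 0.0), "f": (0.3, -0.2, 1.0),
          "o": (0.8, 0.4, 0.0), "g": (1.2, -0.7, 0.0)}

h, c = 0.0, 0.0
for x in (1.0, -2.0, 0.5):
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

The gates use the standard sigmoid because they act as soft switches between 0 and 1, while tanh supplies the signed candidate value and keeps the hidden state within (-1, 1).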


Wrapping Up

The bipolar sigmoid function is a foundational concept in deep learning that bridges the gap between the standard binary sigmoid and the modern activation functions used today. 

By rescaling the output range from (0, 1) to (-1, 1), it removes the always-positive bias that makes the binary sigmoid slow to converge. Its zero-centered outputs produce more balanced gradient flow during backpropagation, which often translates into faster and more stable training.

The function is tanh with its input halved, so understanding one means you understand the other. Its main weaknesses are the vanishing gradient problem at extreme input values and its computational cost compared to simpler functions like ReLU. 

For hidden layers in deep feedforward networks, ReLU has largely taken over. But for recurrent architectures, LSTMs, GRUs, and problems where symmetric activations are meaningful, the bipolar sigmoid remains a relevant and practical choice that every machine learning practitioner should understand.

FAQs

1. What is the bipolar sigmoid function?

It is an S-shaped activation function that maps inputs to values between -1 and 1, and it is tanh with a rescaled input: f(x) = tanh(x/2).

2. How is it different from the standard sigmoid?

Standard sigmoid outputs values between 0 and 1, while bipolar sigmoid is zero-centered and outputs between -1 and 1.

3. Why is zero-centering important?

Zero-centered outputs help gradients flow more evenly, reduce zig-zagging during optimization, and can speed up training.

4. What is the main drawback of bipolar sigmoid?

It can suffer from vanishing gradients when inputs are very large or very small, especially in deep networks.


5. When should I use bipolar sigmoid?

It is useful in hidden layers of RNNs, LSTMs, GRUs, and smaller networks where symmetric activations matter.
