CV - Neural Network

Cross-entropy and its usage?

Cross-entropy loss is a commonly used loss function for classification tasks, particularly in multi-class problems. For a single example, the loss is defined as:

$$
L = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)
$$

where:

  • $y_k$ is the ground truth label for class $k$ (one-hot encoded),
  • $\hat{y}_k$ is the predicted probability for class $k$,
  • $K$ is the total number of classes.

Usage:

Multi-Class Classification: Cross-entropy is widely used in deep learning models for tasks like image classification (e.g., CNNs, transformers).

  • It is often paired with the softmax activation function in the output layer of the network.

Advantages

  • Encourages confident and correct predictions.
  • Can handle imbalanced data by incorporating class weights.
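
As a minimal NumPy sketch of the computation described above (the `class_weights` argument is an illustrative assumption for handling imbalanced data, not any particular library's API):

```python
import numpy as np

def cross_entropy(y_true, y_pred, class_weights=None, eps=1e-12):
    """Cross-entropy for one-hot labels.

    y_true: (N, K) one-hot ground truth
    y_pred: (N, K) predicted probabilities (e.g. softmax outputs)
    class_weights: optional (K,) per-class weights for imbalanced data
    """
    y_pred = np.clip(y_pred, eps, 1.0)        # avoid log(0)
    per_class = -y_true * np.log(y_pred)      # (N, K)
    if class_weights is not None:
        per_class = per_class * class_weights # weight each class term
    return per_class.sum(axis=1).mean()       # average over the batch

# Example: 2 samples, 3 classes
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(cross_entropy(y_true, y_pred))          # ~0.434
```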

How do we know if we are underfitting or overfitting?

Cross-validation: measure prediction error on held-out validation data and compare it with the training error (see the sketch after the lists below).

Underfitting

  • add more parameters (more features, more layers, etc.)

Overfitting

  • remove parameters
  • add regularizers
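
A small illustrative sketch of this diagnostic, comparing training and validation error (the threshold values are arbitrary assumptions chosen for the example):

```python
def diagnose(train_err, val_err, gap_tol=0.05, err_tol=0.10):
    """Rough heuristic: compare training and validation error."""
    if train_err > err_tol and val_err > err_tol:
        return "underfitting: both errors are high -> add capacity"
    if val_err - train_err > gap_tol:
        return "overfitting: low train error, high val error -> shrink the model or regularize"
    return "reasonable fit"

print(diagnose(train_err=0.02, val_err=0.25))   # overfitting
print(diagnose(train_err=0.30, val_err=0.32))   # underfitting
```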

Regularization
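
One common regularizer is L2 weight decay; dropout is another. A minimal PyTorch sketch (the model and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))

# L2 regularization via the optimizer's weight_decay term
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

# Dropout as an alternative/additional regularizer, inserted between layers
regularized = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10)
)
```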

What is used to add nonlinearity to a network? How do the options compare?

Nonlinearity in neural networks is introduced through activation functions, which are applied to each neuron's output and enable the network to learn complex patterns beyond linear mappings.

Common Activation Functions

| Activation Function | Formula | Range | Properties | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $(0, 1)$ | Smooth, differentiable | Useful for probabilistic output | Vanishing gradient problem; not zero-centered |
| Tanh | $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | $(-1, 1)$ | Smooth, zero-centered | Zero-centered output; better gradient flow | Still suffers from vanishing gradients |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | Sparse activation | Efficient; mitigates vanishing gradients | "Dead neurons" if weights drive inputs negative |
| Leaky ReLU | $\max(\alpha x, x)$, small fixed $\alpha$ (e.g. 0.01) | $(-\infty, \infty)$ | Allows a small gradient for negative inputs | Avoids the dead-neuron problem | Small fixed negative slope can slow learning |
| PReLU | $\max(\alpha x, x)$, learnable $\alpha$ | $(-\infty, \infty)$ | Learnable negative slope | Adaptive negative slope | Risk of overfitting due to extra parameters |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | $(0, 1)$, sums to 1 | Converts logits to probabilities | Used for classification tasks | Not for hidden layers; computationally expensive |
| Swish | $x \cdot \sigma(x)$ | $\approx [-0.28, \infty)$ | Smooth, differentiable | Improves training; no dead neurons | Computationally expensive |
| GELU | $x \cdot \Phi(x)$ | $\approx [-0.17, \infty)$ | Combines ReLU and sigmoid concepts; smooth, differentiable | Better for Transformer models | Slower than ReLU |
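
A minimal NumPy sketch of several of these activations, useful for checking the formulas and ranges above (the GELU here uses the common tanh approximation):

```python
import numpy as np

def sigmoid(x):                       # range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                          # range [0, inf)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):        # small fixed negative slope
    return np.where(x > 0, x, alpha * x)

def swish(x):                         # x * sigmoid(x)
    return x * sigmoid(x)

def gelu(x):                          # tanh approximation of x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):                       # probabilities along the last axis
    z = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, np.tanh, relu, leaky_relu, swish, gelu):
    print(f.__name__, np.round(f(x), 3))
print("softmax", np.round(softmax(x), 3))
```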

Comparison of Activation Functions

1. Vanishing Gradient Problem

  • Sigmoid and tanh squash large-magnitude inputs into a narrow output range, so their gradients become very small there, which hampers training in deep networks.
  • ReLU and its variants alleviate this issue by allowing gradients to flow for positive inputs.

2. Computational Efficiency

  • ReLU is computationally simple ($\max(0, x)$), making it efficient for large networks.
  • Swish and GELU are more computationally intensive.

3. Sparse Activation

  • ReLU and its variants deactivate neurons for negative inputs, improving computational efficiency.

4. Zero-Centered Outputs

  • Tanh is zero-centered, helping optimization by balancing gradients.
  • Sigmoid is not zero-centered, potentially slowing convergence.

5. Handling Negative Inputs

  • Leaky ReLU and PReLU handle negative inputs, avoiding dead neurons.
  • Sigmoid and Tanh output non-negative values, which may not be ideal in some cases.

6. Probabilistic Outputs

  • Softmax is used in classification tasks to produce probabilities for each class, typically in the output layer.

Choosing an Activation Function

Hidden Layers:

  • ReLU is the default choice for simplicity and efficiency.
  • Leaky ReLU or PReLU for avoiding dead neurons.
  • Swish or GELU for modern architectures like Transformers.

Output Layers:

  • Sigmoid for binary classification.
  • Softmax for multi-class classification.

Forward and backward propagation
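
A minimal NumPy sketch of one forward and backward pass through a one-hidden-layer network (ReLU hidden activation, mean-squared-error loss; all shapes and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features
y = rng.normal(size=(4, 2))          # targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

# Forward pass: compute outputs and cache intermediate values
z1 = x @ W1 + b1
h1 = np.maximum(0.0, z1)             # ReLU
y_hat = h1 @ W2 + b2
loss = ((y_hat - y) ** 2).mean()

# Backward pass: apply the chain rule layer by layer
d_yhat = 2.0 * (y_hat - y) / y.size  # dL/dy_hat
dW2 = h1.T @ d_yhat
db2 = d_yhat.sum(axis=0)
d_h1 = d_yhat @ W2.T
d_z1 = d_h1 * (z1 > 0)               # ReLU gradient
dW1 = x.T @ d_z1
db1 = d_z1.sum(axis=0)

# Gradient step
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss:", loss)
```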

How to make the network deeper?

Adding More Layers

How: Simply add more hidden layers between the input and output layers. This increases the depth of the neural network.

Why: More layers allow the network to learn more complex, hierarchical representations of data. Each additional layer can capture higher-order features, which makes the model capable of handling more abstract patterns. By increasing depth, the model's capacity to learn complex mappings between inputs and outputs improves, making it suitable for tasks like image recognition, language modeling, and other advanced problems.
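
As a sketch (assuming a simple fully connected PyTorch model; the layer sizes are arbitrary), making the network deeper is just a matter of inserting more hidden layers:

```python
import torch.nn as nn

shallow = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Deeper variant: more hidden layers between input and output
deep = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
```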

Residual Connections (Skip Connections)

How: In architectures like ResNet, residual connections are used where the input to a layer is added directly to its output, bypassing the transformation at that layer. This is often referred to as "skip connections."

Why: Residual connections address the vanishing gradient problem by allowing gradients to flow more easily during backpropagation, even in very deep networks. This makes it easier to train deep networks without worrying about gradients becoming too small to update the weights effectively. These connections also help the network maintain performance by allowing it to learn both the residual (new information) and the identity (previous knowledge) mapping.
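
A minimal sketch of a residual block in PyTorch following this idea (the channel count and identity shortcut are simplifying assumptions; real ResNet blocks also handle shape changes with a projection shortcut):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                              # skip connection
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + identity)         # add input back to the output

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)    # torch.Size([1, 64, 32, 32])
```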

Stacking Blocks of Layers

How: Instead of adding individual layers, a network can be built by stacking blocks of layers. For example, a convolutional block might consist of several convolutional layers followed by pooling layers. These blocks are repeated multiple times to form a deeper network.

Why: Stacking blocks of layers allows for more efficient learning. Each block can perform specific types of feature extraction (like edge detection in convolutional layers), and by stacking them, the network can learn progressively more abstract features. For example, early blocks may detect edges in images, while later blocks can recognize more complex shapes or objects. This modular approach also improves the reusability of network components, which makes training deeper networks more manageable.
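
A sketch of the block-stacking idea in PyTorch (the block definition, repetition count, and input size are illustrative assumptions):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One reusable block: two convolutions followed by pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

# Stack repeated blocks to build a deeper network
model = nn.Sequential(
    conv_block(3, 32),            # early blocks: low-level features (edges, textures)
    conv_block(32, 64),
    conv_block(64, 128),          # later blocks: more abstract features
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 10),   # assumes 32x32 inputs (e.g. CIFAR-10-sized images)
)

print(model(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 10])
```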