CV - Neural Network

Cross-entropy and its usage?

Cross-entropy loss is a commonly used loss function for classification tasks, particularly in multi-class problems. For a single example, the loss is defined as:

$$
L = -\sum_{k=1}^{K} y_k \log(\hat{y}_k)
$$

where:

  • $y_k$ is the ground truth label for class $k$ (one-hot encoded),
  • $\hat{y}_k$ is the predicted probability for class $k$,
  • $K$ is the total number of classes.

Usage:

Multi-Class Classification: Cross-entropy is widely used in deep learning models for tasks like image classification (e.g., CNNs, transformers).

  • It is often paired with the softmax activation function in the output layer of the network.

Advantages

  • Encourages confident and correct predictions.
  • Can handle imbalanced data by incorporating class weights.
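
As a minimal NumPy sketch of the computation described above (the `class_weights` argument is an illustrative assumption for handling imbalanced data, not any particular library's API):

```python
import numpy as np

def cross_entropy(y_true, y_pred, class_weights=None, eps=1e-12):
    """Cross-entropy for one-hot labels.

    y_true: (N, K) one-hot ground truth
    y_pred: (N, K) predicted probabilities (e.g. softmax outputs)
    class_weights: optional (K,) per-class weights for imbalanced data
    """
    y_pred = np.clip(y_pred, eps, 1.0)        # avoid log(0)
    per_class = -y_true * np.log(y_pred)      # (N, K)
    if class_weights is not None:
        per_class = per_class * class_weights # weight each class term
    return per_class.sum(axis=1).mean()       # average over the batch

# Example: 2 samples, 3 classes
y_true = np.array([[1, 0, 0], [0, 0, 1]])
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
print(cross_entropy(y_true, y_pred))          # ~0.434
```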

How do we know if we are underfitting or overfitting?

Cross-validation: measure prediction error on held-out validation data and compare it with the training error (see the sketch after the lists below).

Underfitting

  • add more parameters (more features, more layers, etc.)

Overfitting

  • remove parameters
  • add regularizers
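
A small illustrative sketch of this diagnostic, comparing training and validation error (the threshold values are arbitrary assumptions chosen for the example):

```python
def diagnose(train_err, val_err, gap_tol=0.05, err_tol=0.10):
    """Rough heuristic: compare training and validation error."""
    if train_err > err_tol and val_err > err_tol:
        return "underfitting: both errors are high -> add capacity"
    if val_err - train_err > gap_tol:
        return "overfitting: low train error, high val error -> shrink the model or regularize"
    return "reasonable fit"

print(diagnose(train_err=0.02, val_err=0.25))   # overfitting
print(diagnose(train_err=0.30, val_err=0.32))   # underfitting
```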

Regularization
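
One common regularizer is L2 weight decay; dropout is another. A minimal PyTorch sketch (the model and hyperparameters are illustrative assumptions):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10))

# L2 regularization via the optimizer's weight_decay term
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)

# Dropout as an alternative/additional regularizer, inserted between layers
regularized = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 10)
)
```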

What is used to add nonlinearity to a network? How do the options compare?

Nonlinearity in neural networks is introduced through activation functions, which are applied to each neuron's output and enable the network to learn complex patterns beyond linear mappings.

Common Activation Functions

| Activation Function | Formula | Range | Properties | Advantages | Disadvantages |
|---|---|---|---|---|---|
| Sigmoid | $\sigma(x) = \frac{1}{1 + e^{-x}}$ | $(0, 1)$ | Smooth, differentiable | Useful for probabilistic output | Vanishing gradient problem; not zero-centered |
| Tanh | $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$ | $(-1, 1)$ | Smooth, zero-centered | Zero-centered output; better gradient flow | Still suffers from vanishing gradients |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | Sparse activation | Efficient; mitigates vanishing gradients | "Dead neurons" if weights drive inputs negative |
| Leaky ReLU | $\max(\alpha x, x)$, small fixed $\alpha$ (e.g. 0.01) | $(-\infty, \infty)$ | Allows a small gradient for negative inputs | Avoids the dead-neuron problem | Small fixed negative slope can slow learning |
| PReLU | $\max(\alpha x, x)$, learnable $\alpha$ | $(-\infty, \infty)$ | Learnable negative slope | Adaptive negative slope | Risk of overfitting due to extra parameters |
| Softmax | $\frac{e^{x_i}}{\sum_j e^{x_j}}$ | $(0, 1)$, sums to 1 | Converts logits to probabilities | Used for classification tasks | Not for hidden layers; computationally expensive |
| Swish | $x \cdot \sigma(x)$ | $\approx [-0.28, \infty)$ | Smooth, differentiable | Improves training; no dead neurons | Computationally expensive |
| GELU | $x \cdot \Phi(x)$ | $\approx [-0.17, \infty)$ | Combines ReLU and sigmoid concepts; smooth, differentiable | Better for Transformer models | Slower than ReLU |
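
A minimal NumPy sketch of several of these activations, useful for checking the formulas and ranges above (the GELU here uses the common tanh approximation):

```python
import numpy as np

def sigmoid(x):                       # range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):                          # range [0, inf)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):        # small fixed negative slope
    return np.where(x > 0, x, alpha * x)

def swish(x):                         # x * sigmoid(x)
    return x * sigmoid(x)

def gelu(x):                          # tanh approximation of x * Phi(x)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):                       # probabilities along the last axis
    z = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, np.tanh, relu, leaky_relu, swish, gelu):
    print(f.__name__, np.round(f(x), 3))
print("softmax", np.round(softmax(x), 3))
```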

Comparison of Activation Functions

1. Vanishing Gradient Problem

  • Sigmoid and tanh squash large-magnitude inputs into a narrow output range, so their gradients become very small there, which hampers training in deep networks.
  • ReLU and its variants alleviate this issue by allowing gradients to flow for positive inputs.

2. Computational Efficiency

  • ReLU is computationally simple ($\max(0, x)$), making it efficient for large networks.
  • Swish and GELU are more computationally intensive.

3. Sparse Activation

  • ReLU and its variants deactivate neurons for negative inputs, improving computational efficiency.

4. Zero-Centered Outputs

  • Tanh is zero-centered, helping optimization by balancing gradients.
  • Sigmoid is not zero-centered, potentially slowing convergence.

5. Handling Negative Inputs

  • Leaky ReLU and PReLU handle negative inputs, avoiding dead neurons.
  • Sigmoid and Tanh output non-negative values, which may not be ideal in some cases.

6. Probabilistic Outputs

  • Softmax is used in classification tasks to produce probabilities for each class, typically in the output layer.

Choosing an Activation Function

Hidden Layers:

  • ReLU is the default choice for simplicity and efficiency.
  • Leaky ReLU or PReLU for avoiding dead neurons.
  • Swish or GELU for modern architectures like Transformers.

Output Layers:

  • Sigmoid for binary classification.
  • Softmax for multi-class classification.

Forward and backward propagation
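
A minimal NumPy sketch of one forward and backward pass through a one-hidden-layer network (ReLU hidden activation, mean-squared-error loss; all shapes and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features
y = rng.normal(size=(4, 2))          # targets
W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 2)), np.zeros(2)

# Forward pass: compute outputs and cache intermediate values
z1 = x @ W1 + b1
h1 = np.maximum(0.0, z1)             # ReLU
y_hat = h1 @ W2 + b2
loss = ((y_hat - y) ** 2).mean()

# Backward pass: apply the chain rule layer by layer
d_yhat = 2.0 * (y_hat - y) / y.size  # dL/dy_hat
dW2 = h1.T @ d_yhat
db2 = d_yhat.sum(axis=0)
d_h1 = d_yhat @ W2.T
d_z1 = d_h1 * (z1 > 0)               # ReLU gradient
dW1 = x.T @ d_z1
db1 = d_z1.sum(axis=0)

# Gradient step
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss:", loss)
```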

How to make the network deeper?

Adding More Layers

How: Simply add more hidden layers between the input and output layers. This increases the depth of the neural network.

Why: More layers allow the network to learn more complex, hierarchical representations of data. Each additional layer can capture higher-order features, which makes the model capable of handling more abstract patterns. By increasing depth, the model's capacity to learn complex mappings between inputs and outputs improves, making it suitable for tasks like image recognition, language modeling, and other advanced problems.
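
As a sketch (assuming a simple fully connected PyTorch model; the layer sizes are arbitrary), making the network deeper is just a matter of inserting more hidden layers:

```python
import torch.nn as nn

shallow = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

# Deeper variant: more hidden layers between input and output
deep = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 10),
)
```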

Residual Connections (Skip Connections)

How: In architectures like ResNet, residual connections are used where the input to a layer is added directly to its output, bypassing the transformation at that layer. This is often referred to as "skip connections."

Why: Residual connections address the vanishing gradient problem by allowing gradients to flow more easily during backpropagation, even in very deep networks. This makes it easier to train deep networks without worrying about gradients becoming too small to update the weights effectively. These connections also help the network maintain performance by allowing it to learn both the residual (new information) and the identity (previous knowledge) mapping.
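
A minimal sketch of a residual block in PyTorch following this idea (the channel count and identity shortcut are simplifying assumptions; real ResNet blocks also handle shape changes with a projection shortcut):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                              # skip connection
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + identity)         # add input back to the output

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)    # torch.Size([1, 64, 32, 32])
```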

Stacking Blocks of Layers

How: Instead of adding individual layers, a network can be built by stacking blocks of layers. For example, a convolutional block might consist of several convolutional layers followed by pooling layers. These blocks are repeated multiple times to form a deeper network.

Why: Stacking blocks of layers allows for more efficient learning. Each block can perform specific types of feature extraction (like edge detection in convolutional layers), and by stacking them, the network can learn progressively more abstract features. For example, early blocks may detect edges in images, while later blocks can recognize more complex shapes or objects. This modular approach also improves the reusability of network components, which makes training deeper networks more manageable.
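
A sketch of the block-stacking idea in PyTorch (the block definition, repetition count, and input size are illustrative assumptions):

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One reusable block: two convolutions followed by pooling."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
    )

# Stack repeated blocks to build a deeper network
model = nn.Sequential(
    conv_block(3, 32),            # early blocks: low-level features (edges, textures)
    conv_block(32, 64),
    conv_block(64, 128),          # later blocks: more abstract features
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 10),   # assumes 32x32 inputs (e.g. CIFAR-10-sized images)
)

print(model(torch.randn(1, 3, 32, 32)).shape)    # torch.Size([1, 10])
```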