Object Detection

R-CNN

Region Proposal (Selective Search)
Feature Extraction (CNN)
Classification (SVM)
Bounding-Box Regression

Drawbacks:

  • Very slow due to separate steps for region proposals, feature extraction, and classification.
  • Needs multiple models (CNN + SVM + Regressor).
  • Computationally expensive because each region is processed individually.

1. Generate Region Proposals

Around 1K to 2K candidate regions are generated from an image using the Selective Search method.

2. Feature Extraction

For each candidate region, features are extracted using a deep convolutional neural network.

  • Each region is first resized (warped) to 227 × 227 pixels, the fixed input size the CNN expects.

3. Classification

The extracted features are passed through multiple SVM classifiers (one per class) to determine if the region belongs to a particular object class.

  • Each candidate region yields a 4096-dimensional feature vector, so 2000 regions give a 2000 × 4096 feature matrix.

Suppose there are 20 categories:

  • For each category, we train a binary SVM classifier.

  • Stacking the 20 weight vectors gives a weight matrix of size 4096 × 20.

  • Multiplying the feature matrix by the weight matrix gives a 2000 × 20 score matrix.

  • Each row represents a specific candidate region, and each column represents a specific category.
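
The scoring step above amounts to a single matrix product. Below is a minimal sketch in plain Python, using toy sizes (3 regions, 4 features, 2 classes) in place of the real dimensions. The function name `score_regions` and all numbers are illustrative assumptions, and the per-class SVM bias term is left out for brevity.

```python
def score_regions(features, weights):
    """Multiply an (N x D) feature matrix by a (D x C) weight matrix.

    Row i of the result holds the C per-class SVM scores of region i.
    """
    n_classes = len(weights[0])
    return [
        [sum(f * w_row[c] for f, w_row in zip(feat, weights))
         for c in range(n_classes)]
        for feat in features
    ]


# Toy stand-ins: 3 regions, 4-dimensional features, 2 classes.
features = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
weights = [
    [0.5, -1.0],
    [0.0, 1.0],
    [0.25, 0.0],
    [0.0, 2.0],
]
scores = score_regions(features, weights)  # 3 x 2: rows = regions, columns = classes
```

In practice this product would be computed with an array library rather than nested loops; the loops here only make the row and column roles explicit.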

Non-Maximum Suppression (NMS)

IoU (Intersection over Union): the area where two boxes overlap divided by the area of their union.

NMS is applied separately for each category:

  1. Select the region with the highest score.
  2. Compute its IoU with every other remaining region, and remove any region whose IoU exceeds the threshold.
  3. Repeat with the next-highest remaining region until all regions are processed.
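
The NMS steps above can be sketched for a single category as follows. This is a minimal greedy version, not a tuned implementation: the `(x1, y1, x2, y2)` box format, the function names, and the 0.5 threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for one category; returns the indices of the kept boxes."""
    # Visit boxes from highest to lowest score.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
```

Here the second box overlaps the first heavily and is suppressed, while the third box does not overlap at all and survives. For the full detector, the same routine would run once per category on that category's column of scores.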

4. Bounding Box Refinement

A regressor is used to refine and improve the precision of the bounding box location.

After NMS, the surviving proposal boxes are refined further: 20 regressors, one per class, adjust the boxes assigned to their respective classes. The result is a corrected, highest-scoring bounding box for each class.
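
The notes do not spell out how a regressor's output changes a box. The sketch below assumes the offset parameterization from the R-CNN paper, where the regressor predicts a center shift in units of the box size plus a log-space scale for width and height. The function name `apply_box_deltas` and the sample numbers are illustrative.

```python
import math


def apply_box_deltas(box, deltas):
    """Refine a proposal (x1, y1, x2, y2) with predicted offsets (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    dx, dy, dw, dh = deltas
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    # Shift the center in units of box size; scale width and height in log space.
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# A 40 x 40 proposal shifted right by 10% of its width, size unchanged.
refined = apply_box_deltas((10.0, 10.0, 50.0, 50.0), (0.1, 0.0, 0.0, 0.0))
```

With all four offsets at zero the proposal is returned unchanged, which is why this parameterization is convenient to learn: the regressor only has to predict small corrections.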

Fast R-CNN

Improvements over R-CNN:

  • Processes the entire image once to generate a shared feature map.
  • Region proposals are projected onto the feature map instead of the original image.
  • Introduces ROI Pooling to extract fixed-size feature maps for each region.
  • Uses a single neural network to perform both classification and bounding box regression.

Advantages:

  • Much faster than R-CNN (reduced redundant feature extraction).
  • End-to-end training.

Limitations:

  • Still relies on an external Selective Search algorithm for region proposals, which is slow.

1. Generate Region Proposals

Generate 1K to 2K candidate regions for an image (using the Selective Search method).

2. Feature Extraction

Feed the whole image through the network once to obtain its feature map, then project the candidate boxes produced by Selective Search onto that feature map to obtain a feature matrix for each region.
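
Projecting a box onto the feature map comes down to dividing its image coordinates by the network's total downsampling factor. The sketch below assumes a stride of 16 (typical of a VGG-16 backbone, but not stated in these notes) and rounds outward so the projected region covers the whole box; real implementations differ in how they round. The name `project_box` is hypothetical.

```python
import math


def project_box(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map cell coordinates."""
    x1, y1, x2, y2 = box
    # Floor the top-left corner and ceil the bottom-right corner (rounds outward).
    return (math.floor(x1 / stride), math.floor(y1 / stride),
            math.ceil(x2 / stride), math.ceil(y2 / stride))


roi = project_box((64, 32, 200, 150))
```

The returned coordinates index cells of the shared feature map, with the bottom-right corner treated as exclusive.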

ROI Pooling

ROI Pooling is a key operation used in object detection networks like Fast R-CNN. It allows feature maps of varying sizes (corresponding to different Region of Interest proposals) to be converted into a fixed-size feature map so that the final classification and bounding box regression can work consistently.

How ROI Pooling Works

Input:

  • A feature map from the CNN.

  • A set of Region of Interest (ROI) proposals (bounding boxes) that specify areas to focus on.

Divide the ROI into Grids:

  • Each ROI is divided into a fixed number of grid cells (e.g., 7 × 7).

  • This grid size is predefined based on the downstream network architecture.

Max-Pooling Within Each Grid Cell:

  • For each grid cell, take the portion of the feature map corresponding to that grid.

  • Apply max pooling to compress this region into a single value.

  • This ensures the output for each ROI is of uniform size, regardless of the original ROI dimensions.

Output:

A fixed-size feature map (e.g., 7 × 7) for each ROI, ready for further processing like classification and bounding box regression.
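
The grid-and-max procedure above can be sketched in plain Python for a single-channel feature map. This is an illustrative version under stated assumptions: the ROI is given in feature-map cells with an exclusive bottom-right corner, it spans at least one cell in each direction, and cell boundaries use floor/ceil rounding, so neighbouring cells may share a row or column when the ROI size is not divisible by the grid size. The name `roi_max_pool` is hypothetical.

```python
def roi_max_pool(feature_map, roi, out_size=7):
    """Max-pool one ROI of a 2-D feature map into an out_size x out_size grid.

    feature_map: list of rows (single channel, for brevity).
    roi: (x1, y1, x2, y2) in feature-map cells, with x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    h, w = y2 - y1, x2 - x1
    pooled = []
    for gy in range(out_size):
        # Floor the start and ceil the end so every grid cell is non-empty.
        ys = y1 + (gy * h) // out_size
        ye = y1 + ((gy + 1) * h + out_size - 1) // out_size
        row = []
        for gx in range(out_size):
            xs = x1 + (gx * w) // out_size
            xe = x1 + ((gx + 1) * w + out_size - 1) // out_size
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# 4 x 4 feature map holding the values 0..15, pooled to a 2 x 2 grid.
fmap = [[4 * y + x for x in range(4)] for y in range(4)]
full = roi_max_pool(fmap, (0, 0, 4, 4), out_size=2)
part = roi_max_pool(fmap, (1, 0, 4, 3), out_size=2)
```

Both calls return a 2 × 2 grid even though the two ROIs differ in size, which is the property the downstream fully connected layers rely on. A real detector applies the same pooling independently to every channel.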

3. Classification

Pass each feature matrix through an ROI pooling layer to resize it to a fixed 7 × 7 feature map. Then flatten the feature map and pass it through a series of fully connected layers to obtain the predictions: class scores and bounding box offsets.

Faster R-CNN

Key Innovation:

  • Replaces the slow Selective Search with a Region Proposal Network (RPN).

How it works:

  • RPN shares convolutional layers with the main detection network to generate region proposals directly from the feature map.
  • The network simultaneously predicts object scores and refines bounding boxes.

Advantages:

  • Much faster than Fast R-CNN.
  • End-to-end trainable.

Use Cases:

  • Widely used in real-world applications such as autonomous vehicles, video surveillance, etc.

How do these networks differ, and how does each one speed up detection?

Region of Interest (ROI)

  • Definition: A Region of Interest (ROI) refers to a specific area in an image or video frame that is the focus of analysis. In object detection, the bounding box around an object is considered the ROI.

  • Usage:
    Cropping parts of images for further processing.
    Reducing computation by focusing only on the relevant area.