date: 2024-12-07
title: "Object Detection"
status: DONE
author:
- AllenYGY
tags:
- NOTE
- CV
- ObjectDetection
- RCNN
- Fast-RCNN
- Faster-RCNN
publish: True
Object Detection
Region Proposal (Selective Search) -> Feature Extraction (CNN) -> Classification (SVM) + Bounding-Box Regression
Drawbacks:
Around 1K to 2K candidate regions are generated from an image using the Selective Search method.
For each candidate region, features are extracted using a deep convolutional neural network.
The extracted features are passed through multiple SVM classifiers (one per class) to determine if the region belongs to a particular object class.
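Suppose we have 20 categories.
For each category, we train a binary SVM classifier.
Stacking the 20 classifiers gives a weight matrix of size 4096x20, assuming the 4096-dimensional CNN features used in the original R-CNN.
Finally, multiplying the 2000x4096 feature matrix (one row per candidate region) by the weight matrix gives a 2000x20 score matrix.
Each row represents a specific candidate region, and each column represents a specific category.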
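The scoring step boils down to a single matrix product. The sketch below is a minimal illustration in plain Python, assuming toy sizes (3 regions, 4-dimensional features, 2 categories) in place of roughly 2000 regions, 4096-dimensional features, and 20 categories. The name `score_regions` is hypothetical, and the SVM bias terms are left out for brevity.

```python
def score_regions(features, weights):
    """Multiply an (N x d) feature matrix by a (d x C) SVM weight matrix.

    Returns an (N x C) score matrix: one row per candidate region,
    one column per category.
    """
    n_dims = len(weights)
    n_classes = len(weights[0])
    scores = []
    for feat in features:
        if len(feat) != n_dims:
            raise ValueError("feature length must match the number of weight rows")
        row = [sum(feat[k] * weights[k][c] for k in range(n_dims))
               for c in range(n_classes)]
        scores.append(row)
    return scores


# Toy sizes (assumed for illustration): 3 regions, 4-d features, 2 categories.
features = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
weights = [
    [0.5, -1.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [-0.5, 2.0],
]
scores = score_regions(features, weights)
```

Reading `scores[i][c]` gives the score of region `i` for category `c`, which is the layout the per-category NMS step consumes.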
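IoU (Intersection over Union): the area of overlap between two boxes divided by the area of their union.
For each category, we apply Non-Maximum Suppression (NMS): keep the highest-scoring box, discard the boxes whose IoU with it exceeds a threshold, and repeat on the remaining boxes.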
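A minimal sketch of IoU and greedy per-category NMS in plain Python. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the IoU threshold of 0.5 is an illustrative default rather than a value taken from these notes. The function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS for one category; returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


def nms_per_category(boxes, score_matrix, iou_threshold=0.5):
    """Run NMS independently on each column (category) of the score matrix."""
    n_classes = len(score_matrix[0])
    return [nms(boxes, [row[c] for row in score_matrix], iou_threshold)
            for c in range(n_classes)]


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
score_matrix = [[0.9, 0.1], [0.8, 0.95], [0.7, 0.2]]
kept = nms_per_category(boxes, score_matrix)
```

In this toy example the first two boxes overlap heavily, so each category keeps only the higher-scoring of the two plus the distant third box.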
A regressor is used to refine and improve the precision of the bounding box location.
The proposal boxes that survive Non-Maximum Suppression (NMS) are then refined further. Specifically, 20 regressors, one per class, adjust the boxes assigned to the corresponding class. This refinement yields the highest-scoring corrected bounding box for each class.
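To make the refinement concrete, here is a minimal sketch of how a class-specific regressor output could be applied to a proposal. It assumes the parametrization from the R-CNN paper: the center is shifted by offsets scaled with the box width and height, and the size is scaled in log space. The function name `apply_box_deltas` and the example numbers are hypothetical.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Refine a proposal (x1, y1, x2, y2) with predicted (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = proposal
    dx, dy, dw, dh = deltas
    w = x2 - x1
    h = y2 - y1
    cx = x1 + 0.5 * w
    cy = y1 + 0.5 * h
    # Shift the center proportionally to the proposal size.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    # Scale width and height in log space so they stay positive.
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)


# Shift the box right by 10% of its width and double its width.
refined = apply_box_deltas((10, 20, 50, 60), (0.1, 0.0, math.log(2.0), 0.0))
```

All-zero deltas leave the proposal unchanged, which is a handy sanity check when wiring a regressor into a pipeline.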
Improvements over R-CNN:
Advantages:
Limitations:
Generate 1K to 2K candidate regions for an image (using the Selective Search method).
Input the image into a network to obtain corresponding feature maps, and project the candidate bounding boxes generated by the Selective Search method onto the feature maps to get corresponding feature matrices.
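Projecting a proposal onto the feature map is essentially a division by the network's total stride. The sketch below assumes a stride of 16 (as with a VGG16-style backbone) and rounds outward so the projected box still covers the proposal. Both the stride and the rounding convention are assumptions, and `project_box` is a hypothetical helper name.

```python
import math


def project_box(box, stride=16):
    """Map an image-space box (x1, y1, x2, y2) to feature-map coordinates.

    The top-left corner is rounded down and the bottom-right corner is
    rounded up, so the projected region fully covers the original proposal.
    """
    x1, y1, x2, y2 = box
    return (math.floor(x1 / stride), math.floor(y1 / stride),
            math.ceil(x2 / stride), math.ceil(y2 / stride))


feature_box = project_box((32, 48, 200, 150))
```

Because the convolutional features are computed once for the whole image, each proposal only costs this coordinate mapping plus a pooling step.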
ROI Pooling is a key operation used in object detection networks like Fast R-CNN. It allows feature maps of varying sizes (corresponding to different Region of Interest proposals) to be converted into a fixed-size feature map so that the final classification and bounding box regression can work consistently.
How ROI Pooling Works
Input:
A feature map from the CNN.
A set of Region of Interest (ROI) proposals (bounding boxes) that specify areas to focus on.
Divide the ROI into Grids:
Each ROI is divided into a fixed number of grid cells (e.g., 7x7).
This grid size is predefined based on the downstream network architecture.
Max-Pooling Within Each Grid Cell:
For each grid cell, take the portion of the feature map corresponding to that grid.
Apply max pooling to compress this region into a single value.
This ensures the output for each ROI is of uniform size, regardless of the original ROI dimensions.
Output:
A fixed-size feature map (e.g., 7x7) for each ROI.
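The steps above can be sketched for a single-channel feature map in plain Python. This is a simplified illustration, not the implementation of any particular library: the ROI is assumed to be given in integer feature-map coordinates with exclusive bottom-right corner, and cell boundaries use a floor/ceil split so every grid cell is non-empty. The names `roi_max_pool` and `ceil_div` are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division for non-negative a and positive b."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, output_size=(7, 7)):
    """Max-pool one ROI of an (H x W) feature map to a fixed grid.

    feature_map: list of rows for a single channel.
    roi: (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    roi_h, roi_w = y2 - y1, x2 - x1
    if roi_h < 1 or roi_w < 1:
        raise ValueError("ROI must cover at least one feature-map cell")
    pooled = []
    for i in range(out_h):
        # Row range of grid cell i (floor for the start, ceil for the end).
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + ceil_div((i + 1) * roi_h, out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + ceil_div((j + 1) * roi_w, out_w)
            # Max over the part of the feature map inside this grid cell.
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# 4 x 6 toy feature map whose value at (r, c) is r * 6 + c.
feature_map = [[r * 6 + c for c in range(6)] for r in range(4)]
pooled_a = roi_max_pool(feature_map, (1, 0, 5, 4), output_size=(2, 2))
pooled_b = roi_max_pool(feature_map, (0, 0, 6, 3), output_size=(2, 2))
```

The two ROIs have different shapes (4x4 and 3x6 cells), yet both come out as 2x2 grids, which is what lets the fully connected layers accept them.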
Pass each feature matrix through an ROI pooling layer to resize it to a fixed 7x7 feature map. Then flatten the feature map and pass it through a series of fully connected layers to obtain the prediction results.
Key Innovation:
How it works:
Advantages:
Use Cases:
Definition: A Region of Interest (ROI) refers to a specific area in an image or video frame that is the focus of analysis. In object detection, the bounding box around an object is considered the ROI.
Usage:
Cropping parts of images for further processing.
Reducing computation by focusing only on the relevant area.
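As a small illustration of both usages, the sketch below crops an ROI from an image stored as a nested list of pixel rows. The `(x1, y1, x2, y2)` box convention with exclusive bottom-right corner and the name `crop_roi` are assumptions made for this example.

```python
def crop_roi(image, roi):
    """Return the sub-image inside roi = (x1, y1, x2, y2), x2/y2 exclusive."""
    x1, y1, x2, y2 = roi
    return [row[x1:x2] for row in image[y1:y2]]


# 4 x 5 toy grayscale image whose pixel at (r, c) is r * 5 + c.
image = [[r * 5 + c for c in range(5)] for r in range(4)]
patch = crop_roi(image, (1, 1, 4, 3))

# Later stages now touch 6 pixels instead of 20.
pixels_full = sum(len(row) for row in image)
pixels_roi = sum(len(row) for row in patch)
```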