Paper Review 9: You Only Look Once: Unified, Real-Time Object Detection (YOLOv1)

Partition with Grids and predict bounding boxes per Grid

How to implement Grids? → ConvNet!

1. Architecture

Divides input image into S x S grid
Each grid cell predicts B bounding boxes and confidence scores for those boxes
\[Confidence=P_r(Object)*IOU^{truth}_{pred}\]
- If no object exists in the cell, the confidence should be 0; otherwise it should equal the IOU between the predicted box and the ground truth
- Bounding box consists of 5 predictions: (x, y, w, h, confidence)
Each grid cell predicts C conditional class probabilities
\[P_r(Class_i|Object)\]
- Only predicts 1 set of class probabilities per grid cell, regardless of the number of boxes B
At test time, multiply conditional class probabilities and individual box confidence predictions, which gives class-specific confidence scores for each box
\[P_r(Class_i|Object)*P_r(Object)*IOU^{truth}_{pred}=P_r(Class_i)*IOU^{truth}_{pred}\]
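The IOU term in the confidence definition above can be sketched as a small function. This is a minimal illustration, not the paper's code: the names `iou` and `confidence_target` are hypothetical, and both boxes are assumed to be given as (x_center, y_center, width, height) in the same coordinate frame.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_center, y_center, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Convert from center format to corner coordinates.
    a_x1, a_y1, a_x2, a_y2 = ax - aw / 2, ay - ah / 2, ax + aw / 2, ay + ah / 2
    b_x1, b_y1, b_x2, b_y2 = bx - bw / 2, by - bh / 2, bx + bw / 2, by + bh / 2
    # Overlap along each axis, clamped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(a_x2, b_x2) - max(a_x1, b_x1))
    inter_h = max(0.0, min(a_y2, b_y2) - max(a_y1, b_y1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def confidence_target(object_present, pred_box, truth_box):
    """Training target for confidence: 0 without an object, otherwise the IOU."""
    return iou(pred_box, truth_box) if object_present else 0.0
```

For instance, two unit squares offset by half a width overlap in half their area, so `iou((0.5, 0.5, 1, 1), (1.0, 0.5, 1, 1))` evaluates to 0.5 / 1.5.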

Each grid cell's output consists of B bounding boxes (confidence, x, y, h, w) and C class probabilities, i.e. B*5 + C values per cell

\[(C_{b_1}, b_{1x}, b_{1y}, b_{1h}, b_{1w}, C_{b_2}, b_{2x}, b_{2y}, b_{2h}, b_{2w}, ...\ , c_1, c_2, ... \ ,c_n)\]
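To make the layout concrete, here is a rough NumPy sketch that splits one cell's vector into its parts and forms the class-specific confidence scores used at test time. The function name `decode_cell` is hypothetical, and the ordering (confidence, x, y, h, w per box, then class probabilities) simply follows the vector written above. With the paper's PASCAL VOC setting (S = 7, B = 2, C = 20) each cell holds 30 values, so the full output is a 7 x 7 x 30 tensor.

```python
import numpy as np


def decode_cell(cell, num_boxes, num_classes):
    """Split one grid cell's prediction vector of length B*5 + C."""
    cell = np.asarray(cell, dtype=float)
    expected = num_boxes * 5 + num_classes
    if cell.shape != (expected,):
        raise ValueError(f"expected a vector of length {expected}, got shape {cell.shape}")
    # First B*5 values: one row per box, ordered (confidence, x, y, h, w).
    boxes = cell[: num_boxes * 5].reshape(num_boxes, 5)
    confidence = boxes[:, 0]             # shape (B,)
    coords = boxes[:, 1:]                # shape (B, 4): x, y, h, w
    # Remaining C values: one shared set of conditional class probabilities.
    class_probs = cell[num_boxes * 5:]   # shape (C,)
    # Class-specific confidence per box: P(Class_i | Object) * box confidence.
    scores = confidence[:, None] * class_probs[None, :]   # shape (B, C)
    return coords, confidence, class_probs, scores
```

Because the class probabilities are shared, every box in a cell ranks the classes in the same order; only the box confidence scales the scores.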

Implementation

2. Loss Function

Meaning of λ

Sum-squared error weights localization error equally with classification error → Not ideal
Many grid cells don’t contain any objects → Pushes their confidence scores towards zero, which can overpower the gradient from cells that do contain objects
Remedy this with λ_coord and λ_noobj to increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects (the paper sets λ_coord = 5 and λ_noobj = 0.5).
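For reference, the multi-part loss as given in the YOLOv1 paper, showing where the two λ weights enter. Note that in this equation C_i denotes the confidence of a box (not the number of classes), hatted symbols are predictions, and p_i(c) is the class probability for class c in cell i.

\[
\begin{aligned}
\mathcal{L} ={} & \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
& + \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbf{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\]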

Meaning of 1

\[\begin{aligned} \mathbf{1}_{i}^{obj} &= 1 \text{ if an object appears in cell } i \text{, else } 0 \\ \mathbf{1}_{ij}^{obj} &= 1 \text{ if the } j\text{th bounding box predictor in cell } i \text{ is responsible for that prediction, else } 0 \end{aligned}\]

Characteristics

Only penalizes classification error if an object is present in that grid cell
Only penalizes bounding box coordinate error if that predictor is responsible for the ground truth box
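The two masking rules above, together with the λ weights, can be sketched in NumPy as follows. This is an illustrative sketch under several assumptions rather than the authors' implementation: the function name, argument names, and array layout (boxes ordered x, y, w, h, confidence) are hypothetical; the no-object mask is taken to be the complement of the responsibility mask; and widths and heights are clipped at zero before the square root so that unused predictors cannot produce NaNs.

```python
import numpy as np


def yolo_v1_loss(pred_boxes, pred_classes, true_boxes, true_classes,
                 responsible, cell_has_obj, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLOv1-style loss.

    pred_boxes, true_boxes:     (S, S, B, 5), ordered (x, y, w, h, confidence)
    pred_classes, true_classes: (S, S, C)
    responsible:                (S, S, B) 0/1 mask, predictor j in cell i is responsible
    cell_has_obj:               (S, S) 0/1 mask, an object appears in cell i
    """
    resp = np.asarray(responsible, dtype=float)
    has_obj = np.asarray(cell_has_obj, dtype=float)
    noobj = 1.0 - resp  # assumption: every non-responsible predictor counts as no-object

    # Coordinate error, with square roots on w and h (clipped to stay real-valued).
    xy_err = ((pred_boxes[..., 0:2] - true_boxes[..., 0:2]) ** 2).sum(axis=-1)
    sqrt_pred = np.sqrt(np.clip(pred_boxes[..., 2:4], 0.0, None))
    sqrt_true = np.sqrt(np.clip(true_boxes[..., 2:4], 0.0, None))
    wh_err = ((sqrt_pred - sqrt_true) ** 2).sum(axis=-1)

    conf_err = (pred_boxes[..., 4] - true_boxes[..., 4]) ** 2
    cls_err = ((pred_classes - true_classes) ** 2).sum(axis=-1)

    loss = (lambda_coord * (resp * (xy_err + wh_err)).sum()   # only responsible boxes
            + (resp * conf_err).sum()                         # confidence, object present
            + lambda_noobj * (noobj * conf_err).sum()         # confidence, no object
            + (has_obj * cls_err).sum())                      # classes, only cells with objects
    return float(loss)
```

The default values 5.0 and 0.5 are the λ_coord and λ_noobj settings reported in the paper; how the `responsible` mask is built (the predictor with the highest current IOU against the ground truth) is left outside this sketch.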

3. Advantages of YOLO

Extremely fast because detection is framed as a single regression problem and doesn’t need a complex pipeline
Reasons globally about the image when making predictions
Learns generalizable representations of objects

4. Limitations of YOLO

Can only predict a limited number of nearby objects because each grid cell predicts only B boxes and one set of class probabilities.
Struggles to generalize to objects in new or unusual aspect ratios or configurations
Loss function treats errors the same in small bounding boxes vs large bounding boxes
- Small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU
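A small numeric sketch of the last point: the same absolute width error is applied to a large and a small box, and the resulting IOU and squared errors are compared. The helper names are hypothetical, and the boxes are assumed to share a center and height so that only the width differs. The square-root variant is included because the paper predicts the square root of width and height to partially address this issue.

```python
import math


def concentric_iou(w_true, w_pred, h=0.5):
    """IOU of two boxes with the same center and height that differ only in width."""
    inter = min(w_true, w_pred) * h
    union = max(w_true, w_pred) * h
    return inter / union


def raw_sq_err(w_true, w_pred):
    """Squared error on the raw width."""
    return (w_pred - w_true) ** 2


def sqrt_sq_err(w_true, w_pred):
    """Squared error on the square root of the width."""
    return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2


for name, w in [("large", 0.8), ("small", 0.1)]:
    w_pred = w + 0.05  # identical absolute error for both boxes
    print(name,
          "IOU:", round(concentric_iou(w, w_pred), 3),
          "raw:", round(raw_sq_err(w, w_pred), 5),
          "sqrt:", round(sqrt_sq_err(w, w_pred), 5))
```

The raw squared error is identical for both boxes, yet the IOU falls to roughly 0.94 for the large box and roughly 0.67 for the small one. The square-root error is larger for the small box, which moves the loss in the right direction without fully closing the gap.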