Paper Review 9: You Only Look Once: Unified, Real-Time Object Detection (YOLOv1)
Partition with Grids and predict bounding boxes per Grid
How to implement Grids? → ConvNet!
1. Architecture
- Divides the input image into an S x S grid; the grid cell containing an object's center is responsible for detecting that object
- Each grid cell predicts B bounding boxes and a confidence score for each box
\[Confidence=P_r(Object)*IOU^{truth}_{pred}\]
  - No object in the cell → confidence should be 0; otherwise → confidence should equal the IOU between the predicted box and the ground truth
- Each bounding box consists of 5 predictions: (x, y, w, h, confidence)
- Each grid cell predicts C conditional class probabilities
\[P_r(Class_i|Object)\]
  - Only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B
- At test time, multiply the conditional class probabilities by the individual box confidence predictions, which gives class-specific confidence scores for each box
\[P_r(Class_i|Object)*P_r(Object)*IOU^{truth}_{pred}=P_r(Class_i)*IOU^{truth}_{pred}\]
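The confidence and class-specific score definitions above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's code: the helper names `iou` and `class_specific_scores` and the (x_center, y_center, w, h) box format are assumptions made for the example.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, w, h) (assumed format)."""
    # Convert center/size to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Overlap is clipped at zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def class_specific_scores(class_probs, box_confidences):
    """Multiply P(Class_i | Object) by each box confidence.

    class_probs: shape (C,), shared by all boxes of one grid cell.
    box_confidences: shape (B,), one confidence per box.
    Returns an array of shape (B, C).
    """
    return np.outer(box_confidences, class_probs)
```

Because the class probabilities are shared per cell, every box in a cell gets the same class ranking; only the box confidence scales the scores.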
Each grid cell's output consists of B bounding boxes (confidence, x, y, h, w) and C class probabilities, so the full prediction is an S x S x (B*5 + C) tensor
\[(C_{b_1}, b_{1x}, b_{1y}, b_{1h}, b_{1w}, C_{b_2}, b_{2x}, b_{2y}, b_{2h}, b_{2w}, \dots, c_1, c_2, \dots, c_n)\]
Implementation
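A minimal NumPy sketch of how this per-cell layout could be unpacked. The function name `split_prediction` is made up for illustration, and the random tensor stands in for a real network output. S = 7, B = 2, C = 20 are the values the paper uses for PASCAL VOC. The channel order follows the layout written above (confidence first), which may differ from other implementations.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, classes (paper's PASCAL VOC setup)

def split_prediction(pred, S=S, B=B, C=C):
    """Split an (S, S, B*5 + C) prediction into its parts.

    Assumed per-cell layout: [conf, x, y, h, w] repeated B times,
    followed by C class probabilities.
    """
    assert pred.shape == (S, S, B * 5 + C)
    boxes = pred[..., : B * 5].reshape(S, S, B, 5)
    confidence = boxes[..., 0]          # shape (S, S, B)
    xyhw = boxes[..., 1:]               # shape (S, S, B, 4)
    class_probs = pred[..., B * 5 :]    # shape (S, S, C)
    return confidence, xyhw, class_probs

# Stand-in for a network output; a real model would produce this tensor.
pred = np.random.rand(S, S, B * 5 + C)
confidence, xyhw, class_probs = split_prediction(pred)

# Class-specific confidence score for every box: shape (S, S, B, C).
scores = confidence[..., :, None] * class_probs[..., None, :]
```

With the paper's settings the prediction has 7 x 7 x 30 values, and `scores` holds 7 x 7 x 2 = 98 candidate boxes, each with 20 class scores.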
2. Loss Function
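For reference, the multi-part sum-squared loss from the YOLOv1 paper can be written as follows. Here \(x, y, w, h\) are box coordinates, \(C\) is the box confidence, \(p(c)\) is a class probability, and hats distinguish predictions from ground-truth targets.

\[
\begin{aligned}
&\lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbf{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbf{1}_{i}^{obj} \sum_{c \in classes} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\]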
Meaning of λ
- Sum-squared error weights localization error equally with classification error → Not ideal
- Many grid cells don't contain any objects → Pushes their confidence scores toward zero, which can overpower the gradient from cells that do contain objects
- Remedy this with λ_coord and λ_noobj: increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects (the paper sets λ_coord = 5 and λ_noobj = 0.5)
Meaning of 1
\[\begin{aligned} \mathbf{1}_{i}^{obj} &= 1 \text{ if an object appears in cell } i, \text{ else } 0 \\ \mathbf{1}_{ij}^{obj} &= 1 \text{ if the } j\text{th bounding box predictor in cell } i \text{ is responsible for the prediction, else } 0 \end{aligned}\]
Characteristics
- Only penalizes classification error if an object is present in that grid cell
- Only penalizes bounding box coordinate error if that predictor is responsible for the ground truth box
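To make the roles of λ and the indicator masks concrete, here is a rough NumPy sketch of the loss. It is an illustration under stated assumptions, not the authors' implementation: the function name `yolo_loss`, the argument layout, and the precomputed masks are all hypothetical, and boxes are assumed to be (x, y, w, h) with non-negative w and h. The square roots on w and h follow the paper's loss.

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5  # values reported in the paper

def yolo_loss(pred_box, true_box, pred_conf, true_conf,
              pred_cls, true_cls, resp_mask, obj_mask):
    """Sum-squared YOLOv1-style loss (illustrative sketch).

    pred_box, true_box:   (S, S, B, 4) as (x, y, w, h), with w, h >= 0
    pred_conf, true_conf: (S, S, B)
    pred_cls, true_cls:   (S, S, C)
    resp_mask: (S, S, B), 1 where predictor j of cell i is responsible
    obj_mask:  (S, S),    1 where an object appears in cell i
    """
    noobj_mask = 1.0 - resp_mask

    # Coordinate error, counted only for responsible predictors.
    xy_err = np.sum((pred_box[..., :2] - true_box[..., :2]) ** 2, axis=-1)
    wh_err = np.sum(
        (np.sqrt(pred_box[..., 2:]) - np.sqrt(true_box[..., 2:])) ** 2, axis=-1
    )
    coord_loss = LAMBDA_COORD * np.sum(resp_mask * (xy_err + wh_err))

    # Confidence error, down-weighted for predictors without an object.
    conf_err = (pred_conf - true_conf) ** 2
    obj_conf_loss = np.sum(resp_mask * conf_err)
    noobj_conf_loss = LAMBDA_NOOBJ * np.sum(noobj_mask * conf_err)

    # Classification error, counted only for cells that contain an object.
    cls_loss = np.sum(obj_mask[..., None] * (pred_cls - true_cls) ** 2)

    return coord_loss + obj_conf_loss + noobj_conf_loss + cls_loss
```

With these weights, a unit coordinate error on a responsible box costs ten times as much as a unit confidence error on an empty box (5 versus 0.5).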
3. Advantages of YOLO
- Extremely fast because detection is framed as a single regression problem and doesn't need a complex pipeline
- Reasons globally about the image when making predictions
- Learns generalizable representations of objects
4. Limitations of YOLO
- Can predict only a limited number of nearby objects because each grid cell predicts only B boxes and one set of class probabilities
- Struggles to generalize to objects in new or unusual aspect ratios or configurations
- Loss function treats errors the same in small bounding boxes vs large bounding boxes
- A small error in a large box is generally benign, but a small error in a small box has a much greater effect on IOU
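A small worked example of the last point, as a sketch with made-up numbers: the same 5-pixel horizontal shift is applied to a 10 x 10 box and a 100 x 100 box. The helper `iou_shifted` is hypothetical and only handles the special case of a square box compared with a shifted copy of itself. The squared coordinate error is identical in both cases, while the IOU differs sharply. The second part shows how taking square roots of the width, as the paper's loss does, makes the same width error cost more for the small box, which only partially offsets the problem.

```python
import math

def iou_shifted(side, shift):
    """IoU between a square box and a copy of it shifted horizontally."""
    overlap = max(0.0, side - shift) * side
    union = 2 * side * side - overlap
    return overlap / union

shift = 5.0
squared_error = shift ** 2                       # same for both boxes: 25.0
iou_small = iou_shifted(side=10.0, shift=shift)   # about 0.333
iou_large = iou_shifted(side=100.0, shift=shift)  # about 0.905

# Same absolute width error (+5), compared through square roots.
sqrt_err_small = (math.sqrt(10.0 + 5.0) - math.sqrt(10.0)) ** 2
sqrt_err_large = (math.sqrt(100.0 + 5.0) - math.sqrt(100.0)) ** 2

print(squared_error, round(iou_small, 3), round(iou_large, 3))
print(sqrt_err_small > sqrt_err_large)
```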