Paper Review 11: Fast R-CNN

1. Motivation

R-CNN has various problems, which are

Training is a multi-stage pipeline
Training is expensive in space and time because features are extracted from each object proposal in each image and written to disk
Object detection is slow

2. Architecture

Input

Entire image
Set of object proposals

Network

Processes whole image with ConvNet and max pooling layers to produce a feature map
For each object proposal, RoI pooling layer extracts fixed-length feature vector from feature map
Each vector is fed into FC layer

Output

Softmax probability which estimates over K object classes + “background” class
4 real numbers which encodes bounding-box positions for one of the K classes

2-1. RoI pooling layer

RoI pooling layer uses max pooling to convert features inside any valid RoI into a small feature map with a fixed spatial extent of H x W

2-2. Init from pre-trained networks

Use pre-trained network with transformations below

Last max pooling layer is replaced by RoI pooling layer
Network’s last FC layer and softmax are replaced with Fast R-CNN’s output layer
Network is modified to take 2 data inputs

2-3 Fine-tuning: Multi-task loss

Each RoI is labeled with ground-class u and ground-truth bounding box regression target v

\[L(p,u,t^u,v)=L_{\text{cls}}(p,u)+λ[u\geq 1]L_{\text{loc}}(t^u,v)\]

L_cls = log loss for true class u
\[L_{\text{cls}}(p,u)=-\log{p_u}\]
L_loc = bounding box regression loss

L_loc is defined over a tuple of true bounding-box regression targets for class u and predicted tuple t^u.
\[v = (v_x, v_y, v_w, v_h)\\ t^u=(t_x^u, t_y^u, t_w^u, t_h^u)\] \[L_{\text{loc}}(t^u,v)=\sum_{i\in \{x,y,w,h\}} \text{smooth}_{L_1}(t_i^u-v_i) \\ \text{smooth}_{L_1}(x) = \begin{cases} 0.5x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise,} \end{cases}\]
1. λ : hyper-parameter that controls the balance between two losses
2. [u ≥ 1] : Because for background RoIs, there is no notion of ground-truth bounding box, L_loc should be ignored
3. smooth_L1 : Robust L1 loss that is less sensitive to outliers than L2 loss

2-4. Back-propagation through RoI pooling layers

\[x_i \in \mathbb{R} = \text{i-th activation input into RoI pooling layer}\\ y_{rj} = \text{layer's j-th output from r-th RoI}\]

RoI pooling layer computes

\[y_{rj}=x_{i^*(r,j)} \\ i^*(r,j)=\text{argmax}_{i'\in R(r,j)}x_{i'}R(r,j)\]

where i* is index set of inputs in the sub-window over which output unit y_rj max pools

Hence, backward propagation is computed with partial derivative of the loss function with respect to each input variable x_i by following the argmax switches

\[\frac{∂L}{∂x_i}=\sum_r\sum_j[i=i^*(r,j)]\frac{∂L}{∂y_{rj}}\]

3. Performing Detection

Perform after Fast R-CNN is fine-tuned

Input: image + R object proposals
For each test RoI r, perform Forward pass → probability distribution p + predicted bounding box offsets relative to r
Assign detection confidence to r for each object class k using estimated probability
\[P_r(\text{class}=k\ |\ r) = p_k\]
Perform non-maximum suppression

3-1. Truncated SVD

Large FC layers are easily accelerated by compressing them with truncated SVD

\[W ≈ U\Sigma_tV^T\]

Truncated SVD reduces param count from uv → t(u+v)

FC layer can be replaced by 2 FC layers without non-linearity between them

4. Advantage

Higher mAP
Training is single-stage using multi-task loss
Training can update all network layers
No disk storage is required for feature caching