
Uses CNN features that capture only the “style” or the “content” of an image, not exact pixel locations, and mixes them.


Content Loss

  • Content can be visualized directly from the CNN feature maps
    • Higher layers = high-level content
    • Lower layers = near-exact pixel values
  • Use gradient descent from a white-noise image to match the content representation
  • Notation

    \[\vec{p}:\text{original (content) image}\\ \vec{x}:\text{generated image}\\ P^l, F^l: \text{feature representations of } \vec{p} \text{ and } \vec{x} \text{ in layer }l\]
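  • Loss: squared-error loss between the two feature representations

    \[\mathcal{L}_{\text{content}}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left(F_{ij}^l - P_{ij}^l\right)^2\]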
  • Derivative of the loss

    \[\frac{\partial \mathcal{L}_{\text{content}}}{\partial F_{ij}^l} = \begin{cases} (F_{ij}^l - P_{ij}^l) & \text{if } F_{ij}^l > 0 \\ 0 & \text{if } F_{ij}^l < 0 \end{cases}\]
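A minimal NumPy sketch of the content loss and its derivative for a single layer. It assumes the feature maps are already extracted and flattened into matrices of shape (number of filters, number of positions); the function names `content_loss` and `content_loss_grad` are illustrative, not taken from any library.

```python
import numpy as np

def content_loss(F, P):
    """Squared-error content loss: 0.5 * sum_ij (F_ij - P_ij)^2.

    F: features of the generated image, shape (filters, positions)
    P: features of the original image, same shape
    """
    return 0.5 * np.sum((F - P) ** 2)

def content_loss_grad(F, P):
    """Derivative of the content loss with respect to F.

    Equals F - P where the activation F_ij is positive and 0 elsewhere,
    matching the piecewise derivative above.
    """
    return np.where(F > 0, F - P, 0.0)

# Tiny hand-checkable example (values chosen only for illustration).
F = np.array([[1.0, 0.0], [2.0, 3.0]])
P = np.array([[0.0, 1.0], [2.0, 1.0]])
loss = content_loss(F, P)
grad = content_loss_grad(F, P)
```

In the full method this gradient would be back-propagated through the CNN to the pixels of the generated image; the sketch stops at the feature level.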

Style Loss

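  • Style = correlation between the feature map channels of a layer
    Why use correlation? → Summing over spatial positions discards where each pixel is, so what remains is texture-like style information.
    • Represented by the Gram matrix G^l: G_{ij}^l is the inner product between vectorized feature maps i and j in layer l

    \[G_{ij}^l = \sum_k F_{ik}^l F_{jk}^l\]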
  • Use gradient descent from a white-noise image to match the style representation
  • Notation

    \[\vec{a}:\text{original (style) image}\\ \vec{x}:\text{generated image} \\ A^l, G^l:\text{style representations (Gram matrices) of } \vec{a} \text{ and } \vec{x} \text{ in layer }l\]
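  • Loss: mean-squared distance between the Gram matrix of the original image and the Gram matrix of the generated image
    • Contribution of layer l to the total style loss (N_l = number of feature maps, M_l = height × width of each feature map)

    \[E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left(G_{ij}^l - A_{ij}^l\right)^2\]
    • Total style loss (w_l = weighting factors)

    \[\mathcal{L}_{\text{style}}(\vec{a}, \vec{x}) = \sum_{l} w_l E_l\]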
  • Derivative of E_l

    \[\frac{\partial E_l}{\partial F_{ij}^l} =\begin{cases}    \frac{1}{N_l^2 M_l^2} \left( (F^l)^\top (G^l - A^l) \right)_{ji} & \text{if } F_{ij}^l > 0 \\    0 & \text{if } F_{ij}^l < 0\end{cases}\]
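The Gram matrix, the per-layer style term E_l, and its derivative can be sketched in NumPy as follows. The sketch assumes E_l carries the normalization 1/(4 N_l² M_l²), which is what the derivative above implies, and that features are stored as a matrix of shape (N_l, M_l). All function names are illustrative.

```python
import numpy as np

def gram_matrix(F):
    """Inner products between all pairs of vectorized feature maps.

    F has shape (N_l, M_l); the result has shape (N_l, N_l).
    """
    return F @ F.T

def style_layer_loss(F, A):
    """E_l = sum_ij (G_ij - A_ij)^2 / (4 * N_l^2 * M_l^2)."""
    N, M = F.shape
    G = gram_matrix(F)
    return np.sum((G - A) ** 2) / (4.0 * N**2 * M**2)

def style_layer_grad(F, A):
    """Derivative of E_l with respect to F.

    ((F^T (G - A))_ji equals ((G - A) F)_ij because G - A is symmetric.
    The result is zeroed wherever the activation F_ij is not positive.
    """
    N, M = F.shape
    G = gram_matrix(F)
    grad = ((G - A) @ F) / (N**2 * M**2)
    return np.where(F > 0, grad, 0.0)

# Tiny example: features of the generated image and a style target.
F = np.array([[1.0, 2.0], [3.0, 4.0]])
A = gram_matrix(np.eye(2))  # Gram matrix of hypothetical style features
loss = style_layer_loss(F, A)
grad = style_layer_grad(F, A)
```

Note that the Gram matrix has shape (N_l, N_l) regardless of the image size, which is why the style and generated images do not need matching spatial dimensions at this step.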

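Total Loss

  • Weighted sum of the content loss and the style loss (α, β = weighting factors for content and style)

    \[\mathcal{L}_{\text{total}}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{\text{content}}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{\text{style}}(\vec{a}, \vec{x})\]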
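A toy sketch of gradient descent on a combined loss, assuming the total loss is a weighted sum α·L_content + β·E_l over a single layer. To stay self-contained it optimizes the feature matrix F directly instead of back-propagating to image pixels through a CNN, so it illustrates the update rule only. The weights, learning rate, step count, and function names are illustrative assumptions.

```python
import numpy as np

def total_loss_and_grad(F, P, A, alpha, beta):
    """Return alpha * L_content + beta * E_l and its derivative w.r.t. F."""
    N, M = F.shape
    diff_c = F - P            # content residual
    diff_s = F @ F.T - A      # Gram-matrix residual
    loss = (alpha * 0.5 * np.sum(diff_c ** 2)
            + beta * np.sum(diff_s ** 2) / (4.0 * N**2 * M**2))
    grad = alpha * diff_c + beta * (diff_s @ F) / (N**2 * M**2)
    # Zero the derivative where the activation is not positive.
    return loss, np.where(F > 0, grad, 0.0)

def stylize_features(P, A, alpha=1.0, beta=1.0, lr=0.1, steps=200, seed=0):
    """Plain gradient descent starting from random (noise) features."""
    rng = np.random.default_rng(seed)
    F = rng.uniform(0.1, 1.0, size=P.shape)  # stand-in for white noise
    history = []
    for _ in range(steps):
        loss, grad = total_loss_and_grad(F, P, A, alpha, beta)
        history.append(loss)
        F = F - lr * grad
    return F, history

# Hypothetical content features P and style features S (3 filters, 4 positions).
rng = np.random.default_rng(1)
P = rng.uniform(0.1, 1.0, size=(3, 4))
S = rng.uniform(0.1, 1.0, size=(3, 4))
A = S @ S.T
F, history = stylize_features(P, A)
```

Raising `alpha` relative to `beta` pulls the result toward the content features, while raising `beta` pulls its Gram matrix toward the style target.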
