Inception score

The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative image model such as a generative adversarial network (GAN).[1] The score is calculated based on the output of a separate, pretrained Inceptionv3 image classification model applied to a sample of (typically around 30,000) images generated by the generative model. The Inception Score is maximized when the following conditions are true:

  1. The entropy of the distribution of labels predicted by the Inceptionv3 model for the generated images is minimized. In other words, the classification model confidently predicts a single label for each image. Intuitively, this corresponds to the desideratum of generated images being "sharp" or "distinct".
  2. The predictions of the classification model are evenly distributed across all possible labels. This corresponds to the desideratum that the output of the generative model is "diverse".[2]

It has been somewhat superseded by the related Fréchet inception distance.[3] While the Inception Score only evaluates the distribution of generated images, the FID compares the distribution of generated images with the distribution of a set of real images ("ground truth").

Definition

Let there be two spaces, the space of images $\Omega_X$ and the space of labels $\Omega_Y$. The space of labels is finite.

Let $p_{gen}$ be a probability distribution over $\Omega_X$ that we wish to judge.

Let a discriminator be a function of type

$$p_{dis}: \Omega_X \to M(\Omega_Y),$$

where $M(\Omega_Y)$ is the set of all probability distributions on $\Omega_Y$. For any image $x$ and any label $y$, let $p_{dis}(y \mid x)$ be the probability that image $x$ has label $y$, according to the discriminator. It is usually implemented as an Inception-v3 network trained on ImageNet.

The Inception Score of $p_{gen}$ relative to $p_{dis}$ is

$$IS(p_{gen}, p_{dis}) := \exp\left(\mathbb{E}_{x \sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot \mid x) \,\Big\|\, \int p_{dis}(\cdot \mid x')\, p_{gen}(x')\, dx'\right)\right]\right)$$

Equivalent rewrites include

$$\ln IS(p_{gen}, p_{dis}) = \mathbb{E}_{x \sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot \mid x) \,\big\|\, \mathbb{E}_{x' \sim p_{gen}}[p_{dis}(\cdot \mid x')]\right)\right]$$

$$\ln IS(p_{gen}, p_{dis}) = H\left[\mathbb{E}_{x \sim p_{gen}}[p_{dis}(\cdot \mid x)]\right] - \mathbb{E}_{x \sim p_{gen}}\left[H[p_{dis}(\cdot \mid x)]\right]$$

$\ln IS$ is nonnegative by Jensen's inequality.
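The equivalence of the KL-divergence and entropy forms of $\ln IS$ can be checked numerically. The sketch below (function names are illustrative, not any standard API) assumes the conditional label distributions $p_{dis}(\cdot \mid x_i)$ are already available as rows of an array:

```python
import numpy as np

def log_is_kl(p_yx):
    """ln IS via the KL form: E_x[ D_KL( p(.|x) || E_x'[p(.|x')] ) ].
    p_yx is an (n_images, n_labels) array whose rows are p_dis(.|x_i)."""
    p_y = p_yx.mean(axis=0)  # marginal label distribution
    kl = np.sum(p_yx * (np.log(p_yx) - np.log(p_y)), axis=1)
    return kl.mean()

def log_is_entropy(p_yx):
    """ln IS via the entropy form: H[E_x p(.|x)] - E_x[ H[p(.|x)] ]."""
    p_y = p_yx.mean(axis=0)
    h_marginal = -np.sum(p_y * np.log(p_y))
    h_conditional = -np.sum(p_yx * np.log(p_yx), axis=1).mean()
    return h_marginal - h_conditional

# Random conditional distributions stand in for classifier outputs.
rng = np.random.default_rng(0)
p_yx = rng.dirichlet(np.ones(10), size=500)
```

Both functions return the same value, and that value is nonnegative, as Jensen's inequality guarantees.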

Pseudocode:

  1. INPUT discriminator $p_{dis}$.
  2. INPUT generator $g$.
  3. Sample images $x_i$ from the generator.
  4. Compute $p_{dis}(\cdot \mid x_i)$, the probability distribution over labels conditional on image $x_i$.
  5. Average the results to obtain $\hat{p}$, an empirical estimate of $\int p_{dis}(\cdot \mid x)\, p_{gen}(x)\, dx$.
  6. Sample more images $x_i$ from the generator, and for each, compute $D_{KL}\left(p_{dis}(\cdot \mid x_i) \,\|\, \hat{p}\right)$.
  7. Average the results, and take the exponential.
  8. RETURN the result.
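The pseudocode can be sketched as follows. The `classifier` and `sample_images` callables are hypothetical placeholders, not a real API; in practice `classifier` would be a pretrained Inception-v3 network returning its softmax output:

```python
import numpy as np

def inception_score(classifier, sample_images, n_ref=5000, n_eval=5000):
    """Sketch of the IS pseudocode. `classifier(x)` is assumed to return
    p_dis(.|x) as a probability vector; `sample_images(n)` is assumed to
    draw n images from the generator. Both names are placeholders."""
    # Steps 3-5: estimate the marginal label distribution p_hat.
    ref = sample_images(n_ref)
    p_hat = np.mean([classifier(x) for x in ref], axis=0)
    # Step 6: fresh sample, per-image KL divergence against p_hat.
    kls = []
    for x in sample_images(n_eval):
        p = classifier(x)
        kls.append(np.sum(p * (np.log(p) - np.log(p_hat))))
    # Steps 7-8: exponentiate the average.
    return float(np.exp(np.mean(kls)))
```

Using a fresh sample for the KL step follows the pseudocode; in practice, implementations sometimes reuse a single sample for both estimates.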

Interpretation

A higher Inception Score is interpreted as "better": it means that $p_{gen}$ produces a "sharp" and "diverse" collection of images.

$\ln IS(p_{gen}, p_{dis}) \in [0, \ln N]$, where $N$ is the total number of possible labels.

$\ln IS(p_{gen}, p_{dis}) = 0$ if and only if, for almost all $x \sim p_{gen}$,

$$p_{dis}(\cdot \mid x) = \int p_{dis}(\cdot \mid x')\, p_{gen}(x')\, dx'.$$

This means that $p_{gen}$ is completely "indistinct": for any image $x$ sampled from $p_{gen}$, the discriminator returns exactly the same label distribution $p_{dis}(\cdot \mid x)$.

The highest Inception Score, $N$, is achieved if and only if both of the following conditions hold:

  • For almost all $x \sim p_{gen}$, the distribution $p_{dis}(y \mid x)$ is concentrated on one label, i.e. $H_y[p_{dis}(y \mid x)] = 0$. In other words, the discriminator classifies every image sampled from $p_{gen}$ with complete confidence.
  • For every label $y$, the proportion of generated images labelled $y$ is exactly $\mathbb{E}_{x \sim p_{gen}}[p_{dis}(y \mid x)] = \frac{1}{N}$. In other words, the generated images are distributed evenly over all labels.
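Both extremes can be verified on synthetic discriminator outputs. The helper below is a minimal illustration (not a standard library function), computing IS directly from an array of conditional label distributions:

```python
import numpy as np

def inception_score_from_probs(p_yx):
    """IS from an (n_images, N) array of conditional label distributions,
    via the entropy form of ln IS; eps guards log(0) for one-hot rows."""
    eps = 1e-12
    p_y = p_yx.mean(axis=0)
    h_marginal = -np.sum(p_y * np.log(p_y + eps))
    h_conditional = -np.sum(p_yx * np.log(p_yx + eps), axis=1).mean()
    return float(np.exp(h_marginal - h_conditional))

N = 10
# Completely "indistinct": every image gets the same (uniform) prediction.
indistinct = np.tile(np.full(N, 1.0 / N), (100, 1))
# Sharp and evenly spread: confident one-hot predictions over all N labels.
sharp_diverse = np.eye(N)[np.arange(100) % N]
```

The indistinct case yields a score of 1 (the minimum, since $\ln IS = 0$), while the sharp and evenly spread case yields the maximum score $N$.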

References

  1. ^ Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi (2016). "Improved Techniques for Training GANs". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc. arXiv:1606.03498.
  2. ^ Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas (December 2021). "Adversarial text-to-image synthesis: A review". Neural Networks. 144: 187–209. arXiv:2101.09983. doi:10.1016/j.neunet.2021.07.019. PMID 34500257. S2CID 231698782.
  3. ^ Borji, Ali (2022). "Pros and cons of GAN evaluation measures: New developments". Computer Vision and Image Understanding. 215: 103329. arXiv:2103.09396. doi:10.1016/j.cviu.2021.103329. S2CID 232257836.