CLIP: Connecting Text and Images

Paper: Learning Transferable Visual Models From Natural Language Supervision

General Terms:

  1. Contrastive Learning: It is based on the intuition that you can contrast (differentiate) between similar and dissimilar things. In a machine learning model, we formulate this as the task of telling similar and dissimilar things apart, i.e. the model should be able to distinguish similar images from dissimilar ones.

  2. Zero-shot Learning: Zero-shot learning is a problem setting where, at test time, the model/learner aims to recognize objects whose instances may not have been seen during training. To learn about the various works in this space (at least until 2018), this paper provides good details: Zero-Shot Learning - A Comprehensive Evaluation of the Good, the Bad and the Ugly.

The paper describes an approach that takes a large dataset of (image, text) pairs and learns a model that scores whether an image and a text could co-occur.

Question: How do we do classification zero-shot, i.e. without any training on the task itself?

Suppose you are given a classification task with some images and a set of candidate labels, and you are evaluated on your predictions. You embed every text label into a vector and every image into a vector, then compare the similarity scores (dot products of the normalized embeddings) of each image with each label. For a given image, the label whose pairing gives the highest score is the predicted label.
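To make the scoring step concrete, here is a minimal sketch in plain Python. It assumes the embeddings have already been produced by some image and text encoder; the toy vectors and the function names (`normalize`, `zero_shot_predict`) are hypothetical and are not taken from the paper or any CLIP library.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_predict(image_embedding, label_embeddings):
    """Return the label whose text embedding is most similar to the image embedding."""
    image = normalize(image_embedding)
    scores = {}
    for label, text_embedding in label_embeddings.items():
        text = normalize(text_embedding)
        scores[label] = sum(a * b for a, b in zip(image, text))
    best = max(scores, key=scores.get)
    return best, scores

# Toy, hand-made embeddings (hypothetical values, not produced by CLIP).
label_embeddings = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.3, 0.1]

prediction, scores = zero_shot_predict(image_embedding, label_embeddings)
print(prediction)
```

No weights are updated anywhere in this procedure: the labels only enter as text to be embedded, which is what lets the same model be reused on a new label set.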

Fig 1. Photo via OpenAI paper

In Fig.1, $N$ is the number of (image, text) pairs in a batch. $I_i$ is the encoding of the $i$-th image and $T_i$ is the encoding of the entire $i$-th text string.

Why is it zero-shot? Here, zero-shot refers to the number of labels we see for training on the evaluation task. In this architecture, you do not train for a particular evaluation task. You take the trained weights (and give some bias). This notion of zero-shot is in line with the GPT notion of zero-shot: you do a lot of pre-training, and you don’t care what happens at the pre-training stage.


Dataset:

  • Called the WIT dataset (WebImageText)
  • 400 million (image, text) pairs collected by searching over text queries
  • 500,000 text queries:
    • All words occurring at least 100 times in the English version of Wikipedia + WordNet synsets + some details
  • Capped at a maximum of 20,000 (image, text) pairs per query
  • Total wordcount similar to WebText dataset used by GPT-2
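
The per-query cap can be illustrated with a short sketch. This is only an assumption about how such a cap might be applied; the function `build_capped_dataset` and the toy search results are hypothetical and do not describe the authors' actual pipeline.

```python
from collections import defaultdict

MAX_PAIRS_PER_QUERY = 20000  # cap per query, as listed above

def build_capped_dataset(search_results, cap=MAX_PAIRS_PER_QUERY):
    """Keep at most `cap` (image, text) pairs for each query."""
    counts = defaultdict(int)
    dataset = []
    for query, image_id, text in search_results:
        if counts[query] < cap:
            counts[query] += 1
            dataset.append((image_id, text))
    return dataset

# Toy search results (hypothetical): (query, image id, paired text).
toy_results = [
    ("dog", "img_001", "a dog playing fetch"),
    ("dog", "img_002", "my puppy at the beach"),
    ("dog", "img_003", "golden retriever portrait"),
    ("cat", "img_004", "cat sleeping on a sofa"),
]

# Use a tiny cap here so the effect is visible on four rows.
capped = build_capped_dataset(toy_results, cap=2)
print(len(capped))
```

Capping keeps very common queries from dominating the dataset, so rarer queries still contribute a meaningful share of pairs.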

Note: There is a difference between queries and labels. For example, when you search for the query ‘dog’ on Google Images, it returns all sorts of images of dogs, and each such image comes with an associated (paired) text. These texts can be the alt text or the title of the page.

Fig 2. Photo via OpenAI paper

Fig.2 shows the methods the authors compared to find out which pre-training objective works best. There are various training methods that connect images and language. The most vanilla version is captioning (language modelling conditioned on images). In this paper, the authors try a variety of pre-training methods. First, they tried captioning and found that it is not very compute efficient. Second, they tried predicting a bag of words (BoW) of the caption and found that it is more compute efficient (about 3x in Fig.2). However, the best method they found was contrastive learning, which improves efficiency by roughly another 4x.

Model architecture:

Image Encoder

  • ResNet
    • Modifications: anti-aliased blur pooling; final global average pooling replaced by attention pooling (QKV attention)
    • Scaling model size by allocating compute equally to width, depth, and input resolution
  • OR Vision Transformer

Text Encoder

  • Transformer text encoder
    • Scaling model size by width only; depth is not scaled

Other parameters
  • $N \times N$ affinity matrix. Symmetric contrastive loss.
  • Temperature $t$ is initialized to 0.07 and learned during training
  • Adam with decoupled weight decay (AdamW). Cosine learning-rate schedule.
  • Batch size 32768 (this is huge!)
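
The $N \times N$ affinity matrix and the symmetric contrastive loss can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the authors' implementation: the temperature is a fixed constant here (in the paper it is learned), the embeddings are toy values, and the function names are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy losses."""
    images = [l2_normalize(v) for v in image_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    # N x N affinity matrix: entry (i, j) scores image i against text j.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    loss_images = cross_entropy_rows(logits)      # each image picks its text
    loss_texts = cross_entropy_rows(transposed)   # each text picks its image
    return (loss_images + loss_texts) / 2

# Toy batch with N = 2 (hypothetical embeddings).
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(images, matched_texts)
loss_shuffled = symmetric_contrastive_loss(images, shuffled_texts)
print(loss_matched, loss_shuffled)
```

The loss is small when the diagonal of the affinity matrix (the true pairs) dominates each row and column, and large when the pairs are mismatched. A large batch size means each pair is contrasted against many negatives.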

Prompt Engineering:

From Fig 3.a, we can see that, in addition to raw label texts, the authors also experimented with prompts, i.e. you can embed the label itself, but you can also embed a string with a customized prompt ahead of the label, e.g. ‘a photo of ’ or ‘a bag of ’. Fig 3.b shows that prompts work better than raw labels (+15%).
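
As a small illustration of what "a customized prompt ahead of the label" means in practice, the sketch below builds prompt strings from labels. The templates and the helper `build_prompts` are hypothetical examples; the resulting strings would then be passed to the text encoder in place of the raw labels.

```python
def build_prompts(labels, templates):
    """Map each label to the list of prompt strings built from the templates."""
    return {label: [t.format(label) for t in templates] for label in labels}

# Hypothetical templates in the spirit of 'a photo of ...'.
templates = ["a photo of a {}.", "a close-up photo of a {}."]
labels = ["dog", "cat"]

prompts = build_prompts(labels, templates)
for label, texts in prompts.items():
    print(label, texts)
```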


Fig 4. Photo via OpenAI paper (Appendix section)

For ImageNet (1.2 million images), a vanilla ResNet-50 gives an accuracy of 56.3% [1] and zero-shot CLIP gives 59.6% (see the last column of the first row in Fig.4). The model (L/14-336px) that gives classification accuracy close to ResNet-50 uses a lot more compute (see the last column of the last row in Fig.4). L/14-336px means the Large Vision Transformer with patch size 14 and input resolution 336px. Hence, to surpass a supervised ResNet, you need a much bigger model.
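
One way to see why the 336px model costs more compute is to count the patches the Vision Transformer has to process. The sketch below assumes non-overlapping square patches, as in the standard Vision Transformer; the helper `num_patches` is a hypothetical name.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

patches_224 = num_patches(224, 14)  # 16 x 16 patches
patches_336 = num_patches(336, 14)  # 24 x 24 patches
print(patches_224, patches_336)
```

Going from 224px to 336px input at patch size 14 raises the sequence length from 256 to 576 patches, and self-attention cost grows faster than linearly with sequence length.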

[1] :

Acknowledgment: Thanks to the discussion in the TTIC reading group, which introduced me to this paper.

Srishti Yadav
ML Researcher

My research interests include applying computationally intensive machine learning algorithms to image- or text-based data.