CLIP

Architecture

CLIP is an encoder-only model with two separate encoders: an image encoder, which is either a Vision Transformer (ViT) or a ResNet, and a transformer text encoder.
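
The two-encoder layout can be illustrated with a minimal sketch. It assumes, as in the CLIP paper, that both encoders project into one shared embedding space where image and text vectors are compared by cosine similarity. The random linear maps below are hypothetical stand-ins for the trained ViT/ResNet and text transformer, and the names `EMBED_DIM`, `make_projection`, and `similarity_matrix` are illustrative only, not part of any CLIP library.

```python
import math
import random

EMBED_DIM = 8  # hypothetical size of the shared embedding space


def make_projection(in_dim, out_dim, rng):
    """Random linear map standing in for a trained encoder (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def project(weights, features):
    """Apply the linear map to one feature vector."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def similarity_matrix(image_embs, text_embs):
    """Cosine similarity of every image embedding with every text embedding."""
    return [[sum(a * b for a, b in zip(i, t)) for t in text_embs] for i in image_embs]


rng = random.Random(0)
image_encoder = make_projection(16, EMBED_DIM, rng)  # stand-in for ViT or ResNet
text_encoder = make_projection(12, EMBED_DIM, rng)   # stand-in for the text transformer

# Toy inputs: 3 "images" and 3 "captions" as raw feature vectors.
images = [[rng.random() for _ in range(16)] for _ in range(3)]
texts = [[rng.random() for _ in range(12)] for _ in range(3)]

image_embs = [l2_normalize(project(image_encoder, x)) for x in images]
text_embs = [l2_normalize(project(text_encoder, t)) for t in texts]

# sims[i][j] scores image i against caption j.
sims = similarity_matrix(image_embs, text_embs)
```

In a trained model, the highest entry in row `i` of `sims` would point to the caption that best matches image `i`; with the random stand-ins here the scores carry no meaning beyond showing the shapes involved.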

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. February 26, 2021.