CLIP

Tags: Transformers, NLP, Computer vision
Paper: (Radford et al. 2021)

Architecture

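CLIP is an encoder-only model with two separate encoders: an image encoder, which is either a ViT or a ResNet (Radford et al. 2021 train both variants), and a Transformer text encoder. Both encoders project their inputs into a shared embedding space, and the model is trained contrastively so that matching image-text pairs have a high cosine similarity.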
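To make the two-encoder setup concrete, here is a minimal sketch in plain Python of what happens after the encoders have run: embeddings are L2-normalized, compared by cosine similarity, and scored with a symmetric cross-entropy loss. The encoders themselves are not implemented; the embedding vectors are made-up placeholders, the function names are hypothetical, and the fixed temperature of 0.07 is an assumption for illustration (in the paper the temperature is a learned parameter).

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_logits(image_embs, text_embs, temperature=0.07):
    """Return the image-by-text matrix of cosine similarities divided by temperature."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]


def cross_entropy(logits, targets):
    """Mean cross-entropy of each row against its target index (stable log-sum-exp)."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total / len(logits)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: image i should match text i, and text i should match image i."""
    logits = similarity_logits(image_embs, text_embs, temperature)
    targets = list(range(len(logits)))
    transposed = [list(col) for col in zip(*logits)]
    image_to_text = cross_entropy(logits, targets)
    text_to_image = cross_entropy(transposed, targets)
    return 0.5 * (image_to_text + text_to_image)


# Placeholder embeddings standing in for encoder outputs (hypothetical values).
image_embs = [[1.0, 0.1, 0.0], [0.0, 1.0, 0.2]]
text_embs = [[0.9, 0.0, 0.1], [0.1, 0.8, 0.0]]

logits = similarity_logits(image_embs, text_embs)
# For each image, the index of the most similar text.
best_text = [max(range(len(row)), key=row.__getitem__) for row in logits]
aligned_loss = contrastive_loss(image_embs, text_embs)
shuffled_loss = contrastive_loss(image_embs, list(reversed(text_embs)))
```

With these placeholder vectors each image is closest to the text at the same index, and pairing the texts in the wrong order gives a larger loss than the aligned pairing. The same similarity matrix is what zero-shot classification relies on: the candidate class prompts are encoded as texts and the best-scoring one is picked for each image.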

Bibliography

  1. Radford, A. et al. "Learning Transferable Visual Models From Natural Language Supervision." February 26, 2021.
