- tags
- Transformers, NLP, Computer vision
- paper
- (Radford et al. 2021)
Architecture
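It is a dual-encoder model: an image encoder (either a ViT or a ResNet, depending on the variant) and a transformer text encoder, each mapping its input to an embedding vector so that images and text can be compared in a shared space.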
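As a minimal sketch of how the outputs of the two encoders can be compared, the snippet below matches images to texts by cosine similarity. The encoders themselves are not implemented here; the embedding vectors are made-up toy values, and the function names (`l2_normalize`, `similarity_matrix`, `best_text_for_each_image`) are hypothetical helpers chosen for illustration, not part of any released API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs):
    """Return cosine similarities: rows are images, columns are texts."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def best_text_for_each_image(image_embs, text_embs):
    """For each image, pick the index of the most similar text."""
    sims = similarity_matrix(image_embs, text_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


# Toy embeddings standing in for the outputs of the image and text encoders
# (assumed values, not real model outputs).
image_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embs = [[0.1, 0.1, 1.0], [1.0, 0.0, 0.1]]

matches = best_text_for_each_image(image_embs, text_embs)
print(matches)
```

In this toy setup, each image is paired with whichever text embedding points in the most similar direction, which is the basic idea behind using two separate encoders with a shared embedding space.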
Bibliography
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. February 26, 2021.