CLIP - An Image-Text Multimodal Model

Author: AIHGF
Published: January 31, 2022
Category: Image Retrieval, Network Models

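Multimodal models, as discussed here, are models in which two modalities, vision and text, interact.

The usual approach is to extract features from text and images separately to obtain embeddings, combine them through early or late fusion, and then update the parameters according to the pre-training task.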

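[1] - Visual feature extraction

Extract a visual vector directly from the image;
Use object detection, most often Faster R-CNN, to output bounding boxes with their positions, and extract class features for each region;
Potential problems: the extracted information is redundant and noisy, visual information that would help downstream tasks is not captured effectively, and there is a semantic gap with the corresponding text, which makes vision-text alignment difficult.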

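[2] - Text feature extraction

The text encoder is usually initialized from BERT or a similar model; with a large enough training set it may be trained from scratch.

Early fusion: after feature extraction, the features are usually concatenated directly and fed into a transformer;
Late fusion: after feature extraction, each modality is processed further by its own deep network before the modalities interact.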
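
The difference between the two fusion styles can be sketched with toy arrays. This is only an illustration under stated assumptions: `toy_encoder` is a hypothetical stand-in for a real transformer stack, the token counts and embedding width are arbitrary, and mean pooling followed by a dot product is just one possible late-interaction step.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # shared embedding width (assumed)
text_tokens = rng.normal(size=(5, d))    # 5 text token embeddings
image_tokens = rng.normal(size=(9, d))   # 9 image patch/region embeddings


def toy_encoder(tokens, weight):
    """Hypothetical stand-in for a transformer stack: one linear map plus ReLU."""
    return np.maximum(tokens @ weight, 0.0)


w_joint = rng.normal(size=(d, d))
w_text = rng.normal(size=(d, d))
w_image = rng.normal(size=(d, d))

# Early fusion: concatenate along the sequence axis, then encode jointly.
early = toy_encoder(np.concatenate([text_tokens, image_tokens], axis=0), w_joint)

# Late fusion: encode each modality on its own, then let the results interact.
text_vec = toy_encoder(text_tokens, w_text).mean(axis=0)
image_vec = toy_encoder(image_tokens, w_image).mean(axis=0)
late_score = float(text_vec @ image_vec)  # dot product as the interaction step

print(early.shape)  # (14, 8): one joint sequence of 5 + 9 tokens
```

CLIP, described next, sits at the late end of this spectrum: the two encoders never see each other's tokens, and the only interaction is a similarity score between the pooled embeddings.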

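Computer vision: the ability of a computer to see and understand what it sees, in a way similar to humans.

Natural language processing: the ability of a computer to understand language, in a way similar to humans.

CLIP bridges computer vision and natural language processing.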

CLIP

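CLIP stands for Contrastive Language–Image Pre-training.

[1] - A two-stream model: text and images pass through separate encoders (a Transformer for text; a ResNet or ViT for images), and after a linear projection the similarity between image-text pairs is computed;

[2] - Contrastive learning turns image classification into an image-text matching task.

CLIP is trained on a dataset of 400 million image-text pairs collected from the web, using the text as the label for the image. For a downstream task, it is enough to supply text descriptions of the concepts that appear in the images to perform zero-shot transfer.

[3] - Given an input image, CLIP can return the caption (description) or summary that best matches it.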

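Data: 400 million publicly available image-text pairs from the web. To cover as many visual concepts as possible, 500,000 queries were used to search for images, with up to 20,000 image-text pairs kept per query.
Input: a batch of N image-text pairs;

Model: contrastive learning. Given a batch, the model predicts which of the $N\times N$ possible image-text pairings are real matches, which turns image classification into image-text matching: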

[1] - Two streams: two encoders process text and images separately. The text encoder is a Transformer; two families of image encoder are used, ResNet and Vision Transformer (ViT);

5 ResNets: ResNet-50, ResNet-101, and three EfficientNet-style scaled-up ResNets, RN50x4, RN50x16, and RN50x64;
3 ViTs: ViT-B/32, ViT-B/16, and ViT-L/14;

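[2] - Each encoder's representation is linearly projected directly into the multi-modal embedding space;

[3] - Cosine similarity is computed between the two modalities; training maximizes it for the N matched image-text pairs and minimizes it for the $N^2-N$ mismatched pairs;

[4] - A symmetric cross-entropy loss is applied over these similarity scores;

[5] - Data augmentation: only a random square crop from the resized image.

[6] - The classification problem becomes an image-text matching task, so text weakly supervises image classification.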
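
Steps [2] to [4] can be sketched in NumPy as follows. This is a minimal sketch, not the official training code: the function names are hypothetical, the embeddings are assumed to be already projected into the shared space, and the temperature is fixed at 0.07 for simplicity (CLIP learns it during training).

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over an N x N cosine-similarity matrix.

    Row i of image_emb and row i of text_emb are assumed to be a matched pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    diag = np.arange(logits.shape[0])              # matched pairs sit on the diagonal

    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # text -> image
    return (loss_i2t + loss_t2i) / 2


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))
aligned = clip_loss(emb, emb)         # every pair matches perfectly
shuffled = clip_loss(emb, emb[::-1])  # every pair is mismatched
print(aligned < shuffled)             # True
```

Correctly paired embeddings put the large similarities on the diagonal, so the loss is low; reversing the text order moves them off the diagonal and the loss rises.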

Experiments

Zero-shot Transfer

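In image classification, zero-shot usually refers to inference on unseen classes.

In the CLIP paper, zero-shot refers to inference on unseen tasks, and zero-shot transfer is used to measure task-learning ability.

Visual N-Grams (Li et al., 2017) was the first model to study zero-shot transfer to existing image classification datasets in this way. It learns a dictionary of 142,806 visual n-grams, spanning 1-grams to 5-grams, and for an input image maximizes the probability of the corresponding n-grams.

Similarly, for zero-shot transfer CLIP converts the class labels of a dataset into text descriptions. The main steps are: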

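Input: one image plus the texts derived from all classes (100 classes give 100 text descriptions);
Encoding: pass them through the two encoders to obtain the image and text feature embeddings;
Compute the cosine similarity between the image embedding and each text embedding;
Predict the class: scale the similarities by a temperature and apply softmax, which amounts to a multinomial logistic regression classifier whose weights are the text embeddings.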

Zero-shot prediction code

Zero-shot prediction takes an input image, searches over the class descriptions, and returns the best-matching class.

# Ref: https://github.com/openai/CLIP
import os
import clip
import torch
from torchvision.datasets import CIFAR100

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Download the dataset
cifar100 = CIFAR100(root=os.path.expanduser("~/.cache"), download=True, train=False)

# Prepare the inputs
image, class_id = cifar100[3637]
image_input = preprocess(image).unsqueeze(0).to(device)
text_inputs = torch.cat([clip.tokenize(f"a photo of a {c}") for c in cifar100.classes]).to(device)
# One prompt per CIFAR-100 class; the input image is matched against all of them

# Calculate features
with torch.no_grad():
    image_features = model.encode_image(image_input)
    text_features = model.encode_text(text_inputs)

# Pick the top 5 most similar labels for the image
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)
values, indices = similarity[0].topk(5)

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{cifar100.classes[index]:>16s}: {100 * value.item():.2f}%")

"""
Top predictions:
           snake: 65.31%
          turtle: 12.29%
    sweet_pepper: 3.83%
          lizard: 1.88%
       crocodile: 1.75%
"""

Linear-probe evaluation

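Treat the pre-trained model as a feature extractor, add a classification head, and train only that head. Compared with fine-tuning, the intuition behind this approach is that "a good feature should be able to separate different classes". In addition, strong fine-tuning results may simply reflect an architecture that suits the downstream task, whereas linear-probe results depend only on feature quality.

Use CLIP's image encoder to obtain visual feature vectors, then fit a Logistic Regression on them with the labels.

# Ref: https://github.com/openai/CLIP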
import os
import clip
import torch

import numpy as np
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR100
from tqdm import tqdm

# Load the model
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load('ViT-B/32', device)

# Load the dataset
root = os.path.expanduser("~/.cache")
train = CIFAR100(root, download=True, train=True, transform=preprocess)
test = CIFAR100(root, download=True, train=False, transform=preprocess)

def get_features(dataset):
    all_features = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(DataLoader(dataset, batch_size=100)):
            features = model.encode_image(images.to(device))

            all_features.append(features)
            all_labels.append(labels)

    return torch.cat(all_features).cpu().numpy(), torch.cat(all_labels).cpu().numpy()

# Calculate the image features
train_features, train_labels = get_features(train)
test_features, test_labels = get_features(test)

# Perform logistic regression
classifier = LogisticRegression(random_state=0, C=0.316, max_iter=1000, verbose=1)  # C (inverse regularization strength) is user-chosen
classifier.fit(train_features, train_labels)

# Evaluate using the logistic regression classifier
predictions = classifier.predict(test_features)
accuracy = np.mean((test_labels == predictions).astype(float)) * 100.  # np.float was removed in NumPy 1.24
print(f"Accuracy = {accuracy:.3f}")

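Limitations of CLIP

[1] - The comparison is not against SOTA: the analysis above mostly compares zero-shot CLIP with a linear classifier on top of ResNet-50 features, whereas most of these datasets have dedicated SOTA models. The authors estimate that zero-shot CLIP would need around 1000x more compute to reach SOTA, which is not feasible with current hardware;

[2] - Weak performance on some fine-grained classification and other kinds of tasks:

a. The earlier experiments show that the model struggles to differentiate models of cars, species of flowers, and variants of aircraft;
b. It does poorly on abstract and systematic tasks, such as counting the number of objects in an image;
c. It does poorly on novel tasks that are unlikely to appear in the training set, such as classifying the distance to the nearest car in a photo;

[3] - It performs poorly on image types that do not appear in the training set (out-of-distribution). For example, its OCR works well on digitally rendered text, yet it reaches only 88% accuracy on the handwritten digits of MNIST;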

Other work

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation

https://arxiv.org/pdf/2107.07651.pdf
https://github.com/salesforce/ALBEF

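Visual tokens and text tokens are extracted without being aligned, which makes modeling image-text feature interactions difficult.
Earlier models used bounding boxes to extract region features. This paper proposes ALBEF, which first aligns the image and text representations with a contrastive loss and then fuses them through cross-modal attention to learn joint image-text representations. To learn better from noisy web data, it also proposes momentum distillation, a self-training method that learns from pseudo-targets produced by a momentum model. Experiments show that the model reaches SOTA on multiple downstream tasks.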
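
The idea behind momentum distillation can be sketched as follows. This is a simplified illustration rather than ALBEF's actual code: the function names are hypothetical, the momentum `m` and mixing weight `alpha` are values chosen for illustration, and the logits are random stand-ins for image-text similarity scores.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def ema_update(momentum_params, params, m=0.995):
    """Momentum model: an exponential moving average of the online model's weights."""
    return {k: m * momentum_params[k] + (1.0 - m) * params[k] for k in params}


def distilled_targets(momentum_logits, alpha=0.4):
    """Blend one-hot labels (diagonal = matched pair) with the momentum model's pseudo-targets."""
    n = momentum_logits.shape[0]
    return alpha * softmax(momentum_logits, axis=1) + (1.0 - alpha) * np.eye(n)


def soft_cross_entropy(logits, targets):
    """Cross-entropy of the online model's predictions against soft targets."""
    log_probs = np.log(softmax(logits, axis=1))
    return float(-(targets * log_probs).sum(axis=1).mean())


rng = np.random.default_rng(0)
online_logits = rng.normal(size=(4, 4))    # similarities from the online model
momentum_logits = rng.normal(size=(4, 4))  # similarities from the momentum model
targets = distilled_targets(momentum_logits)
loss = soft_cross_entropy(online_logits, targets)
```

Because the targets are no longer strictly one-hot, a caption that plausibly describes an image is not fully penalized just because the noisy web data paired it with a different image.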

Applications

1. CLIP-Art

CLIP-Art: Contrastive Pre-training for Fine-Grained Art Classification (CVPR 2021 Workshops)
https://github.com/KeremTurgutlu/clip_art

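It addresses two problems:

Instance retrieval
Fine-grained artwork attribute recognition

Convert the noisy fine-grained class annotations of each image into natural-language text; on the iMet dataset, more than 15 text descriptions were generated per image
Fine-tune the ViT-B/32 CLIP model with the InfoNCE loss (maximizing the mutual information between positive pairs and minimizing it between negative pairs)
Domain-adapt $CLIP_{art}$ by fine-tuning it further on the fine-grained art recognition task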

2. CLIP + StyleGAN

https://github.com/vipermu/StyleCLIP
Uses CLIP and StyleGAN to generate portraits in the style of "My Little Pony".
https://twitter.com/metasemantic/status/1368713208429764616

A pony that looks like Elvis Presley

Generating Images from Prompts using CLIP and StyleGAN

3. Semantic search

https://twitter.com/metasemantic/status/1349446585952989186

4. Cross-modal search

5. Image classification

CLIP vs. ResNext

6. StyleGAN-NADA

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators
https://github.com/rinongal/StyleGAN-nada
https://stylegan-nada.github.io/

7. StyleCLIP

StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
https://github.com/orpatashnik/StyleCLIP

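Materials

[1] - A Beginner's Guide to the CLIP Model
https://www.kdnuggets.com/2021/03/beginners-guide-clip-model.html

[2] - CLIP from OpenAI: what is it and how you can try it out yourself
https://inmeta.medium.com/clip-from-openai-what-is-it-and-how-you-can-try-it-out-yourself-6f9a870efe00

[3] - Simple Implementation of OpenAI CLIP model: A Tutorial
https://towardsdatascience.com/simple-implementation-of-openai-clip-model-a-tutorial-ace6ff01d9f2

[4] - Generating Images from Prompts using CLIP and StyleGAN
https://towardsdatascience.com/generating-images-from-prompts-using-clip-and-stylegan-1f9ed495ddda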

Last modified: February 11, 2022


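2 comments

yangapku
November 11th, 2022 at 03:43 pm

Hi everyone, we have open-sourced a Chinese CLIP model, Chinese-CLIP, on GitHub. It comes in several model sizes, performs well on downstream tasks, and includes a detailed technical report and a demo: https://github.com/OFA-Sys/Chinese-CLIP. We hope you will try it, star it, and send us your feedback.

1. AIHGF
  November 13th, 2022 at 10:33 pm

  @yangapku

  I've been following it. DAMO Academy, nice work!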


