NAME
FG-CLIP — A new generation of CLIP with strong fine-grained discrimination capability, ICML 2025
SYNOPSIS
pip install -e .
DESCRIPTION
A new generation of CLIP with strong fine-grained discrimination capability, ICML 2025
README
FG-CLIP 2: A Chinese-English Bilingual Vision-Language Alignment Model
This repository is the official implementation of FG-CLIP and FG-CLIP 2, a new generation of text-image cross-modal models that excel at fine-grained understanding. FG-CLIP 2 supports both Chinese and English: across 29 datasets and 8 diverse task types, it surpasses strong baselines including SigLIP 2 and MetaCLIP 2, achieving state-of-the-art performance in both languages.
FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Ji Ao, Dawei Leng†, Yuhui Yin (*Equal Contribution, ✝Corresponding Author)
FG-CLIP: Fine-Grained Visual and Textual Alignment (code branch: v1.0)
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, ✝Corresponding Author)
🔥 News
- 🚀 [2025/10/14] We released the FG-CLIP 2 code and model weights.
- 🚀 [2025/10/14] We released the paper FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model.
- 🚀 [2025/09/29] We open-sourced the MCP server implementation of FG-CLIP; see FGCLIP-MCP for details.
- 🚀 [2025/07/29] We provide API access to the FG-CLIP 2 base model, which significantly outperforms FG-CLIP; see research.360.cn for details.
- 🚀 [2025/07/09] We built two demos, for fine-grained retrieval and dense feature display.
- 🚀 [2025/05/09] We uploaded the model to 🤗 (https://huggingface.co/qihoo360/fg-clip-large) for quick and easy use!
- 🚀 [2025/05/09] We updated the FG-CLIP GitHub repository; you can now test our model!
- 🚀 [2025/05/09] We released the paper FG-CLIP: Fine-Grained Visual and Textual Alignment.
- 🚀 [2025/05/02] FG-CLIP was accepted by ICML'25.
Fine-grained vision-language understanding requires precise alignment between visual content and linguistic descriptions, a capability that remains limited in current models, particularly in non-English settings. While models like CLIP perform well on global alignment, they often struggle to capture fine-grained details in object attributes, spatial relations, and linguistic expressions, with limited support for bilingual comprehension. To address these challenges, we introduce FG-CLIP 2, a bilingual vision-language model designed to advance fine-grained alignment for both English and Chinese. The key ingredients of FG-CLIP 2 are summarized below.
- Rich Fine-Grained Supervision. We incorporate region-text matching and long-caption modeling, alongside multiple discriminative objectives. We further introduce the Textual Intra-modal Contrastive (TIC) loss to better distinguish semantically similar captions.
- Bilingual Multimodal Data. Trained on a carefully curated mixture of large-scale English and Chinese data, FG-CLIP 2 achieves powerful bilingual performance.
- Performance. Extensive experiments on 29 datasets across 8 tasks show that FG-CLIP 2 outperforms existing methods, achieving state-of-the-art results in both languages.
- Chinese Multimodal Benchmark. To enable rigorous evaluation, we present a new benchmark for Chinese multimodal understanding, featuring long-caption retrieval and bounding box classification.
Model Architecture
Our method adopts a two-stage hierarchical learning framework that progressively strengthens vision-language alignment, moving from global semantics to fine-grained details.
Stage 1: Global Semantic Alignment
We start from large-scale image-text pairs, where each pair carries a short caption (a concise scene-level description) and a long caption (rich contextual detail). Training on this bilingual corpus yields strong global alignment and lays a solid foundation for cross-modal understanding in both English and Chinese.
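For intuition, here is a minimal PyTorch sketch of such dual-caption global alignment; the symmetric InfoNCE form, the equal weighting of short and long captions, and the temperature value are illustrative assumptions, not the exact training recipe:

import torch
import torch.nn.functional as F

def global_alignment_loss(image_emb, short_emb, long_emb, temperature=0.07):
    # image_emb, short_emb, long_emb: (B, D); row i of each text batch matches image i.
    image_emb = F.normalize(image_emb, dim=-1)
    labels = torch.arange(image_emb.size(0), device=image_emb.device)
    loss = 0.0
    for text_emb in (short_emb, long_emb):
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix
        # Symmetric contrastive loss over image-to-text and text-to-image retrieval.
        loss = loss + 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
    return loss / 2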
Stage 2: Fine-Grained Vision-Language Learning
Building on the globally aligned representations, we introduce region-level supervision and multiple fine-grained objectives to strengthen local correspondence. Specifically:
- Fine-grained visual learning: region-text alignment between region features extracted via RoIAlign and phrase-level descriptions.
- Fine-grained textual learning: distinguishing subtle textual differences using hard negative samples generated by attribute perturbation.
- Cross-modal ranking loss with global threshold synchronization: a ranking loss with a dynamic margin, using a globally synchronized threshold for stable hard-negative mining.
- Textual intra-modal contrastive (TIC) loss: contrastive learning within a single language to separate semantically similar yet distinct region descriptions (see the sketch after this list).
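To make the TIC idea concrete, here is a minimal PyTorch sketch; the tensor shapes, the temperature, and the use of attribute-perturbed captions as the negative set are illustrative assumptions rather than the paper's exact formulation:

import torch
import torch.nn.functional as F

def tic_loss(anchor_emb, positive_emb, negative_emb, temperature=0.07):
    # anchor_emb:   (B, D)    region-caption embeddings
    # positive_emb: (B, D)    matching (e.g. paraphrased) caption embeddings
    # negative_emb: (B, K, D) attribute-perturbed hard-negative caption embeddings
    anchor = F.normalize(anchor_emb, dim=-1)
    pos = F.normalize(positive_emb, dim=-1)
    neg = F.normalize(negative_emb, dim=-1)
    pos_logit = (anchor * pos).sum(dim=-1, keepdim=True)   # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", anchor, neg)   # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / temperature
    # The positive sits at index 0 of every row.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)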
Installation
conda create -n FGCLIP2 python=3.10 -y
conda activate FGCLIP2
cd FG-CLIP && pip install -e .
Model Zoo
| Model | Vision Encoder | Weights | Demo |
|---|---|---|---|
| FG-CLIP-Base | vit-base-patch16-224 | 🤗Huggingface | Retrieval & Dense Feature |
| FG-CLIP-Large | vit-large-patch14-336 | 🤗Huggingface | |
| FG-CLIP2-Base | vit-base-patch16 | 🤗Huggingface | Retrieval & Dense Feature |
| FG-CLIP2-Large | vit-large-patch16 | 🤗Huggingface | |
| FG-CLIP2-So400m | vit-so400m-patch16 | 🤗Huggingface | |
Benchmarks
Quick Start 🤗
Load the Model
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)

model_root = "fgclip2-base-patch16"
model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()
device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)
Retrieval
def determine_max_value(image):
    # Pick a patch budget based on how many 16x16 patches the image contains.
    w, h = image.size
    max_val = (w // 16) * (h // 16)
    if max_val > 784:
        return 1024
    elif max_val > 576:
        return 784
    elif max_val > 256:
        return 576
    elif max_val > 128:
        return 256
    else:
        return 128

img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")

image_input = image_processor(images=image, max_num_patches=determine_max_value(image), return_tensors="pt").to(device)
NOTE: for short captions, use max_length=64 and walk_type="short" (the default).
NOTE: for long captions, use max_length=196 and walk_type="long".
captions = [ "一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双浅色鞋子,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。", "一个简约风格的卧室角落,黑色金属衣架上挂着多件红色和蓝色的衣物,下方架子放着两双黑色高跟鞋,旁边是一盆绿植,左侧可见一张铺有白色床单和灰色枕头的床。", "一个简约风格的卧室角落,黑色金属衣架上挂着多件米色和白色的衣物,下方架子放着两双运动鞋,旁边是一盆仙人掌,左侧可见一张铺有白色床单和灰色枕头的床。", "一个繁忙的街头市场,摊位上摆满水果,背景是高楼大厦,人们在喧闹中购物。" ] captions = [caption.lower() for caption in captions]
caption_input = tokenizer(captions, padding="max_length", max_length=196, truncation=True, return_tensors="pt").to(device)
with torch.no_grad(): image_feature = model.get_image_features(**image_input) text_feature = model.get_text_features(**caption_input,walk_type="long") image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True) text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
logits_per_image = image_feature @ text_feature.T logit_scale, logit_bias = model.logit_scale.to(text_feature.device), model.logit_bias.to(text_feature.device) logits_per_image = logits_per_image * logit_scale.exp() + logit_bias
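Since the scores above apply a learned scale and bias (SigLIP-style), per-pair matching probabilities can be read out with a sigmoid; a short usage note under that assumption:

# Each entry is the probability that the image matches the corresponding caption.
probs = torch.sigmoid(logits_per_image)  # shape (1, num_captions)
print(probs)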
Dense Feature Visualization
import math
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

def resize_short_edge(image, target_size=2048):
    # Upscale so the short edge reaches target_size; leave larger images unchanged.
    if isinstance(image, str):
        image = Image.open(image)
    width, height = image.size
    short_edge = min(width, height)
    if short_edge >= target_size:
        return image
    scale = target_size / short_edge
    new_width = int(width * scale)
    new_height = int(height * scale)
    resized_image = image.resize((new_width, new_height))
    return resized_image

img_root = "cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = resize_short_edge(image, target_size=2048)

image_input = image_processor(images=image, max_num_patches=16384, return_tensors="pt").to(device)
captions = ["电脑", "黑猫", "窗户", "window", "white cat", "book"]

with torch.no_grad():
    dense_image_feature = model.get_image_dense_feature(**image_input)
    # Keep only the tokens that correspond to real (non-padded) patches.
    spatial_values = image_input["spatial_shapes"][0]
    real_h = spatial_values[0].item()
    real_w = spatial_values[1].item()
    real_pixel_tokens_num = real_w * real_h
    dense_image_feature = dense_image_feature[0][:real_pixel_tokens_num]
    captions = [caption.lower() for caption in captions]
    caption_input = tokenizer(captions, padding="max_length", max_length=64, truncation=True, return_tensors="pt").to(device)
    text_feature = model.get_text_features(**caption_input, walk_type="box")
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

similarity = dense_image_feature @ text_feature.T
similarity = similarity.cpu()

# Lay out one heatmap per caption on a grid that preserves the image aspect ratio.
num_classes = len(captions)
cols = 3
rows = (num_classes + cols - 1) // cols
aspect_ratio = real_w / real_h
fig_width_inch = 3 * cols
fig_height_inch = fig_width_inch / aspect_ratio * rows / cols

fig, axes = plt.subplots(rows, cols, figsize=(fig_width_inch, fig_height_inch))
fig.subplots_adjust(wspace=0.01, hspace=0.01)

if num_classes == 1:
    axes = [axes]
else:
    axes = axes.flatten()

for cls_index in range(num_classes):
    similarity_map = similarity[:, cls_index].cpu().numpy()
    show_image = similarity_map.reshape((real_h, real_w))
    ax = axes[cls_index]
    ax.imshow(show_image, cmap='viridis', aspect='equal')
    ax.set_xticks([])
    ax.set_yticks([])
    ax.axis('off')

for idx in range(num_classes, len(axes)):
    axes[idx].axis('off')

savename = "FGCLIP2_dfcolor_cat_all_2K.png"
plt.savefig(savename, dpi=150, bbox_inches='tight', pad_inches=0.05)
plt.close()
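Beyond per-caption heatmaps, the same similarity tensor gives a coarse per-patch label map by taking an argmax over captions; a minimal sketch reusing the variables above (the output filename is illustrative):

# Assign every patch to its best-matching caption and render the result.
label_map = similarity.argmax(dim=1).reshape(real_h, real_w).numpy()
plt.imshow(label_map, cmap='tab10')
plt.axis('off')
plt.savefig("FGCLIP2_label_map.png", dpi=150, bbox_inches='tight', pad_inches=0.05)
plt.close()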
Training
Data Preparation
We provide code for stage-2 training on the 🤗FineHARD dataset. FineHARD contains 12 million images, 40 million bounding boxes with fine-grained region captions, and 10 million hard negative samples.
For data preparation, please refer to Data: FineHARD.
Training Setup
Our training and inference code is built entirely on Hugging Face's transformers library, making it easy to use and reproduce. Training scripts are provided in the scripts directory.
🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
Our training scripts support ZeRO-2, TF32 acceleration, and bf16 precision (note that fp16 may produce NaN gradients). If your environment does not support these, disable TF32 and launch with torchrun instead of deepspeed; a sketch of the TF32 switches follows the command below.
bash scripts/train/stage2_fgclip2.sh
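If you need to turn TF32 off as noted above, the standard PyTorch switches are shown here; set them before training starts (a sketch, separate from the provided scripts):

import torch

# Disable TF32 for matrix multiplications and cuDNN convolutions.
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False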
Evaluation
Data Preparation
Download share-captioner_coco_lcs_sam_1246k_1107.json from https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/blob/main/share-captioner_coco_lcs_sam_1246k_1107.json
Download CocoCaptions from the link below and place it in data/coco/annotations/:
https://github.com/tylin/coco-caption
Download COCO from the link below and place it in data/coco:
https://cocodataset.org/dataset
The DCI captions come from the link below; place them in data/densely_captioned_images:
https://github.com/facebookresearch/DCI
Place ImageNet-1K in data/IN1K_val.
Place ImageNet-v2, downloaded from the link below, in data/imagenetv2-matched-frequency-format-val:
https://opendatalab.com/OpenDataLab/ImageNetV2/tree/main
bash scripts/eval/eval.sh
We Are Hiring
We are recruiting academic interns in multimodal research. If interested, please send your resume to xiechunyu@360.cn.
Citation
If you find FG-CLIP 2 helpful in your research or applications, please cite it using the following BibTeX:
@article{xie2025fg2,
title={FG-CLIP 2: A Bilingual Fine-grained Vision-language Alignment Model},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Ao, Ji and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2510.10921},
year={2025}
}
@article{xie2025fg,
title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Zhang, Gengshen and Leng, Dawei and Yin, Yuhui},
journal={arXiv preprint arXiv:2505.05071},
year={2025}
}