TextCrafter:

TextCrafter logo
TextCrafter CVTG benchmark hero poster

Investigating Text Insulation and Attention Mechanisms for Complex Visual Text Generation

A project page for complex visual text generation with multi-text insulation, Text-oriented Attention, and the CVTG-2K benchmark.

Ying Tai1, Nikai Du1, Rui Xie1*, Zhennan Chen1*, Qian Wang2, Zhengkai Jiang3, Kai Zhang1, Jian Yang1

School of Intelligence Science and Technology, Nanjing University · Jiutian Research · The Hong Kong University of Science and Technology

*Corresponding author(s): Rui Xie, Zhennan Chen.

These authors contributed equally to this work.

TextCrafter hero card 1
TextCrafter hero card 2
TextCrafter hero card 3
TextCrafter hero card 4

Demo Video

What is TextCrafter?

TextCrafter is a Complex Visual Text Generation (CVTG) framework inspired by selective visual attention in cognitive science, introducing multi-text insulation and text-oriented attention to improve generation quality in complex multi-text scenes.

Complex Visual Text Generation

"A contemporary and playful slide designed like a vibrant bulletin board, displaying textual content pinned on colorful sticky notes arranged casually yet clearly across a textured corkboard background. At the top-left corner, the bold, handwritten-style heading "Healthy Eating Made Simple" stands out clearly, accompanied by a short note: "Small habits make big differencespractice daily for lasting wellbeing". Nearby toward the upper center area, a bright yellow note reads "Mindful Portions", with smaller explanatory text: "Listen to your body's signals, eat slowly and stop before overly full". Just below and to the right, a pastel-green note labeled "Include Vegetables" explains succinctly: "Aim to fill half your plate with colorful veggies each day". Moving leftward toward the lower section, a soft-blue sticky note titled "Stay Hydrated" contains the brief sentence: "Drink sufficient water regularly to maintain energy and focus". Decorative elements on the corkboard, such as small doodled fruits, vegetables, and paper clips scattered around the borders, enhance visual warmth and highlight the approachable nature of the presented messages."

TextCrafter
TextCrafter generated bulletin-board style health slide

Datasets for TextCrafter

Explore the dedicated datasets introduced for TextCrafter to evaluate complex visual text generation under more realistic, multi-text, and attribute-aware settings.

CVTG-2K

CVTG-2K is a dedicated benchmark tailored to the CVTG task. It comprises 2,000 high-quality prompts with diverse region quantities ranging from 2 to 5 and rich visual attributes such as color, font, and size. Unlike previous datasets that predominantly focus on single-region or fixed-template scenarios, CVTG-2K combines global context, multiple text contents, explicit position requirements, and fine-grained visual attributes into a more challenging and realistic evaluation setting.

Combined benchmark complexity and scene distribution figures for CVTG-2K and CVTG-Hard

Technical Framework

TextCrafter combines multi-text insulation with quotation-guided text-oriented attention to suppress cross-text interference and keep visual text tokens concentrated in the correct regions.

Technical framework overview for TextCrafter derived from method.pdf

Experimental Results

Across CVTG-2K, CVTG-Hard, LongText-Bench, and Geneval, TextCrafter consistently improves text fidelity in complex multi-text scenes while preserving strong general text-to-image performance.

Model Performance Comparison

Quantitative comparisons in the paper show stronger word and span accuracy, higher normalized edit distance, and more robust long-text rendering on difficult benchmarks, while maintaining competitive quality on a general-purpose text-to-image benchmark.

`*` indicates results cited from previous papers.

Model
Word Accuracy
NED
CLIPScore
VQAScore
Aesthetics
FLUX.1 dev[Black Forest Labs 2024]
0.4965
0.6879
0.7401
0.8886
5.91
GPT Image 1 [High]*[OpenAI 2025]
0.8569
0.9478
0.7982
-
-
Gemini 2.5 Flash Image*[Google 2025]
0.7364
0.8516
-
-
-
Seedream 4.5*[ByteDance 2025]
0.8990
0.9483
0.8069
-
-
Qwen-Image*[Alibaba 2025]
0.8288
0.9116
0.8017
-
-
Z-Image*[Alibaba 2025]
0.8671
0.9367
0.7969
-
-
HunyuanImage-3.0*[Tencent 2025]
0.7650
0.8765
0.8121
-
-
Longcat-Image*[Meituan 2025]
0.8658
0.9361
0.7859
-
-
Emu3.5*[BAAI 2025]
0.9123
0.9656
-
-
-
GLM-Image*[Z.ai 2026]
0.9116
0.9557
0.7877
-
-
SD3.5 Large[ICML 2024]
0.6548
0.8470
0.7797
0.9297
5.56
AnyText[ICLR 2024]
0.1804
0.4675
0.7432
0.6935
4.53
TextDiffuser-2[ECCV 2024]
0.2326
0.4353
0.6765
0.5627
4.51
RAG-Diffusion[ICCV 2025]
0.2648
0.4498
0.6688
0.6397
5.58
3DIS[ICLR 2025]
0.3813
0.6505
0.7767
0.8684
4.86
TextCrafter (Qwen-Image)
0.9400(+13.4%)
0.9757(+7.0%)
0.8305(+3.6%)
0.9570
5.90

BibTeX

@misc{tai2026investigatingtextinsulationattention,
  title={Investigating Text Insulation and Attention Mechanisms for Complex Visual Text Generation},
  author={Ying Tai and Nikai Du and Rui Xie and Zhennan Chen and Qian Wang and Zhengkai Jiang and Kai Zhang and Jian Yang},
  year={2026},
  eprint={2503.23461},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.23461},
}