TextCrafter:

Investigating Text Insulation and Attention Mechanisms for Complex Visual Text Generation

A project page for complex visual text generation with multi-text insulation, Text-oriented Attention, and the CVTG-2K benchmark.

Research Paper Technical Framework

Ying Tai¹^†, Nikai Du¹^†, Rui Xie¹^*, Zhennan Chen¹^*, Qian Wang², Zhengkai Jiang³, Kai Zhang¹, Jian Yang¹

School of Intelligence Science and Technology, Nanjing University · Jiutian Research · The Hong Kong University of Science and Technology

*Corresponding author(s): Rui Xie, Zhennan Chen.

†These authors contributed equally to this work.

Research Paper GitHub 🤗HuggingFace HF Demo Datasets Demo Video

TextCrafter:

Investigating Text Insulation and Attention Mechanisms for Complex Visual Text Generation

A project page for complex visual text generation with multi-text insulation, Text-oriented Attention, and the CVTG-2K benchmark.

Research Paper Technical Framework

Ying Tai¹^†, Nikai Du¹^†, Rui Xie¹^*, Zhennan Chen¹^*, Qian Wang², Zhengkai Jiang³, Kai Zhang¹, Jian Yang¹

School of Intelligence Science and Technology, Nanjing University · Jiutian Research · The Hong Kong University of Science and Technology

*Corresponding author(s): Rui Xie, Zhennan Chen.

†These authors contributed equally to this work.

Research Paper GitHub 🤗HuggingFace HF Demo Datasets Demo Video

Demo Video

What is TextCrafter?

TextCrafter is a Complex Visual Text Generation (CVTG) framework inspired by selective visual attention in cognitive science, introducing multi-text insulation and text-oriented attention to improve generation quality in complex multi-text scenes.

Complex Visual Text Generation

"A contemporary and playful slide designed like a vibrant bulletin board, displaying textual content pinned on colorful sticky notes arranged casually yet clearly across a textured corkboard background. At the top-left corner, the bold, handwritten-style heading "Healthy Eating Made Simple" stands out clearly, accompanied by a short note: "Small habits make big differencespractice daily for lasting wellbeing". Nearby toward the upper center area, a bright yellow note reads "Mindful Portions", with smaller explanatory text: "Listen to your body's signals, eat slowly and stop before overly full". Just below and to the right, a pastel-green note labeled "Include Vegetables" explains succinctly: "Aim to fill half your plate with colorful veggies each day". Moving leftward toward the lower section, a soft-blue sticky note titled "Stay Hydrated" contains the brief sentence: "Drink sufficient water regularly to maintain energy and focus". Decorative elements on the corkboard, such as small doodled fruits, vegetables, and paper clips scattered around the borders, enhance visual warmth and highlight the approachable nature of the presented messages."

TextCrafter

Datasets for TextCrafter

Explore the dedicated datasets introduced for TextCrafter to evaluate complex visual text generation under more realistic, multi-text, and attribute-aware settings.

CVTG-2K

CVTG-2K is a dedicated benchmark tailored to the CVTG task. It comprises 2,000 high-quality prompts with diverse region quantities ranging from 2 to 5 and rich visual attributes such as color, font, and size. Unlike previous datasets that predominantly focus on single-region or fixed-template scenarios, CVTG-2K combines global context, multiple text contents, explicit position requirements, and fine-grained visual attributes into a more challenging and realistic evaluation setting.

Technical Framework

TextCrafter combines multi-text insulation with quotation-guided text-oriented attention to suppress cross-text interference and keep visual text tokens concentrated in the correct regions.

Experimental Results

Across CVTG-2K, CVTG-Hard, LongText-Bench, and Geneval, TextCrafter consistently improves text fidelity in complex multi-text scenes while preserving strong general text-to-image performance.

Model Performance Comparison

Quantitative comparisons in the paper show stronger word and span accuracy, higher normalized edit distance, and more robust long-text rendering on difficult benchmarks, while maintaining competitive quality on a general-purpose text-to-image benchmark.

`*` indicates results cited from previous papers.

Model	Word Accuracy↑	NED↑	CLIPScore↑	VQAScore↑	Aesthetics↑
FLUX.1 dev[Black Forest Labs 2024]	0.4965	0.6879	0.7401	0.8886	5.91
GPT Image 1 [High]^*[OpenAI 2025]	0.8569	0.9478	0.7982	-	-
Gemini 2.5 Flash Image^*[Google 2025]	0.7364	0.8516	-	-	-
Seedream 4.5^*[ByteDance 2025]	0.8990	0.9483	0.8069	-	-
Qwen-Image^*[Alibaba 2025]	0.8288	0.9116	0.8017	-	-
Z-Image^*[Alibaba 2025]	0.8671	0.9367	0.7969	-	-
HunyuanImage-3.0^*[Tencent 2025]	0.7650	0.8765	0.8121	-	-
Longcat-Image^*[Meituan 2025]	0.8658	0.9361	0.7859	-	-
Emu3.5^*[BAAI 2025]	0.9123	0.9656	-	-	-
GLM-Image^*[Z.ai 2026]	0.9116	0.9557	0.7877	-	-
SD3.5 Large[ICML 2024]	0.6548	0.8470	0.7797	0.9297	5.56
AnyText[ICLR 2024]	0.1804	0.4675	0.7432	0.6935	4.53
TextDiffuser-2[ECCV 2024]	0.2326	0.4353	0.6765	0.5627	4.51
RAG-Diffusion[ICCV 2025]	0.2648	0.4498	0.6688	0.6397	5.58
3DIS[ICLR 2025]	0.3813	0.6505	0.7767	0.8684	4.86
TextCrafter (Qwen-Image)	0.9400(+13.4%)	0.9757(+7.0%)	0.8305(+3.6%)	0.9570	5.90

Platform Gallery

Representative gallery samples highlight TextCrafter's ability to render complex visual texts across diverse scenes while mitigating text misgeneration, omissions, and hallucinations.

BibTeX

@misc{tai2026investigatingtextinsulationattention,
  title={Investigating Text Insulation and Attention Mechanisms for Complex Visual Text Generation},
  author={Ying Tai and Nikai Du and Rui Xie and Zhennan Chen and Qian Wang and Zhengkai Jiang and Kai Zhang and Jian Yang},
  year={2026},
  eprint={2503.23461},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.23461},
}

TextCrafter:

TextCrafter:

Demo Video

What is TextCrafter?

Datasets for TextCrafter

CVTG-2K

Technical Framework

Multi-text Insulation

Text-oriented Attention

Experimental Results

Model Performance Comparison

Platform Gallery

BibTeX