Microsoft and Tsinghua Unveil Glyph-ByT5-v2: Stunning Posters in 10 Languages!
A Breakthrough Tool for High-Quality Multilingual Text-to-Image Generation, Supporting English, Chinese, Japanese, Korean, and More.
Tsinghua University, Peking University, Microsoft, and the University of Liverpool have jointly introduced Glyph-ByT5-v2, a tool for multilingual text-to-image generation. It supports 10 languages: English, Chinese, Japanese, Korean, French, German, Spanish, Italian, Portuguese, and Russian.
Below, we showcase the visual text results in Chinese, English, Japanese, and Korean to give you an idea.
Recently, Glyph-ByT5 achieved high precision in visual text rendering for graphic design images, but it was limited to English and lacked visual appeal.
In this work, we address these limitations with Glyph-ByT5-v2. It supports accurate visual text rendering in 10 different languages and achieves better aesthetic quality.
To achieve this, we made the following contributions:
1. Created a high-quality multilingual glyph text and graphic design dataset, with over 1 million glyph text pairs and 10 million graphic design image-text pairs, covering 9 additional languages.
2. Built a multilingual visual paragraph benchmark with 1,000 prompts (100 per language) to evaluate multilingual visual spelling accuracy.
3. Applied the latest step-aware preference learning method to improve visual aesthetics.
Combining these technologies, we present the powerful custom multilingual text encoder Glyph-ByT5-v2 and the visually appealing graphic generation model Glyph-SDXL-v2, both supporting accurate spelling in 10 different languages.
Given that the latest DALL·E 3 and Ideogram still struggle with multilingual visual text rendering, we believe our work marks a significant advancement.
Improving Multilingual Visual Text Rendering Accuracy
The table above shows results for multilingual visual text rendering.
Our method is evaluated across a range of character counts.
We tested word-level accuracy for seven languages and character-level accuracy for three languages. All results come from one model, not separate models for each language.
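To make the difference between the two granularities concrete, here is a minimal sketch in Python. It is not the benchmark's actual scoring code: the function names, the order-independent overlap counting, and the idea that the rendered text has already been read back from the image (for example by OCR) are all assumptions made for illustration, and the paper's exact metric may differ.

```python
from collections import Counter


def _overlap_accuracy(expected_units, rendered_units):
    """Share of expected units that also occur in the rendered units.

    Hypothetical helper: counts matches as a multiset overlap, ignoring order.
    """
    if not expected_units:
        return 1.0  # nothing was requested, so nothing can be misspelled
    matched = sum((Counter(expected_units) & Counter(rendered_units)).values())
    return matched / len(expected_units)


def word_accuracy(expected: str, rendered: str) -> float:
    """Word-level score: split both strings on whitespace and compare words."""
    return _overlap_accuracy(expected.split(), rendered.split())


def char_accuracy(expected: str, rendered: str) -> float:
    """Character-level score: compare individual non-space characters."""
    def chars(text):
        return [c for c in text if not c.isspace()]
    return _overlap_accuracy(chars(expected), chars(rendered))


if __name__ == "__main__":
    # One wrong word out of three.
    print(word_accuracy("Summer Sale Today", "Summer Sole Today"))
    # One wrong character out of four.
    print(char_accuracy("新年快乐", "新年快东"))
```

In this sketch a single wrong letter costs a whole word at word level, while at character level it costs only that one character, so the two scores are not directly comparable across languages.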
Enhancing Aesthetic Quality
User studies show that participants prefer the graphic design images generated by our method in terms of spelling accuracy, layout quality, and visual appeal across multiple languages.
Results Showcase
The images below show the effect of Step-aware Preference Optimization (SPO) post-training.
The rows show images from:
1. Glyph-SDXL
2. Glyph-SDXL Albedo
3. Glyph-SDXL Albedo + SPO
The next image compares our multilingual generation results with those of DALL·E 3 and Ideogram 1.0.
More results.
Conclusion
We created an improved custom multilingual text encoder for precise text rendering.
We built large, high-quality datasets of multilingual glyph text and graphic designs. These helped us train Glyph-ByT5-v2 and Glyph-SDXL-v2.
Replacing the original SDXL with a version optimized for human preferences greatly improved visual appeal. Detailed comparisons and user studies support the effectiveness of our method.
Paper: https://arxiv.org/abs/2406.10208
Project: https://github.com/AIGText/Glyph-ByT5
Model: https://huggingface.co/GlyphByT5/Glyph-SDXL-v2
Try it: https://huggingface.co/spaces/GlyphByT5/Glyph-SDXL-v2