NeoBabel: A Multilingual Open Tower for Visual Generation
Tokenizer
Text Tokenization:
NeoBabel adopts the tokenizer of the Gemma-2 model without any modifications. Reusing it keeps the model compatible with multilingual inputs and relies on a tokenizer already proven in language modeling.
Image Tokenization:
NeoBabel leverages the MAGVIT-v2 quantizer retrained by Show-o on 25 million images. This lookup-free quantizer learns a discrete codebook of size K=8,192 and encodes 256×256 resolution images into 16×16 grids of discrete tokens.
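The numbers above can be checked with a small sketch. It assumes the usual lookup-free quantization scheme, in which each latent dimension is binarized by its sign and the resulting bits form the token index, so a codebook of K=8,192 corresponds to 13 binary dimensions. The function name `lfq_index`, the bit ordering, and the sign convention are illustrative assumptions, not details taken from NeoBabel or Show-o.

```python
# Illustrative sketch only: names, bit order, and sign convention are assumptions.
CODEBOOK_SIZE = 8192   # K, from the text
IMAGE_SIZE = 256       # input resolution (pixels per side), from the text
GRID_SIZE = 16         # token grid (tokens per side), from the text

# A lookup-free codebook of size K = 2^d needs d binary latent dimensions.
NUM_BITS = CODEBOOK_SIZE.bit_length() - 1      # 13
DOWNSAMPLE = IMAGE_SIZE // GRID_SIZE           # 16 pixels per token side
TOKENS_PER_IMAGE = GRID_SIZE * GRID_SIZE       # 256 tokens per image


def lfq_index(latent):
    """Map a real-valued latent vector to a token index.

    Each dimension contributes one bit: 1 if the value is positive, else 0.
    No codebook lookup (nearest-neighbor search) is involved.
    """
    index = 0
    for bit, value in enumerate(latent):
        if value > 0:
            index |= 1 << bit
    return index


if __name__ == "__main__":
    print(NUM_BITS, DOWNSAMPLE, TOKENS_PER_IMAGE)
    print(lfq_index([0.5, -0.2, 0.1] + [-1.0] * 10))
```

Under these assumptions, every 16×16 pixel patch is summarized by one 13-bit token, and a full image becomes a sequence of 256 such tokens.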
Note: The authors report using only the Gemma-2 tokenizer, but other tokenizers could also be tried, e.g., the SUTRA tokenizer, Xmodel-1.5, T-Free, or even the universal tokenizer.
Architecture:
A mixture-of-experts (MoE) design could be introduced in the text-processing and cross-modal alignment components, as follows:
1. Language-Specific Text Experts
- Different experts could handle different language families (Romance, Germanic, Sino-Tibetan, etc.)
- Each expert specializes in the linguistic patterns, morphology, and syntax of specific language groups
- This would be particularly valuable given NeoBabel's support for typologically diverse languages (English, Chinese, Hindi, Persian, Dutch, French)
2. Cross-Modal Alignment Experts
- Separate experts for mapping different languages to the shared visual space
- Some languages might need different attention patterns when grounding to visual concepts
- For example, languages with rich morphology (Hindi) vs. analytic languages (Chinese) might benefit from different alignment strategies
3. Cultural Context Experts
- Experts that activate based on cultural context indicators in the text
- Could help with culture-specific visual generation (e.g., "wedding" generating different visual elements based on cultural context)
- These would operate on the text→visual mapping, not the visual tokens themselves
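The expert types listed above all rely on a router that picks a few experts per token and mixes their outputs. The following is a minimal sketch of standard top-k softmax gating, not NeoBabel's architecture (the paper does not describe an MoE). The expert names, the grouping by language family, and the functions `route_top_k` and `moe_output` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical expert names, loosely following the language families above.
EXPERTS = ["romance", "germanic", "sino_tibetan", "indo_iranian"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return the k highest-scoring experts with weights renormalized to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(EXPERTS[i], probs[i] / norm) for i in top]


def moe_output(token_vec, router_logits, expert_fns, k=2):
    """Combine the outputs of the selected experts, weighted by the router."""
    out = [0.0] * len(token_vec)
    for name, weight in route_top_k(router_logits, k):
        expert_out = expert_fns[name](token_vec)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


if __name__ == "__main__":
    # Toy experts: each one just scales the input by a different factor.
    toy_experts = {
        name: (lambda vec, scale=i: [scale * x for x in vec])
        for i, name in enumerate(EXPERTS)
    }
    logits = [2.0, 0.1, 1.0, -1.0]  # made-up router scores for one token
    print(route_top_k(logits, k=2))
    print(moe_output([1.0, 2.0], logits, toy_experts, k=2))
```

In a real model the router logits would come from a learned linear layer over the token's hidden state, and each expert would be a feed-forward block. A cultural-context expert could reuse the same routing, with logits driven by context cues in the prompt rather than by language identity.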
NeoBabel Multilingual Datasets
NeoBabel Pretraining Data:
- m-ImageNet-1K (not released yet): generated from ImageNet-1K
- m-SA-1B and m-CC12M (not released yet): generated from SA-1B and CC12M
- m-LAION-Aesthetic (not released yet): generated from LAION-Aesthetics
- m-JourneyDB (not released yet): generated from JourneyDB
Reference
NeoBabel: A Multilingual Open Tower for Visual Generation