NeoBabel: A Multilingual Open Tower for Visual Generation

Tokenizer

Text Tokenization:
NeoBabel adopts the tokenizer of the Gemma-2 model without modification, maintaining compatibility with multilingual inputs while reusing a proven tokenization scheme from language modeling.

Image Tokenization:
NeoBabel leverages the MAGVIT-v2 quantizer retrained by Show-o on 25 million images. This lookup-free quantizer learns a discrete codebook of size K=8,192 and encodes 256×256 resolution images into 16×16 grids of discrete tokens.
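The numbers above fix the per-image token budget. A minimal sketch of the arithmetic implied by the stated setup (codebook size K=8,192, 256×256 inputs, 16×16 token grid):

```python
import math

# Quantities stated in the text for the MAGVIT-v2 quantizer used by Show-o.
codebook_size = 8192      # K, number of discrete codebook entries
image_resolution = 256    # input images are 256x256
grid_size = 16            # images are encoded to a 16x16 grid of tokens

tokens_per_image = grid_size * grid_size            # 256 discrete tokens
downsample_factor = image_resolution // grid_size   # 16x spatial downsampling
bits_per_token = math.log2(codebook_size)           # 13 bits per token

print(tokens_per_image)   # -> 256
print(downsample_factor)  # -> 16
print(bits_per_token)     # -> 13.0
```

So each image costs 256 sequence positions in the transformer, and the quantizer compresses each 16×16 pixel patch into a single 13-bit code.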

Note: While the authors only mention using Gemma-2, alternative tokenizers could be explored, e.g. the SUTRA tokenizer, Xmodel-1.5, T-Free, or even a universal tokenizer.

Architecture:

A mixture-of-experts (MoE) architecture could be introduced in the text-processing and cross-modal alignment components, as follows:

1. Language-Specific Text Experts
  - Different experts could handle different language families (Romance, Germanic, Sino-Tibetan, etc.)
  - Each expert specializes in the linguistic patterns, morphology, and syntax of specific language groups
  - This would be particularly valuable given NeoBabel's support for typologically diverse languages (English, Chinese, Hindi, Persian, Dutch, French)

2. Cross-Modal Alignment Experts
  - Separate experts for mapping different languages to the shared visual space
  - Some languages might need different attention patterns when grounding to visual concepts
  - For example, languages with rich morphology (Hindi) vs. analytic languages (Chinese) might benefit from different alignment strategies

3. Cultural Context Experts
  - Experts that activate based on cultural context indicators in the text
  - Could help with culture-specific visual generation (e.g., "wedding" generating different visual elements based on cultural context)
  - These would operate on the text→visual mapping, not the visual tokens themselves
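The routing idea behind all three expert types above can be sketched with a standard token-level MoE layer: a learned gate scores each token against the experts, and the top-k experts' outputs are mixed by the normalized gate weights. This is a minimal illustrative sketch, not NeoBabel's actual architecture; the expert names, hidden size, and single-matrix experts are all assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical language-family experts (names illustrative, not from the paper).
EXPERTS = ["romance", "germanic", "sino_tibetan", "indo_aryan"]
D = 32  # hidden size of token representations (illustrative)

# Each expert is a single linear map here; the router is a linear gate.
expert_weights = {name: rng.standard_normal((D, D)) * 0.02 for name in EXPERTS}
router_weights = rng.standard_normal((D, len(EXPERTS))) * 0.02

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(tokens, top_k=2):
    """Route each token to its top-k experts and mix their outputs.

    tokens: (n_tokens, D) array of hidden states.
    Returns an array of the same shape.
    """
    gate_probs = softmax(tokens @ router_weights)  # (n_tokens, n_experts)
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top = np.argsort(gate_probs[i])[::-1][:top_k]  # best-scoring experts
        norm = gate_probs[i, top].sum()
        for e_idx in top:
            weight = gate_probs[i, e_idx] / norm       # renormalized gate weight
            out[i] += weight * (tok @ expert_weights[EXPERTS[e_idx]])
    return out

tokens = rng.standard_normal((5, D))
mixed = moe_forward(tokens)
print(mixed.shape)  # -> (5, 32)
```

In a real system the gate would be trained jointly with an auxiliary load-balancing loss, and for the cultural-context variant the router input could be augmented with features extracted from the prompt rather than the raw token state alone.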

NeoBabel Multilingual Datasets

NeoBabel Pretraining Data:
