Mobile-O is designed specifically for mobile and edge deployment. It combines a vision-language model with a diffusion-based image generator in a single unified architecture, enabling real-time multimodal understanding (VQA, OCR, reasoning) and high-quality image generation at 512×512 resolution — all with a memory footprint under 2GB.
Key result: Mobile-O generates 512×512 images in ~3 seconds and performs visual understanding in ~0.4 seconds on iPhone, with only 1.6B total parameters.
- High-quality text-to-image synthesis at 512×512 resolution using a lightweight DiT decoder
- Visual question answering, OCR, and multimodal reasoning powered by FastVLM
- Instruction-based image editing combining the understanding and generation pipelines
Overall architecture of Mobile-O: a unified vision–language–diffusion model for on-device multimodal understanding and generation.
Mobile-O consists of three components:

- **FastVLM-0.5B** (FastViT + Qwen2-0.5B): FastViT as the vision encoder and Qwen2-0.5B as the autoregressive language backbone for multimodal understanding.
- **SANA-600M-512** (Linear DiT + VAE): a lightweight linear DiT-style diffusion transformer paired with a VAE encoder-decoder for text-to-image generation at 512×512.
- **MCP connector** (~2.4M parameters): a novel lightweight connector bridging the VLM and the diffusion decoder via layerwise feature fusion with temperature-scaled learnable weights (see the sketch below).
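The fusion mechanism can be pictured roughly as follows. This is an illustrative sketch only; the module name, dimensions, and projection layout are assumptions, not the released Mobile-O implementation:

```python
import torch
import torch.nn as nn

class LayerwiseFusionConnector(nn.Module):
    """Sketch of a lightweight connector that fuses hidden states from several
    VLM layers using temperature-scaled learnable weights. Names and shapes
    are assumptions for illustration, not the repo's actual code."""

    def __init__(self, num_layers: int, vlm_dim: int, dit_dim: int, temperature: float = 0.1):
        super().__init__()
        # One learnable logit per selected VLM layer; a temperature-scaled softmax
        # turns the logits into fusion weights over the layers.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.temperature = temperature
        # Small projection into the diffusion decoder's conditioning space.
        self.proj = nn.Linear(vlm_dim, dit_dim)

    def forward(self, hidden_states: list[torch.Tensor]) -> torch.Tensor:
        # hidden_states: one [batch, seq_len, vlm_dim] tensor per selected layer.
        stacked = torch.stack(hidden_states, dim=0)                # [L, B, T, D]
        weights = torch.softmax(self.layer_logits / self.temperature, dim=0)
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # weighted sum over layers
        return self.proj(fused)                                    # conditioning for the DiT
```

With a small temperature the softmax sharpens toward the most informative layers; with a large one it approaches a uniform average over layers.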
Training proceeds in three stages:

1. **Pre-training** (📦 4M pairs): pretrain the DiT and MCP on 4M text-image pairs; the visual encoders, LLM, and VAE are frozen.
2. **SFT** (📦 ~105K pairs): finetune the DiT and MCP on ~105K curated prompt-image pairs from BLIP3o + ShareGPT-4o.
3. **Post-training** (📦 ~105K quadruplets): post-train the DiT, MCP, LLM (LoRA), and visual encoder on ~105K quadruplet samples.
Unified multimodal post-training: jointly optimizing image generation and visual understanding via a multi-task objective.
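The exact weighting of the multi-task objective is not spelled out here; as a rough picture (the form and the weight λ below are assumptions), the post-training loss can be read as a weighted sum of a generation term and an understanding term:

$$\mathcal{L}_{\text{post}} = \mathcal{L}_{\text{gen}} + \lambda \, \mathcal{L}_{\text{und}}$$

where the generation term is the diffusion denoising loss on the image-generation branch and the understanding term is the autoregressive cross-entropy on the text branch.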
Mobile-O runs entirely on-device with no cloud dependency. We release the full source code of the iOS app along with optimized MLX and CoreML model components.
| Model | Total Params | Download |
|---|---|---|
| Mobile-O-0.5B | 1.6B | 🤗 HuggingFace |
| Mobile-O-1.5B | 3.5B | 🤗 HuggingFace |
| Mobile-O-0.5B-iOS | iOS Components | 🤗 HuggingFace |
| Stage | Description | Download |
|---|---|---|
| Pre-training | 4M text-image pairs (JourneyDB) | 🤗 HuggingFace |
| SFT | ~105K curated prompt-image pairs | 🤗 HuggingFace |
| Post-training | ~105K unified quadruplet samples | 🤗 HuggingFace |
Qualitative examples of understanding, generation, and editing from Mobile-O
Extended qualitative generation samples from Mobile-O
Side-by-side generation comparison with other models
Side-by-side understanding comparison with other models
Install
Download Checkpoint
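One way to fetch a checkpoint is via `huggingface_hub`; the repo id below is a placeholder, so substitute the repository linked in the table above:

```python
from huggingface_hub import snapshot_download

# Placeholder repo id -- replace with the Mobile-O repository from the download table.
snapshot_download(repo_id="<org>/Mobile-O-0.5B", local_dir="checkpoints/Mobile-O-0.5B")
```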
Image Understanding
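A minimal usage sketch, assuming a hypothetical Python wrapper around the released checkpoints; the actual entry points and argument names in this repo may differ:

```python
# Hypothetical usage sketch -- `mobile_o`, `MobileO`, and `understand` are
# illustrative names, not the repo's confirmed API.
from mobile_o import MobileO

model = MobileO.from_pretrained("checkpoints/Mobile-O-0.5B")
answer = model.understand(
    image="assets/demo.jpg",                       # hypothetical example image path
    prompt="What does the sign in this image say?",
)
print(answer)
```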
Image Generation
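A corresponding sketch for text-to-image generation, again with hypothetical names for the wrapper and its methods:

```python
# Hypothetical usage sketch -- `generate` and its arguments are assumptions.
from mobile_o import MobileO

model = MobileO.from_pretrained("checkpoints/Mobile-O-0.5B")
image = model.generate("a watercolor painting of a lighthouse at sunset",
                       height=512, width=512)
image.save("lighthouse.png")
```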
Image Editing
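Instruction-based editing conditions generation on an input image plus an edit instruction; the sketch below uses the same hypothetical wrapper:

```python
# Hypothetical usage sketch -- `edit` and its arguments are assumptions.
from mobile_o import MobileO

model = MobileO.from_pretrained("checkpoints/Mobile-O-0.5B")
edited = model.edit(image="assets/demo.jpg",
                    instruction="make the sky look like a clear summer day")
edited.save("demo_edited.png")
```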
@article{shaker2026mobileo,
title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
journal={arXiv preprint arXiv:XXXX.XXXXX},
year={2026}
}
This repo is partially built upon BLIP3o. Thanks to all the contributors for their great efforts.
This project is released under a non-commercial research license. The code, models, datasets, and app are intended solely for academic and research purposes. Please refer to the LICENSE file for full details.