
Mobile-O

Unified Multimodal Understanding and Generation on Mobile Device

Abdelrahman Shaker1,∗,†, Ahmed Heakl1,∗, Jaseel Muhammad1, Ritesh Thawkar1, Omkar Thawakar1, Senmao Li1,
Hisham Cholakkal1, Ian Reid1, Eric P. Xing1,2, Salman Khan1,†, Fahad Shahbaz Khan1,3,†
1Mohamed bin Zayed University of Artificial Intelligence   2Carnegie Mellon University   3Linköping University
∗Equal Contributions   †Project Leaders
1.6B Total Parameters · ~3s Image Generation (iPhone) · ~0.4s Visual Understanding (iPhone) · <2GB Memory Footprint

Overview

Mobile-O is designed specifically for mobile and edge deployment. It combines a vision-language model with a diffusion-based image generator in a single unified architecture, enabling real-time multimodal understanding (VQA, OCR, reasoning) and high-quality image generation at 512×512 resolution — all with a memory footprint under 2GB.

Key result: Mobile-O generates 512×512 images in ~3 seconds and performs visual understanding in ~0.4 seconds on iPhone, with only 1.6B total parameters.


Mobile-O Overview

Capabilities

🖼️ Image Generation

High-quality text-to-image synthesis at 512×512 resolution using a lightweight DiT decoder.

👁️ Image Understanding

Visual question answering, OCR, and multimodal reasoning powered by FastVLM.

✏️ Image Editing

Instruction-based image editing combining understanding and generation pipelines.

Supported task flows, learned jointly through unified training:

💬 Text → Text
🖼️ Image → Text
Text → Image
✏️ Text + Image → Image
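As a minimal illustration of how these flows map onto the inference entry points documented in the Quick Start below (infer_und.py, infer_gen.py, infer_edit.py), the sketch dispatches a request to the matching script. The dispatcher itself and the checkpoint path are assumptions for illustration; only the script names and flags come from this page.

import subprocess
from typing import Optional

# Hypothetical dispatcher: routes a (prompt, optional image) request to the
# inference scripts listed in the Quick Start section of this page.
SCRIPTS = {
    "understand": "infer_und.py",   # Image -> Text
    "generate": "infer_gen.py",     # Text -> Image
    "edit": "infer_edit.py",        # Text + Image -> Image
}

def run_mobile_o(task: str, model_path: str, prompt: str,
                 image_path: Optional[str] = None) -> None:
    """Build and run the CLI call for the requested task flow."""
    cmd = ["python", SCRIPTS[task], "--model_path", model_path, "--prompt", prompt]
    if image_path is not None:  # understanding and editing take an input image
        cmd += ["--image_path", image_path]
    subprocess.run(cmd, check=True)

if __name__ == "__main__":
    ckpt = "checkpoints/final_merged_model_23620"  # assumed location after the download step
    run_mobile_o("understand", ckpt, "What is in the image?", "assets/cute_cat.png")
    run_mobile_o("generate", ckpt, "A scarlet macaw on a moss-covered branch")
    run_mobile_o("edit", ckpt, "Make the cat wear a hat", "assets/cute_cat.png")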

Architecture

Mobile-O Architecture

Overall architecture of Mobile-O: a unified vision–language–diffusion model for on-device multimodal understanding and generation.

Vision-Language Model

FastVLM-0.5B, which combines the FastViT vision encoder with the Qwen2-0.5B autoregressive language backbone for multimodal understanding.

FastViT + Qwen2-0.5B

Diffusion Decoder

SANA-600M-512, a lightweight linear DiT-style diffusion transformer paired with a VAE encoder-decoder for text-to-image generation at 512×512.

Linear DiT + VAE

Mobile Conditioning Projector

A novel lightweight connector (~2.4M params) that bridges the VLM and the diffusion decoder via layerwise feature fusion with temperature-scaled learnable weights.

~2.4M Parameters
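The exact projector design is described in the paper; below is a minimal sketch of one plausible reading of layerwise feature fusion with temperature-scaled learnable weights, where hidden states from every VLM layer are mixed via a softmax over learnable, temperature-scaled weights and then projected to the diffusion decoder's conditioning width. All dimensions, names, and the final linear projection are illustrative assumptions, not the released implementation.

import torch
import torch.nn as nn

class MobileConditioningProjector(nn.Module):
    """Illustrative sketch (not the released code): fuse per-layer VLM features
    with temperature-scaled learnable weights, then project to the DiT width."""

    def __init__(self, num_layers: int, vlm_dim: int, cond_dim: int, temperature: float = 1.0):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))  # one learnable weight per VLM layer
        self.temperature = temperature
        self.proj = nn.Linear(vlm_dim, cond_dim)                   # map to decoder conditioning dim

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (num_layers, batch, seq_len, vlm_dim) stacked from the VLM
        weights = torch.softmax(self.layer_logits / self.temperature, dim=0)
        fused = torch.einsum("l,lbsd->bsd", weights, hidden_states)  # weighted sum over layers
        return self.proj(fused)                                      # (batch, seq_len, cond_dim)

# Toy shapes only; the real dimensions come from FastVLM-0.5B and SANA-600M-512.
mcp = MobileConditioningProjector(num_layers=24, vlm_dim=896, cond_dim=1152)
dummy = torch.randn(24, 1, 32, 896)
print(mcp(dummy).shape)  # torch.Size([1, 32, 1152])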

Training Pipeline

Stage 1: Cross-Modal Alignment

Pretrain the DiT and MCP on 4M text-image pairs. The visual encoders, LLM, and VAE are frozen.

📦 4M pairs

Stage 2: Supervised Fine-tuning

Fine-tune the DiT and MCP on ~105K curated prompt-image pairs from BLIP3o + ShareGPT-4o.

📦 ~105K pairs

Stage 3: Unified Post-Training

Post-train the DiT, MCP, LLM (LoRA), and visual encoder on ~105K quadruplet samples.

📦 ~105K quadruplets
Training Pipeline

Unified multimodal post-training: jointly optimizing image generation and visual understanding via a multi-task objective.
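The precise loss formulation and weighting are defined in the paper; the snippet below is only a rough sketch of a multi-task objective of this kind, combining a denoising (noise-prediction MSE) loss for the generation branch with next-token cross-entropy for the understanding branch. The simple sum and the lambda_und weight are assumptions for illustration.

import torch
import torch.nn.functional as F

def unified_multitask_loss(noise_pred: torch.Tensor,
                           noise_target: torch.Tensor,
                           text_logits: torch.Tensor,
                           text_labels: torch.Tensor,
                           lambda_und: float = 1.0) -> torch.Tensor:
    """Sketch of a joint objective: generation (diffusion) + understanding (LM).

    noise_pred / noise_target: DiT outputs and targets for the generation branch.
    text_logits: (batch, seq_len, vocab) LM outputs for the understanding branch.
    text_labels: (batch, seq_len) token ids, with -100 marking ignored positions.
    lambda_und: assumed task weighting (not taken from the paper).
    """
    gen_loss = F.mse_loss(noise_pred, noise_target)           # denoising objective
    und_loss = F.cross_entropy(text_logits.flatten(0, 1),     # next-token prediction
                               text_labels.flatten(),
                               ignore_index=-100)
    return gen_loss + lambda_und * und_loss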

Mobile App

Mobile-O runs entirely on-device with no cloud dependency. We release the full source code of the iOS app along with optimized MLX and CoreML model components.

Download on the App Store
📱 iOS App Source Code: Mobile-O-App
🧩 MLX & CoreML Models: 🤗 HuggingFace
⚡ ~3s Image Generation · 👁️ ~0.4s Visual Understanding · 💾 <2GB Memory Footprint

Model Checkpoints

| Model | Total Params | Download |
|---|---|---|
| Mobile-O-0.5B | 1.6B | 🤗 HuggingFace |
| Mobile-O-1.5B | 3.5B | 🤗 HuggingFace |
| Mobile-O-0.5B-iOS | iOS Components | 🤗 HuggingFace |

📦 Training Datasets

| Stage | Description | Download |
|---|---|---|
| Pre-training | 4M text-image pairs (JourneyDB) | 🤗 HuggingFace |
| SFT | ~105K curated prompt-image pairs | 🤗 HuggingFace |
| Post-training | ~105K unified quadruplet samples | 🤗 HuggingFace |

Qualitative Results

Mobile-O Samples

Qualitative examples of understanding, generation, and editing from Mobile-O.

Generation Results

Extended qualitative generation samples from Mobile-O.

Generation Comparison

Side-by-side generation comparison with other models.

Understanding Comparison

Side-by-side understanding comparison with other models.

Quick Start

Install

conda create -n mobileo python=3.12 -y
conda activate mobileo
pip install -r requirements.txt

Download Checkpoint

python -c "from huggingface_hub import snapshot_download; print(snapshot_download(repo_id='Amshaker/Mobile-O-0.5B', repo_type='model', local_dir='checkpoints', allow_patterns=['final_merged_model_23620/*']))"

Image Understanding

python infer_und.py --model_path /path/to/checkpoint/ --image_path assets/cute_cat.png --prompt "What is in the image?"

Image Generation

python infer_gen.py --model_path /path/to/checkpoint/ --prompt "A vibrant tropical rainforest scene with a scarlet macaw perched on a moss-covered branch"

Image Editing

python infer_edit.py --model_path /path/to/checkpoint/ --image_path assets/cute_cat.png --prompt "Make the cat wear a hat"

Citation

@article{shaker2026mobileo,
  title={Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device},
  author={Shaker, Abdelrahman and Heakl, Ahmed and Muhammad, Jaseel and Thawkar, Ritesh and Thawakar, Omkar and Li, Senmao and Cholakkal, Hisham and Reid, Ian and Xing, Eric P. and Khan, Salman and Khan, Fahad Shahbaz},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2026}
}

Acknowledgements

This repository is partially built upon BLIP3o. We thank all the contributors for their great efforts.

License

⚠️ This project is released under a non-commercial research license. The code, models, datasets, and app are intended solely for academic and research purposes. Please refer to the LICENSE file for full details.