No Code, No Cloud: On-Device Mockup-to-Code with Lightweight Vision-Language AI
Bridging the gap between visual design and functional code remains a persistent challenge in modern UI workflows, especially for small teams and non-programmers. Existing solutions, such as Figma-to-code tools and recent vision-language models (VLMs), often depend on proprietary cloud APIs or large-scale architectures, limiting offline operation, privacy, and control. We present LiteViT5, a lightweight, on-device vision-language model that generates HTML directly from images of design mockups, enabling private, no-code prototyping without cloud infrastructure. Built on a compact ViT–T5 encoder–decoder framework with 235M parameters, LiteViT5 achieves competitive results on both in-distribution (WebSight) and out-of-distribution (Design2Code) benchmarks. Evaluated on structure, position, color, and CLIP-based similarity metrics, it performs comparably to models 10–30× larger, such as PaliGemma-3B, LLaVA-7B, and DeepSeek-VL-7B. We further assess LiteViT5 in a user study with 24 participants who rated perceived accuracy, code quality, and editability. Our findings show that LiteViT5 supports rapid design iteration and reduces reliance on developer handoff, making it a practical, assistive tool for democratizing web interface creation. This work highlights the potential of efficient, human-centered generative AI to empower interface design beyond expert-only workflows. To support transparency and reproducibility, we release LiteViT5 as an open-source model on Hugging Face: https://huggingface.co/LiteVit5/model.
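For context on the CLIP-based similarity metric mentioned above: it compares the input mockup with a screenshot of the rendered output in CLIP embedding space. The snippet below is only a minimal sketch of such a check; the CLIP checkpoint (openai/clip-vit-base-patch32) and the file names mockup.png / render.png are assumptions, not taken from this card or the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical CLIP-similarity check; the paper's exact evaluation setup is not documented here.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(mockup_path: str, render_path: str) -> float:
    """Cosine similarity between CLIP image embeddings of the mockup and the rendered screenshot."""
    images = [Image.open(p).convert("RGB") for p in (mockup_path, render_path)]
    inputs = clip_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return float((feats[0] * feats[1]).sum())

print(clip_similarity("mockup.png", "render.png"))
```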
Model Architecture
- Vision Encoder: SigLIP2 (frozen)
- Vision Processing: Multi-view fusion (see the sketch after this list)
- Seq2Seq Decoder: CodeT5-based decoder with a language modeling head
- Input: Images, 5 views per sample (4 quarter views + 1 full view)
- Output: Generated HTML
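As a rough illustration of how these components could fit together, the class below is a minimal, hypothetical sketch, not the released implementation: each of the 5 views is encoded by the frozen SigLIP2 vision tower, the per-view patch embeddings are concatenated as a naive stand-in for the multi-view fusion, projected to the decoder width, and fed to the CodeT5 decoder. The class name MockLiteViT5 and the exact fusion are assumptions; the real wiring ships with the model via trust_remote_code.

```python
import torch.nn as nn
from transformers import AutoModel, T5ForConditionalGeneration
from transformers.modeling_outputs import BaseModelOutput


class MockLiteViT5(nn.Module):
    """Hypothetical sketch of the ViT-T5 wiring, not the released implementation."""

    def __init__(self):
        super().__init__()
        # Frozen SigLIP2 vision tower (assumes the fixed-resolution checkpoint exposes .vision_model)
        self.vision = AutoModel.from_pretrained("google/siglip2-base-patch16-512").vision_model
        for p in self.vision.parameters():
            p.requires_grad = False
        # CodeT5 decoder with LM head; its own text encoder is bypassed via encoder_outputs
        self.decoder = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
        # Project the vision hidden size to the decoder width
        self.proj = nn.Linear(self.vision.config.hidden_size, self.decoder.config.d_model)

    def forward(self, pixel_values, labels=None):
        # pixel_values: [5, 3, 512, 512] -> per-view patch embeddings [5, P, H]
        feats = self.vision(pixel_values=pixel_values).last_hidden_state
        # Naive "fusion": concatenate the 5 views along the sequence dimension -> [1, 5*P, d_model]
        fused = self.proj(feats.reshape(1, -1, feats.size(-1)))
        # Decode HTML tokens conditioned on the fused visual sequence
        return self.decoder(
            encoder_outputs=BaseModelOutput(last_hidden_state=fused),
            labels=labels,
        )
```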
Installation
```bash
uv add transformers torch accelerate
```
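If the project is not managed with uv, the same dependencies can be installed with pip:

```bash
pip install transformers torch accelerate
```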
Usage
Loading the Model
```python
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True)

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")
```
Inference Example
```python
from PIL import Image
import torch
from transformers import AutoModel, AutoTokenizer, SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True, device_map="auto")

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")


def prepare_image(image_path: str, processor):
    """
    Prepare an image as 5 views (4 quarters + full image).

    Args:
        image_path: Path to the image file
        processor: SigLIP processor

    Returns:
        Tensor of shape [5, 3, 512, 512]
    """
    image = Image.open(image_path).convert("RGB")

    # Split into 4 quarters
    width, height = image.size
    quarters = [
        image.crop((0, 0, width // 2, height // 2)),          # top-left
        image.crop((width // 2, 0, width, height // 2)),       # top-right
        image.crop((0, height // 2, width // 2, height)),      # bottom-left
        image.crop((width // 2, height // 2, width, height)),  # bottom-right
    ]

    # Process the quarter views
    processed = [
        processor(images=q, return_tensors="pt")["pixel_values"]
        for q in quarters
    ]

    # Add the full image as the fifth view
    processed.append(
        processor(images=image, return_tensors="pt")["pixel_values"]
    )

    # Stack into a single [5, 3, 512, 512] tensor
    pixel_values = torch.cat(processed, dim=0)
    return pixel_values


def generate_text(model, pixel_values, tokenizer, max_length=512):
    """
    Generate HTML from a preprocessed image tensor.

    Args:
        model: LiteVit5 model
        pixel_values: Preprocessed image tensor
        tokenizer: Tokenizer for decoding
        max_length: Maximum generation length

    Returns:
        Generated text string
    """
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=max_length)
    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return text


# Move the input to the same device as the model
device = next(model.parameters()).device

# Preprocess the mockup image (5 views)
pixel_values = prepare_image("./image_13.png", processor).to(device)

print("\nGenerating HTML from image_13.png...")
text = generate_text(model, pixel_values, tokenizer, max_length=2024)
print(f"Generated: {text}")
```