No Code, No Cloud: On-Device Mockup-to-Code with Lightweight Vision-Language AI

Bridging the gap between visual design and functional code remains a persistent challenge in modern UI workflows, especially for small teams and non-programmers. Existing solutions, such as Figma-to-code tools and recent vision-language models (VLMs), often depend on proprietary cloud APIs or large-scale architectures, limiting offline operation, privacy, and control. We present LiteViT5, a lightweight, on-device vision-language model that generates HTML directly from images of design mockups, enabling private, no-code prototyping without cloud infrastructure. Built on a compact ViT–T5 encoder–decoder framework with 235M parameters, LiteViT5 achieves competitive results on both in-distribution (WebSight) and out-of-distribution (Design2Code) benchmarks. We evaluate its performance on structure, position, color, and CLIP-based similarity metrics and show that it performs comparably to models 10–30× larger, such as PaliGemma-3B, LLaVA-7B, and DeepSeek-VL-7B. We further evaluate LiteViT5 in a user study with 24 participants, who rated perceived accuracy, code quality, and editability. Our findings show that LiteViT5 supports rapid design iteration and reduces reliance on developer handoff, making it a practical, assistive tool for democratizing web interface creation. This work highlights the potential of efficient, human-centered generative AI to empower interface design beyond expert-only workflows. To support transparency and reproducibility, we release LiteViT5 as an open-source model on Hugging Face: https://huggingface.co/LiteVit5/model.

Model Architecture

  • Vision Encoder: SigLIP2 (frozen)
  • Vision Processing: Multi-view fusion
  • Seq2Seq Decoder: CodeT5-based decoder with language modeling head
  • Input: Images (5 views per sample: 4 quarter views + 1 full view)
  • Output: Generated HTML
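
The exact multi-view fusion and projection layers ship with the model's remote code, but a minimal, hypothetical sketch of how the components above could fit together looks roughly like this (the class, method names, and the simple concatenation fusion are illustrative assumptions, not the released implementation):

import torch
import torch.nn as nn
from transformers import AutoModel, T5ForConditionalGeneration

class LiteViT5Sketch(nn.Module):
    """Illustrative wiring only; the released model's layers may differ."""

    def __init__(self):
        super().__init__()
        # Frozen SigLIP2 vision encoder
        self.vision = AutoModel.from_pretrained("google/siglip2-base-patch16-512").vision_model
        self.vision.requires_grad_(False)
        # CodeT5-based seq2seq decoder with language modeling head
        self.seq2seq = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")
        # Project visual tokens into the T5 embedding space
        self.proj = nn.Linear(self.vision.config.hidden_size, self.seq2seq.config.d_model)

    def fuse_views(self, pixel_values):
        # pixel_values: [5, 3, 512, 512] (4 quarter views + 1 full view)
        feats = self.vision(pixel_values=pixel_values).last_hidden_state  # [5, tokens, hidden]
        feats = self.proj(feats)
        # Naive multi-view fusion: concatenate the token sequences of all views
        return feats.reshape(1, -1, feats.size(-1))  # [1, 5 * tokens, d_model]

    @torch.no_grad()
    def generate(self, pixel_values, max_length=512):
        fused = self.fuse_views(pixel_values)
        # Feed the fused visual tokens to the seq2seq stack as encoder input embeddings
        return self.seq2seq.generate(inputs_embeds=fused, max_length=max_length)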

Installation

uv add transformers torch accelerate
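
If you are not using uv, the equivalent pip command installs the same dependencies:

pip install transformers torch accelerate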

Usage

Loading the Model

from transformers import AutoModel, AutoTokenizer
from transformers import SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True)

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")
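
After loading, a quick sanity check on the parameter count (the abstract reports roughly 235M parameters) helps confirm the weights resolved as expected:

num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {num_params / 1e6:.0f}M")  # expect something on the order of 235M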

Inference Example

from PIL import Image
import torch

from transformers import AutoModel, AutoTokenizer
from transformers import SiglipProcessor

# Load the model
model = AutoModel.from_pretrained("LiteVit5/model", trust_remote_code=True, device_map="auto")

# Load tokenizer and processor
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
processor = SiglipProcessor.from_pretrained("google/siglip2-base-patch16-512")

# Preprocess image (split into 4 parts + full image = 5 views)
def prepare_image(image_path: str, processor):
    """
    Prepare image with 5 views (4 quarters + full).

    Args:
        image_path: Path to the image file
        processor: SigLIP processor

    Returns:
        Tensor of shape [5, 3, 512, 512]
    """
    image = Image.open(image_path).convert("RGB")

    # Split into 4 quarters
    width, height = image.size
    quarters = [
        image.crop((0, 0, width//2, height//2)),           # top-left
        image.crop((width//2, 0, width, height//2)),       # top-right
        image.crop((0, height//2, width//2, height)),      # bottom-left
        image.crop((width//2, height//2, width, height)),  # bottom-right
    ]

    # Process all views
    processed = [
        processor(images=q, return_tensors="pt")["pixel_values"]
        for q in quarters
    ]
    # Add full image
    processed.append(
        processor(images=image, return_tensors="pt")["pixel_values"]
    )

    pixel_values = torch.cat(processed, dim=0)
    return pixel_values

def generate_text(model, pixel_values, tokenizer, max_length=512):
    """
    Generate text from image.

    Args:
        model: LiteVit5 model
        pixel_values: Preprocessed image tensor
        tokenizer: Tokenizer for decoding
        max_length: Maximum generation length

    Returns:
        Generated text string
    """
    with torch.no_grad():
        output_ids = model.generate(pixel_values, max_length=max_length)

    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return text

device = next(model.parameters()).device

# Prepare the 5-view input and move it to the model's device
pixel_values = prepare_image("./image_13.png", processor)
pixel_values = pixel_values.to(device)

print("\nGenerating HTML from image_13.png...")
text = generate_text(model, pixel_values, tokenizer, max_length=2024)
print(f"Generated: {text}")