Show-o, don’t tell: new unified AI model does it all

A compact AI model understands images, generates visuals and weaves language, all within a single framework.
In early 2024, Assistant Professor Mike Shou was repeatedly met with the same question from peers in the generative AI community: Should a model use step-by-step prediction or a gradual refinement process to generate content?
Both approaches had been effective. One, autoregressive modelling, underpinned most large language models, predicting each word from all the words that came before it; the other, diffusion modelling, had become popular in image generation, refining visual noise into a clear picture over several stages. But to Asst Prof Shou, the question wasn’t which method to choose. It was why they were treated as mutually exclusive in the first place.
“There’s no fundamental reason they should conflict. So we started thinking: what if we could combine them?” recalls Asst Prof Shou.

That idea became the impetus for Show-o, a unified AI model designed by Asst Prof Shou and his team at the Department of Electrical and Computer Engineering, College of Design and Engineering, National University of Singapore. Rather than specialising in a single task, Show-o moves fluidly between analysing content and producing it. From understanding images and answering questions, to generating pictures from text and editing visuals with prompts, and even composing video-like sequences with matching captions, Show-o does it all using just one compact model.
“The name of the model is a nod to our lab, Show Lab, but the ‘o’ adds a layer. It stands for ‘omni’, which represents the idea of an all-in-one model that can handle any combination of tasks,” says Asst Prof Shou.
The team’s work, published as a conference paper at the International Conference on Learning Representations (ICLR) 2025, introduces a practical architecture for bridging two widely used generative strategies and, in doing so, points towards a more versatile kind of AI.
The design is based on an elegantly simple idea: use the best method for each kind of task. For multimodal understanding, Show-o builds its responses one token at a time, in the autoregressive style of today’s large language models. For visual generation, it takes a different route. It starts from a partially “masked” version of the image and fills in the missing parts over a series of steps, guided by context. This image generation process is a stripped-down version of what’s known as discrete diffusion modelling, which is faster than generating an image autoregressively, token by token, and easier to integrate into large language models than continuous diffusion modelling.
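For readers who want to see the contrast concretely, here is a minimal sketch in Python of the two decoding styles described above. Everything in it is illustrative: the ToyModel class, its predict_next and predict_masked methods, and the toy vocabulary are invented for this example and are not taken from the Show-o codebase, where a single trained transformer plays both roles.

```python
import random

MASK = "<mask>"
VOCAB = ["red", "cat", "sky", "tree", "blue", "dog"]

class ToyModel:
    """Stand-in for a single transformer; a real model scores tokens with learned weights."""
    def predict_next(self, tokens):
        # Pretend this returns the most likely next token given everything so far.
        return random.choice(VOCAB)

    def predict_masked(self, tokens):
        # Pretend this predicts every masked position at once, given the visible context.
        return [random.choice(VOCAB) if t == MASK else t for t in tokens]

def autoregressive_decode(model, prompt, max_new_tokens=8):
    """Text answers: one token at a time, each conditioned on all the tokens before it."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model.predict_next(tokens))
    return tokens

def masked_decode(model, num_image_tokens=16, steps=4):
    """Images: start fully masked, then reveal a batch of tokens at each refinement step."""
    tokens = [MASK] * num_image_tokens
    for step in range(steps):
        guesses = model.predict_masked(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Unmask an equal share of positions per step; masked-diffusion models keep the
        # most confident predictions first, but a fixed share keeps the sketch simple.
        for i in masked[: max(1, len(masked) // (steps - step))]:
            tokens[i] = guesses[i]
    return tokens

if __name__ == "__main__":
    model = ToyModel()
    print(autoregressive_decode(model, ["a", "photo", "of"]))  # sequential, text-style
    print(masked_decode(model))                                # parallel, image-style
```

The contrast is the point: the first loop commits to a single token per pass, while the second fills in many masked positions per pass, which is why the masked route needs far fewer passes to complete an image made up of thousands of tokens.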
The novelty of the team’s approach is that these two processes coexist within a single system. Most existing models require separate components — one for visual comprehension, another for visual generation — often stitched together in cumbersome ways. Show-o eschews this. “Everything, from training to inference, runs through a unified framework. There’s no switching between subsystems or relying on pre-processing pipelines,” adds Asst Prof Shou.
That culminates in a model that can handle a wide range of tasks with minimal fuss. Caption an image? Show-o at your service. Generate an image from a sentence? Easily done. Extend the edge of a photo into imagined space, or replace one object with another using just a prompt? No problem. It even supports video-style applications, such as generating a step-by-step visual sequence from a cooking instruction, complete with corresponding text.
Lean and mean
Despite its breadth, Show-o remains remarkably lean. At just 1.3 billion parameters, it is far smaller than many of today’s flagship models. For comparison, GPT-4 is estimated to have around 1.5 trillion.
“But sometimes big things come in small packages. Our model can outperform much larger systems on standard benchmarks,” says Asst Prof Shou. For example, in tasks like visual question answering and text-to-image generation, Show-o matches or exceeds models many times its size. It also requires significantly fewer steps to generate high-quality images compared to fully sequential methods, making it faster and more efficient.
The model’s compact, all-purpose design, available as an open-source codebase, has already spurred follow-up research from major research labs, including those at MIT, Nvidia, Meta and NYU. As the first to unify two major generative strategies (stepwise prediction and iterative refinement) in a single network, Show-o has helped reshape how researchers think about building future foundation models.
“Today’s models like Gemini (Google) can analyse a video and answer questions about it, while Sora (OpenAI) can generate realistic video from a text prompt,” adds Asst Prof Shou. “Show-o unifies both capabilities of multimodal understanding and generation into one single model. Thanks to this design, our model’s inputs and outputs can be any combination and order of visual and textual tokens, flexibly supporting a wide range of multimodal tasks, from visual question-answering to text-to-image or video generation to mixed-modality generation.”
The team has since developed a newer model, Show-o2, which delivers improved performance across the board while remaining compact at around two billion parameters. A larger version with around eight billion parameters is also available now, offering stronger performance.
“Ultimately, our goal wasn’t to make a huge model,” says Asst Prof Shou. “We wanted to build a smarter one — a model that could do more.”
Turns out, a jack of all trades can be a master too.