Multimodal AI: A Guide to Open-Source Vision Language Models

October 11, 2024 • Written By Sherlock Xu

You wake up in the morning, and there is another new AI model making headlines. It’s not surprising anymore, right? These days, it feels like a new model drops every other day, each promising to be more powerful than the last.

Take Llama 3.2 Vision, the first multimodal models in Meta’s open-source Llama series. They push the boundaries beyond text understanding to include images as well. But don’t get it twisted: multimodal AI is about more than just images and text. These models can process multiple types of information, from images and audio to video and text. And it’s not just open-source AI; proprietary models like GPT-4 have already expanded their capabilities by integrating these modalities.

Compared with proprietary models, open-source models remain a favorite for those looking for more secure, affordable, and customizable solutions. In this blog post, we’re introducing some of the most popular open-source multimodal models available today.

Since the world of multimodal AI is broad, we will focus on vision language models (VLMs), which are designed to understand and process both visual and textual information. We will also answer some common questions about VLMs.

Llama 3.2 Vision

Llama 3.2 Vision, developed by Meta, is a collection of multimodal LLMs designed to process both text and images. Available in 11B and 90B parameter sizes, Llama 3.2 Vision outperforms many open-source and proprietary models on image-text tasks.

To support image input, Meta integrates a pre-trained image encoder into the language model using adapters, which connect image data to the text-processing layers. This allows the models to handle both image and text inputs simultaneously.
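
If you want to try this locally, below is a minimal inference sketch using Hugging Face Transformers (4.45 or later), assuming you have access to the gated 11B Instruct weights; the image URL and prompt are placeholders to replace with your own.

    import requests
    import torch
    from PIL import Image
    from transformers import MllamaForConditionalGeneration, AutoProcessor

    model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

    # The processor prepares both the image and the chat-formatted text
    model = MllamaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # Placeholder image; replace with your own
    image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

    messages = [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe what this image shows."},
        ]}
    ]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(image, prompt, return_tensors="pt").to(model.device)

    output = model.generate(**inputs, max_new_tokens=128)
    print(processor.decode(output[0], skip_special_tokens=True))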

Key features:

  • Multimodal capabilities: Llama 3.2 Vision can perform image-text tasks, including generating captions, answering image-based questions, and complex visual reasoning.
  • Strong performance: Both the 11B and 90B versions outperform proprietary models like Claude 3 Haiku in tasks involving chart and diagram understanding.
  • Customizability: You can fine-tune Llama 3.2 Vision models for custom applications using Torchtune.

Points to be cautious about:

  • Math reasoning: According to Meta’s benchmark, Llama 3.2 Vision still shows room for improvement in math-heavy tasks, especially the 11B version.
  • Language support: Although the models support multiple languages such as German, French, and Italian for text-only tasks, only English is supported for image+text applications.

To deploy Llama 3.2 Vision, check out our blog post or simply run openllm serve llama3.2:11b-vision with OpenLLM.
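
Once the server is up, you can query it over its OpenAI-compatible API. The sketch below assumes the default endpoint at http://localhost:3000/v1; adjust the base URL and image URL for your deployment.

    from openai import OpenAI

    # OpenLLM exposes an OpenAI-compatible API; port 3000 is assumed here
    client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")

    # Look up the served model name instead of hard-coding it
    model_id = client.models.list().data[0].id

    response = client.chat.completions.create(
        model=model_id,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }],
    )
    print(response.choices[0].message.content)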

NVLM 1.0

NVLM is a family of multimodal LLMs developed by NVIDIA, representing a frontier-class approach to VLMs. It achieves state-of-the-art results in tasks that require a deep understanding of both text and images. The first public iteration, NVLM 1.0, rivals top proprietary models like GPT-4o, as well as open-access models like Llama 3-V 405B.

Key features:

  • Distinct architectures: The NVLM 1.0 family consists of three unique architectures for different use cases.

    • NVLM-D: A decoder-only architecture that provides unified multimodal reasoning and performs better at OCR-related tasks.
    • NVLM-X: A cross-attention-based architecture that is computationally efficient, particularly when handling high-resolution images.
    • NVLM-H: A hybrid architecture that combines the strengths of both the decoder-only and cross-attention approaches. It delivers superior performance in multimodal reasoning and image processing.
  • Powerful image reasoning: NVLM 1.0 surpasses many proprietary and open-source models in tasks such as OCR, multimodal reasoning, and high-resolution image handling. It demonstrates exceptional scene understanding capability. According to the sample image provided by NVIDIA, it is able to identify potential risks and suggest actions based on visual input.

    nvlm-demo-multi-image.png
    Image Source: NVLM: Open Frontier-Class Multimodal LLMs
  • Improved text-only performance: NVIDIA researchers observed that while open multimodal LLMs often achieve strong results in vision language tasks, their performance tends to degrade in text-only tasks. Therefore, they developed “production-grade multimodality” for the NVLM models. This enables NVLM models to excel in both vision language tasks and text-only tasks (average accuracy increased by 4.3 points after multimodal training).

Points to be cautious about:

  • There is still much more to explore with NVLM 1.0. At the time of writing, NVIDIA has only released NVLM-1.0-D-72B, the decoder-only model weights and code for the community.
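
If you want to experiment with the released checkpoint, a minimal loading sketch with Hugging Face Transformers might look like the following. The model ships its own modeling and preprocessing code, so trust_remote_code is required; the exact inference interface is defined by that custom code, and a 72B model realistically needs a multi-GPU setup.

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_id = "nvidia/NVLM-D-72B"

    # trust_remote_code pulls in NVIDIA's custom model and image-preprocessing code
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        device_map="auto",  # spreads the 72B weights across available GPUs
        trust_remote_code=True,
    )

    # How images and prompts are passed (e.g., a chat-style helper) is defined by
    # the repository's custom code, so follow the model card for the exact call.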

Molmo

Molmo is a family of open-source VLMs developed by the Allen Institute for AI. Available in 1B, 7B, and 72B parameters, Molmo models deliver state-of-the-art performance for their class. According to the benchmarks, they can perform on a par with proprietary models like GPT-4V, Gemini 1.5 Pro, and Claude 3.5 Sonnet.

The key to Molmo’s performance lies in its unique training data, PixMo. This highly curated dataset consists of 1 million image-text pairs and includes two main types of data:

  • Dense captioning data for multimodal pre-training
  • Supervised fine-tuning data to enable various user interactions, such as question answering, document reading, and even pointing to objects in images.

Interestingly, Molmo researchers took an innovative approach to data collection: they asked annotators to provide spoken descriptions of images within 60 to 90 seconds, covering everything visible, including the spatial positioning and relationships among objects. Annotators produced detailed captions far more efficiently this way than with traditional methods (writing them down). Overall, the team collected high-quality audio descriptions for 712K images sampled from 50 high-level topics.

Key features:

  • State-of-the-art performance: Molmo’s 72B model outperforms proprietary models like Gemini 1.5 Pro and Claude 3.5 Sonnet on academic benchmarks. Even the smaller 7B and 1B models rival GPT-4V in several tasks.
  • Pointing capabilities: Molmo can “point” to one or more visual elements in an image. Pointing provides a natural explanation grounded in image pixels, and Molmo researchers believe that in the future pointing will be an important communication channel between VLMs and agents. For example, a web agent could query the VLM for the location of specific objects (see the sketch after this list).
  • Open architecture: The original developers promise to release all artifacts used in creating Molmo, including the PixMo dataset, training code, evaluations, and intermediate checkpoints. This sets a new standard for building high-performing multimodal systems from scratch and promotes reproducibility.
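
Below is a minimal sketch of running Molmo 7B with Transformers, including a pointing-style prompt. The checkpoint ships custom code, so the processor.process and generate_from_batch helpers come from the repository’s remote code rather than core Transformers; treat the exact names, the model ID, and the image URL as assumptions to verify against the model card.

    import requests
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

    model_id = "allenai/Molmo-7B-D-0924"

    # Both the processor and the model rely on custom code shipped with the checkpoint
    processor = AutoProcessor.from_pretrained(
        model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
    )

    image = Image.open(requests.get("https://example.com/kitchen.jpg", stream=True).raw)

    # A pointing-style prompt; Molmo grounds its answer in image coordinates
    inputs = processor.process(images=[image], text="Point to the coffee mug.")
    inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

    output = model.generate_from_batch(
        inputs,
        GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
        tokenizer=processor.tokenizer,
    )
    generated = output[0, inputs["input_ids"].size(1):]
    print(processor.tokenizer.decode(generated, skip_special_tokens=True))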

Points to be cautious about:

  • The developers have committed to releasing all artifacts used in creating Molmo, but at the time of writing, not everything is publicly available yet.

Qwen2-VL

Qwen2-VL is the latest iteration of the VLMs in the Qwen series. It now goes beyond basic recognition of objects like plants and landmarks to understand complex relationships among multiple objects in a scene. In addition, it is capable of identifying handwritten text and multiple languages within images.

Qwen2-VL also extends its capabilities to video content, supporting video summarization, question answering, and real-time conversations around videos.
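
Here is a minimal single-image sketch using Transformers (4.45 or later) and the qwen-vl-utils helper package; the image URL is a placeholder, and a video entry can be passed in the same message format for video tasks.

    import torch
    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

    model_id = "Qwen/Qwen2-VL-7B-Instruct"
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [
        {"role": "user", "content": [
            {"type": "image", "image": "https://example.com/scene.jpg"},  # placeholder
            {"type": "text", "text": "What objects are on the table, and how are they arranged?"},
        ]}
    ]

    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=128)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])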

Key features:

  • State-of-the-art performance: Qwen2-VL achieves top performance on various visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, and MTVQA. Its 72B version outperforms GPT-4o and Claude-3.5 Sonnet on most image benchmarks.
  • Video comprehension: With online streaming features, Qwen2-VL can analyze videos over 20 minutes long and handle questions about the videos.
  • Flexible architecture: Qwen2-VL comes in 2B, 7B, and 72B parameter sizes and offers different quantized versions (e.g., AWQ and GPTQ), so you can pick the variant that fits your use case. Some of the models can even run on devices like mobile phones and robots, operating automatically based on visual environments and text instructions.
  • Multilingual support: Qwen2-VL can understand text in various languages within images, including English, Chinese, most European languages, Japanese, Korean, Arabic, and Vietnamese.

Points to be cautious about:

  • Following complex instructions: The model requires further enhancement in understanding and executing intricate multi-step instructions.
  • Counting accuracy: The model's accuracy in object counting is low and needs improvement, particularly in complex scenes.
  • Spatial reasoning skills: The model struggles with inferring object positional relationships, especially in 3D spaces.

You can find more information about its limitations in the model’s GitHub repository.

Pixtral

Pixtral is a 12 billion parameter open-source model developed by Mistral, marking the company's first foray into multimodal capabilities. Pixtral is designed to understand both images and text, released with open weights under the Apache 2.0 license.

Pixtral is pre-trained on a large-scale dataset of interleaved image and text documents and then instruction-tuned, making it capable of multi-turn, multi-image conversations. Unlike many previous open-source models, Pixtral maintains excellent text benchmark performance while excelling at multimodal tasks.

Key features:

  • Outstanding instruction following capability: Benchmark results indicate that Pixtral 12B significantly outperforms other open-source multimodal models like Qwen2-VL 7B, LLaVa-OneVision 7B, and Phi-3.5 Vision in instruction following tasks. Mistral has created new benchmarks, MM-IF-Eval and MM-MT-Bench, to further assess performance in multimodal contexts, where Pixtral also excels. These benchmarks are expected to be open-sourced for the community in the near future.

    pixtral-benchmark.png
    Image Source: Pixtral Announcement Blog Post
  • Multi-image processing: Pixtral can handle multiple images in a single input, processing them at their native resolution. The model supports a context window of 128,000 tokens and can ingest images with varied sizes and aspect ratios.

Points to be cautious about:

  • Lack of moderation mechanisms: Currently, Pixtral does not include any built-in moderation features, so it may not be suitable for use cases that require controlled outputs.

To deploy Pixtral 12B, you can run openllm serve pixtral:12b with OpenLLM.
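
Since Pixtral accepts several images in one request, you can attach multiple image URLs to a single message once the server is running. As before, the OpenAI-compatible endpoint at http://localhost:3000/v1 and the image URLs are assumptions to adapt to your deployment.

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:3000/v1", api_key="na")
    model_id = client.models.list().data[0].id  # use the served model name

    response = client.chat.completions.create(
        model=model_id,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two charts and summarize the key differences."},
                {"type": "image_url", "image_url": {"url": "https://example.com/q1_revenue.png"}},
                {"type": "image_url", "image_url": {"url": "https://example.com/q2_revenue.png"}},
            ],
        }],
    )
    print(response.choices[0].message.content)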

Do I really need VLMs?

This is probably the first question you should ask yourself. Also, think about the type of data your application needs to process. If your use case only requires text, an LLM is often sufficient. However, if you need to analyze both text and images, a VLM is a reasonable choice.

If you choose a VLM, be aware that certain models may compromise their text-only performance to excel in multimodal tasks. This is why some model developers emphasize that their new models, such as NVLM and Pixtral, do not sacrifice text performance for multimodal capabilities.

For other modalities, note that different models may be specialized for particular fields, such as document processing or audio analysis. These are better suited for multimodal scenarios beyond just text and images.

What should I consider when deploying VLMs?

Consider the following factors to ensure optimal performance and usability:

Infrastructure requirements

VLMs often require significant computational resources due to their large size. Top-performing open-source models like the ones mentioned above can exceed 70 billion parameters. This means you need high-performance GPUs to run them, especially for real-time applications.
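
As a rough rule of thumb, model weights alone take about 2 bytes per parameter at FP16/BF16 precision (about 1 byte at 8-bit and 0.5 bytes at 4-bit quantization), before accounting for the vision encoder, KV cache, and activations. A quick back-of-the-envelope sketch:

    def weight_memory_gb(num_params_billion: float, bytes_per_param: float = 2.0) -> float:
        """Approximate memory for model weights only (excludes KV cache and activations)."""
        return num_params_billion * 1e9 * bytes_per_param / 1024**3

    # A 90B model in BF16 needs roughly 168 GB for weights alone, while an
    # 11B model fits in about 20 GB before the KV cache is factored in.
    for size in (11, 72, 90):
        print(f"{size}B @ BF16: ~{weight_memory_gb(size):.0f} GB")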

If you are looking for a solution that simplifies this process, you can try BentoCloud. It seamlessly integrates cutting-edge AI infrastructure into enterprises’ private cloud environment. Its cloud-agnostic approach allows AI teams to select the cloud regions with the most competitive GPU rates. As BentoCloud offloads the infrastructure burdens, you can focus on building the core functions with your VLM.

Multimodal handling

Not all model serving and deployment frameworks are designed to handle multimodal inputs, such as text, images, and videos. To leverage the full potential of your VLM, ensure your serving and deployment framework can accommodate and process multiple data types simultaneously.

BentoML supports a wide range of data types, such as text, images, audio, and documents. You can easily integrate it with your existing ML workflow without building a custom pipeline for handling multimodal inputs.
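
For example, with BentoML’s Python SDK you can declare a single endpoint that accepts an image and a text prompt together. The skeleton below is only a sketch: the service name and method are hypothetical, and the inference call is left as a stub for whichever VLM backend you plug in.

    import bentoml
    from PIL import Image

    @bentoml.service(resources={"gpu": 1})
    class VLMService:
        """A hypothetical service that takes multimodal input (image + text)."""

        @bentoml.api
        def describe(self, image: Image.Image, prompt: str = "Describe this image.") -> str:
            # Replace this stub with a call to your VLM backend (e.g., Transformers or vLLM)
            return f"(model output for prompt: {prompt!r})"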

Fast scaling

VLMs are often used in demanding applications such as:

  • Real-time image captioning for large-scale media platforms.
  • Visual search in e-commerce, where users upload images to find similar products.
  • Visual question answering in customer support or educational tools.

In these use cases, traffic can spike unpredictably based on user behavior. This means your deployment framework should support fast scaling during peak hours. BentoML provides easy building blocks to create scalable APIs, allowing you to deploy and run any VLM on BentoCloud. Its autoscaling feature ensures you only pay for the resources you use.

How do I interpret the benchmarks mentioned by VLM providers?

Each benchmark serves a specific purpose and can highlight different capabilities of models. Here are five popular benchmarks for VLMs:

  • MMMU evaluates multimodal models in advanced tasks that require college-level subject knowledge and reasoning. It contains 11.5K questions from college exams, quizzes, and textbooks across six disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Sciences, and Tech & Engineering.
  • MMBench is a comprehensive benchmark that assesses a model's performance across various applications of multimodal tasks. It includes over 3,000 multiple-choice questions, covering 20 different ability dimensions, such as object localization, spatial reasoning, and social interaction. Each ability dimension contains at least 125 questions, ensuring a balanced evaluation of a model’s vision language capabilities.
  • ChartQA tests a model's capacity to extract relevant information from various types of charts (e.g., bar charts, line graphs, and pie charts) and answer questions about the data. It assesses skills such as trend analysis, data comparison, and numerical reasoning based on visual representations.
  • DocVQA focuses on the comprehension of complex documents that contain both textual and visual elements, such as forms, receipts, charts, and diagrams embedded in documents.
  • MathVista evaluates models in the domain of mathematical reasoning and problem-solving. It presents problems that combine visual elements, such as geometric figures, plots, or diagrams, with textual descriptions and questions.

One thing to note is that you should always treat benchmarks with caution. They are important, but by no means the only reference for choosing the right model for your use case.

Final thoughts

Over the past month, we’ve seen a wave of powerful open-source VLMs emerge. Is this a coincidence, or are LLMs moving towards multimodal capabilities as a trend? It may be too early to say for sure. What remains unchanged is the need for robust solutions to quickly and securely deploy these models into production at scale.

If you have questions about productionizing VLMs, check out the following resources: