Multimodal RAG is growing: here’s the best way to get started


As companies begin experimenting with multimodal retrieval-augmented generation (RAG), vendors of multimodal embeddings, which convert data into numerical representations that RAG systems can search, are advising enterprises to start small when they begin embedding images and videos.

Multimodal RAG, which can surface a variety of file types including text, images and videos, relies on embedding models that transform data into numerical representations that AI models can read. Embeddings that can process all kinds of files let enterprises find information in financial graphs, product catalogs or any informational video they have, giving them a more holistic view of their company.
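To make the idea of "numerical representations" concrete, here is a minimal sketch of how retrieval over a shared embedding space can work. The file names and three-dimensional vectors are hand-made placeholders rather than output from any real model (real embeddings have hundreds or thousands of dimensions), and cosine similarity is assumed as the comparison metric.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

# Hypothetical index: every file, whatever its type, maps to a vector
# in the same space. These numbers are invented for illustration only.
index = {
    "q3_revenue_chart.png": [0.9, 0.1, 0.0],
    "product_catalog.pdf": [0.1, 0.9, 0.1],
    "onboarding_video.mp4": [0.0, 0.2, 0.9],
}

def search(query_vector, index, top_k=2):
    """Return the names of the top_k items most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), name)
              for name, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:top_k]]

# Pretend this vector is the embedding of a text query about revenue.
query = [0.8, 0.2, 0.1]
results = search(query, index)
print(results)
```

Because the query and every stored item live in the same vector space, a text question can rank an image or a video above a text document when its vector is closer.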

Cohere, which last month updated its Embed 3 embeddings model to process images and videos, said enterprises need to prepare their data differently, verify that the embeddings perform well enough, and learn to make better use of multimodal RAG.

“Before committing extensive resources to multimodal embeddings, it’s a good idea to test it on a more limited scale. This enables you to assess the model’s performance and suitability for specific use cases and should provide insights into any adjustments needed before full deployment,” a blog post from Cohere staff solutions architect Yann Stoneman said. 
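One way such a limited-scale test might look is a small set of labeled queries scored with recall@k. This is a sketch under assumptions: the metric is a common retrieval measure chosen here for illustration, not something the Cohere post prescribes, and the query results and document IDs below are invented.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    top = set(ranked_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)

def mean_recall(runs, k):
    """Average recall@k over a list of pilot queries."""
    return sum(recall_at_k(r["ranked"], r["relevant"], k) for r in runs) / len(runs)

# Hypothetical pilot: for each test query, what the system returned (in
# rank order) and which items a human reviewer marked as relevant.
pilot_runs = [
    {"ranked": ["img_07", "img_02", "doc_11"], "relevant": {"img_07"}},
    {"ranked": ["doc_03", "img_09", "img_01"], "relevant": {"img_01", "img_09"}},
    {"ranked": ["doc_05", "doc_08", "img_04"], "relevant": {"img_12"}},
]

for k in (1, 2, 3):
    print(f"recall@{k}: {mean_recall(pilot_runs, k):.2f}")
```

A low score on a particular kind of query, such as the third one here where the relevant image never surfaces, is the sort of signal that can point to adjustments needed before full deployment.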

The company said many of the processes discussed in the post apply to many other multimodal embedding models as well.

Stoneman said that, depending on the industry, models may also need “additional training to pick up fine-grain details and variations in images.” He cited medical applications as an example, where radiology scans or photos of microscopic cells require a specialized embedding system that understands the nuances of those kinds of images.

Data preparation is key

Before images are fed into a multimodal RAG system, they must be pre-processed so the embedding model can read them well.

Images may need to be resized to a consistent size. Organizations also need to decide whether to enhance low-resolution photos so important details are not lost, and whether to reduce the resolution of very large pictures so they do not slow processing.
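The resizing decision can be sketched as a small planning function that works on pixel dimensions only. The `min_side` and `max_side` thresholds are hypothetical defaults chosen for illustration, not values recommended by Cohere, and the actual pixel work would be handed to an image library of your choice.

```python
def plan_resize(width, height, min_side=224, max_side=1024):
    """Decide how to bring an image into an acceptable size range.

    Returns (new_width, new_height, action), preserving aspect ratio.
    The thresholds are illustrative assumptions, not vendor guidance.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")

    longest = max(width, height)
    shortest = min(width, height)

    if longest > max_side:
        # Too large: shrink so the longest side equals max_side.
        scale = max_side / longest
        action = "downscale"
    elif shortest < min_side:
        # Too small: flag for enlargement or enhancement. For very
        # elongated images the result can exceed max_side on the long side.
        scale = min_side / shortest
        action = "upscale"
    else:
        return width, height, "keep"

    return max(1, round(width * scale)), max(1, round(height * scale)), action

for size in [(4000, 3000), (100, 200), (800, 600)]:
    print(size, "->", plan_resize(*size))
```

Keeping the decision separate from the pixel operations makes it easy to audit a whole image collection first and see how many files would be shrunk, enlarged or left alone.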

“The system should be able to process image pointers (e.g. URLs or file paths) alongside text data, which may not be possible with text-based embeddings. To create a smooth user experience, organizations may need to implement custom code to integrate image retrieval with existing text retrieval,” the blog said. 
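As one illustration of what that custom integration code might look like, the sketch below merges the ranked results of a text retriever and an image retriever into a single list. It uses reciprocal rank fusion, a general-purpose ranking technique swapped in here as an assumption; the blog does not say which method Cohere has in mind. The `Hit` record, document IDs, snippets and file paths are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    doc_id: str
    modality: str  # "text" or "image"
    pointer: str   # a text snippet, or a URL / file path for an image

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists; each hit scores 1 / (k + rank) per list."""
    scores = {}
    hits_by_id = {}
    for results in result_lists:
        for rank, hit in enumerate(results, start=1):
            scores[hit.doc_id] = scores.get(hit.doc_id, 0.0) + 1.0 / (k + rank)
            hits_by_id[hit.doc_id] = hit
    # Highest fused score first; ties are broken by doc_id for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return [hits_by_id[d] for d in ordered]

# Hypothetical output of two separate retrievers for the same query.
text_hits = [
    Hit("t1", "text", "Quarterly revenue summary paragraph"),
    Hit("t2", "text", "Footnote on currency effects"),
]
image_hits = [
    Hit("i1", "image", "images/q3_revenue_chart.png"),
    Hit("i2", "image", "images/regional_breakdown.png"),
]

merged = reciprocal_rank_fusion([text_hits, image_hits])
for hit in merged:
    print(hit.modality, hit.pointer)
```

Storing a pointer rather than the image itself keeps the index light; the application can resolve the URL or path only for the results it actually displays.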

Multimodal embeddings become more useful 

Most RAG systems deal mainly with text because embedding text is easier than embedding images or videos. But since most enterprises hold all kinds of data, RAG that can search both pictures and text has become more popular. Previously, organizations often had to implement separate RAG systems and databases for each data type, which prevented mixed-modality searches.

Multimodal search is not new: OpenAI and Google both offer it in their respective chatbots. OpenAI launched its latest generation of embeddings models in January. Other companies also provide ways for businesses to harness their different data for multimodal RAG. For example, Uniphore released a way to help enterprises prepare multimodal datasets for RAG.


