OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video QA

Abstract

Long-horizon omnimodal question answering requires reasoning jointly over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and the lack of a clear end-to-end optimization scheme. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image–audio retrieval-augmented generation (RAG) module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. It also runs an agent loop that plans, calls tools across turns, and merges the retrieved evidence to answer complex queries. Finally, we apply group relative policy optimization (GRPO) to jointly improve tool use and answer quality. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings, with ablations validating each component.

Key Contributions

🖼️

Image–Audio RAG Module

Per-video FAISS indices built from CLIP-encoded keyframes and Whisper ASR transcripts, enabling fine-grained retrieval of short, relevant clips at inference time without dense video encoding.

🔄

Multi-Turn Agent Loop

An OmniLLM iteratively issues Think and Query actions, incorporating retrieved image/audio responses into an interaction history, and synthesizes evidence into a grounded final answer.

🎯

GRPO-Based RL Training

End-to-end optimization with Group Relative Policy Optimization using Format Reward and Answer Reward, jointly improving both retrieval tool-use quality and final answer correctness.

📊

Strong Benchmark Results

Consistently outperforms closed- and open-source baselines across OmniVideoBench, WorldSense, and Daily-Omni under low-resource (3B / 7B parameter) settings, supported by comprehensive ablation studies.

Method Overview

OmniRAG-Agent combines a lightweight retrieval pipeline with an agentic reasoning loop trained via reinforcement learning. The three core components work together to handle long videos within a strict budget constraint.

Retrieval Bank Construction

Videos are downsampled to clips of at most 30 minutes; keyframes and 5-second audio segments are extracted, embedded with CLIP and Whisper, respectively, and stored in per-video FAISS indices served via a FastAPI endpoint.
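The bank construction and lookup described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `VideoBank` class, the `audio_segments` helper, and the toy 2-D vectors are hypothetical, and a plain cosine-similarity scan stands in for the FAISS index so the example runs with the standard library alone. Real embeddings would come from CLIP and Whisper, which are not called here.

```python
import math

SEGMENT_SECONDS = 5          # audio segment length stated in the text
MAX_CLIP_SECONDS = 30 * 60   # 30-minute cap stated in the text


def audio_segments(duration_s):
    """Return (start, end) windows of at most SEGMENT_SECONDS covering the capped clip."""
    capped = min(duration_s, MAX_CLIP_SECONDS)
    windows = []
    start = 0.0
    while start < capped:
        windows.append((start, min(start + SEGMENT_SECONDS, capped)))
        start += SEGMENT_SECONDS
    return windows


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class VideoBank:
    """Hypothetical per-video bank: one list of unit vectors per modality."""

    def __init__(self):
        self.items = {"image": [], "audio": []}

    def add(self, modality, vector, payload):
        self.items[modality].append((normalize(vector), payload))

    def search(self, modality, query_vector, k=3):
        """Brute-force top-k by cosine similarity (stand-in for a FAISS search)."""
        q = normalize(query_vector)
        scored = [(sum(a * b for a, b in zip(q, v)), p)
                  for v, p in self.items[modality]]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Toy usage with made-up embeddings and payload labels.
bank = VideoBank()
bank.add("image", [1.0, 0.0], "frame@12s")
bank.add("image", [0.0, 1.0], "frame@40s")
top = bank.search("image", [0.9, 0.1], k=1)
print(top[0][1])
```

Keeping one bank per video means a query only ever scans the frames and segments of the video being asked about, which is what keeps retrieval cheap compared with densely encoding the full video.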

OmniLLM Agent Interaction

At each turn the agent emits structured <think> reasoning and optionally a <query> call (image or audio). Retrieved clips are appended to the interaction history before the next turn, up to 20 turns total.
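A minimal sketch of this interaction loop follows. The tag attributes and stop condition are assumptions made for illustration: the text names `<think>` and `<query>` but does not give the exact schema, so the `type="image|audio"` attribute, the `<answer>` tag, and the `run_agent`, `scripted_model`, and `retrieve` names are all hypothetical. A scripted stub replaces the OmniLLM so the loop can run on its own.

```python
import re

MAX_TURNS = 20  # turn limit stated in the text

# Assumed tag schema (not specified in the source).
QUERY_RE = re.compile(r'<query\s+type="(image|audio)">(.*?)</query>', re.S)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.S)


def run_agent(question, model, retrieve, max_turns=MAX_TURNS):
    """Alternate model turns and retrieval calls until an answer or the turn limit."""
    history = [("user", question)]
    for _ in range(max_turns):
        output = model(history)
        history.append(("assistant", output))
        answer = ANSWER_RE.search(output)
        if answer:
            return answer.group(1).strip(), history
        query = QUERY_RE.search(output)
        if query:
            modality, text = query.group(1), query.group(2).strip()
            # Retrieved evidence is appended before the next model turn.
            history.append(("tool", retrieve(modality, text)))
    return None, history  # no answer produced within the turn limit


def scripted_model(history):
    """Stub model: ask for audio once, then answer."""
    if not any(role == "tool" for role, _ in history):
        return ('<think>Need the soundtrack.</think>'
                '<query type="audio">what is said at the start</query>')
    return "<think>The clip mentions rain.</think><answer>B</answer>"


answer, history = run_agent(
    "What is the weather?", scripted_model,
    lambda modality, text: f"[{modality} clip for: {text}]",
)
print(answer, len(history))
```

Returning `None` when the turn limit is reached makes the budget explicit to the caller, which can then decide how to score or retry an unanswered question.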

GRPO Reward Signals

Two rewards guide training: the Format Reward checks that the structured tags (<think>…</think>) are present, and the Answer Reward checks the final answer against the ground-truth option. Advantages are estimated relative to the group of responses sampled for the same question.
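The two rewards and the group-relative advantage can be illustrated as follows. Several details are assumptions rather than the authors' settings: the `<answer>` tag, the A to D option letters, summing the two rewards with equal weight, and the use of the population standard deviation. The normalization itself, reward minus group mean divided by group standard deviation, is the standard GRPO form.

```python
import re
import statistics

THINK_RE = re.compile(r"<think>.+?</think>", re.S)
# Assumed answer format: a single option letter inside <answer> tags.
ANSWER_RE = re.compile(r"<answer>\s*([A-D])\s*</answer>")


def format_reward(completion):
    """1.0 if a non-empty <think>...</think> block is present, else 0.0."""
    return 1.0 if THINK_RE.search(completion) else 0.0


def answer_reward(completion, gold_option):
    """1.0 if the extracted option letter matches the ground truth, else 0.0."""
    match = ANSWER_RE.search(completion)
    return 1.0 if match and match.group(1) == gold_option else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One group of sampled completions for the same question (gold option: B).
group = [
    "<think>rain is audible</think><answer>B</answer>",  # format + correct
    "<think>looks sunny</think><answer>C</answer>",      # format only
    "<answer>B</answer>",                                # correct only
    "no tags at all",                                    # neither
]
rewards = [format_reward(c) + answer_reward(c, "B") for c in group]
advantages = group_relative_advantages(rewards)
print(rewards)
print([round(a, 3) for a in advantages])
```

Because the baseline is the group mean, no separate value network is needed: a completion is rewarded only for doing better than its siblings sampled for the same question.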

Experimental Results

We evaluate on three benchmarks: OmniVideoBench (audio-visual QA), WorldSense (domain-knowledge video QA), and Daily-Omni (daily-life audio-visual reasoning). Bold indicates the best average performance among open-source methods.

Benchmarks

OmniVideoBench (7B): 29.69%

WorldSense (7B): 38.28%

Daily-Omni (7B): 44.92%

Table 1: OmniVideoBench

Comparison of baselines on OmniVideoBench. Bold = best average performance.

Method	Mod.	Compare Attr	AudioBg	Reasoning	Logic Ref	Ego Spatial	Perception	TimeTemp	Text Sense	Avg
Closed-source Models
GPT-5.1	V	22.22	27.27	20.83	26.09	29.17	19.67	4.00	40.62	24.51
Gemini 2.0-Flash	A+V	22.22	36.36	25.00	30.43	26.76	29.51	16.00	31.25	27.34
+ RAG	A+V	44.44	36.36	12.50	43.48	28.17	24.59	20.00	46.88	29.69
+ RAG + Agent	A+V	22.22	36.36	16.67	34.78	29.58	29.51	24.00	46.88	30.47
Gemini 2.5-Flash	A+V	22.22	18.18	25.00	34.78	35.21	31.15	4.00	31.25	28.52
+ RAG	A+V	11.11	36.36	20.83	30.43	33.80	36.07	12.00	53.12	32.42
+ RAG + Agent	A+V	44.44	27.27	29.17	30.43	33.80	34.43	28.00	53.12	35.16
Open-source Models
Qwen2.5-Omni-3B	A+V	22.22	9.09	25.00	17.39	25.35	18.03	28.00	31.25	23.05
+ RAG	A+V	33.33	27.27	33.33	26.09	26.76	16.39	24.00	25.00	24.61
+ RAG + Agent	A+V	22.22	26.36	37.50	26.09	25.35	24.59	24.00	28.12	26.95
+ RAG + Agent + RL	A+V	33.33	27.27	33.33	21.74	26.76	21.31	36.00	31.25	27.34
Qwen2.5-Omni-7B	A+V	22.22	18.18	25.00	26.09	23.94	27.87	16.00	31.25	25.00
+ RAG	A+V	22.22	27.27	12.50	39.13	22.54	24.59	40.00	40.62	27.73
+ RAG + Agent	A+V	44.44	45.45	29.17	26.09	28.17	22.95	24.00	34.52	28.52
+ RAG + Agent + RL	A+V	33.33	27.27	12.50	26.09	32.39	34.43	20.00	37.50	29.69
Qwen3-Omni-30B	A+V	33.33	18.18	12.50	26.09	32.39	32.79	12.00	34.38	27.73
+ RAG	A+V	11.11	36.36	8.33	26.09	38.03	26.23	20.00	34.38	28.12
+ RAG + Agent	A+V	33.33	36.36	16.67	21.74	38.03	27.87	28.00	25.00	29.30

Table 2: WorldSense

Comparison of baselines on WorldSense. Bold = best average performance.

Method	Mod.	Tech & Science	Culture & Politics	Daily Life	Film & TV	Performance	Games	Sports	Music	Avg
Closed-source Models
Gemini 2.5-Flash	A+V	47.37	23.08	27.78	40.00	27.27	29.41	25.00	43.24	33.59
+ RAG	A+V	55.26	38.46	29.63	36.67	31.82	41.18	21.88	45.95	37.50
+ RAG + Agent	A+V	57.89	30.77	42.59	33.33	31.82	41.18	25.00	45.95	39.84
Open-source Models
Qwen2.5-Omni-3B	A+V	34.21	42.32	27.78	30.00	13.64	29.41	15.62	40.54	29.69
+ RAG	A+V	34.21	26.92	25.93	30.00	31.82	41.18	31.25	35.14	31.25
+ RAG + Agent	A+V	36.84	26.92	31.48	26.67	27.27	29.41	40.62	45.95	33.98
+ RAG + Agent + RL	A+V	36.84	30.77	40.74	40.00	36.36	35.29	15.62	51.35	36.71
Qwen2.5-Omni-7B	A+V	34.21	30.77	29.63	33.33	22.73	29.41	18.75	40.54	30.47
+ RAG	A+V	21.05	42.31	37.04	40.00	13.64	29.41	25.00	45.95	32.81
+ RAG + Agent	A+V	31.58	50.00	25.93	30.00	50.00	35.29	21.88	43.24	34.38
+ RAG + Agent + RL	A+V	44.74	23.08	40.74	40.00	40.91	41.18	25.00	45.95	38.28

Table 3: Daily-Omni

Comparison of baselines on Daily-Omni. Bold = best average performance.

Method	Mod.	AV Event Alignment	Comparative	Context Understand.	Event Sequence	Inference	Reasoning	Avg
Closed-source Models
Gemini 2.5-Flash	A+V	34.63	34.38	50.00	37.10	56.82	51.11	44.53
+ RAG	A+V	38.46	39.39	45.65	41.94	54.55	55.56	46.48
+ RAG + Agent	A+V	26.92	34.38	46.65	53.23	59.09	60.00	48.83
Open-source Models
Qwen2.5-Omni-3B	A+V	39.22	21.43	41.46	25.76	33.33	29.73	32.03
+ RAG	A+V	27.45	28.57	60.98	22.73	48.48	37.84	35.94
+ RAG + Agent	A+V	37.25	32.14	41.46	25.76	42.42	51.35	37.11
+ RAG + Agent + RL	A+V	39.22	28.57	41.46	40.54	36.36	51.35	40.09
Qwen2.5-Omni-7B	A+V	29.41	28.57	41.46	51.52	21.21	32.43	36.33
+ RAG	A+V	37.25	35.71	46.34	33.33	42.42	48.65	39.84
+ RAG + Agent	A+V	45.10	39.29	56.10	33.33	48.48	35.14	42.19
+ RAG + Agent + RL	A+V	27.45	32.14	51.22	59.09	45.45	45.95	44.92
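
As a quick worked check, the absolute gains of the full pipeline over the unmodified Qwen2.5-Omni-7B backbone can be read off the Avg columns of Tables 1 to 3. The snippet below only does that subtraction; the dictionary name `AVG` is illustrative, and the numbers are copied from the tables as given.

```python
# (baseline Avg, + RAG + Agent + RL Avg) for Qwen2.5-Omni-7B, from Tables 1-3.
AVG = {
    "OmniVideoBench": (25.00, 29.69),
    "WorldSense": (30.47, 38.28),
    "Daily-Omni": (36.33, 44.92),
}

gains = {name: round(full - base, 2) for name, (base, full) in AVG.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
# prints:
# OmniVideoBench: +4.69 points
# WorldSense: +7.81 points
# Daily-Omni: +8.59 points
```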

Generalization & Efficiency Analysis

Ablation studies examining the effect of retrieval budget (number of retrieved clips) and per-category performance across different model backbones.

Figure 2: Retrieval budget analysis. (a–b) Accuracy vs. number of retrieved image/audio clips for Qwen-Omni-7B and Qwen-Omni-7B+GRPO. (c–e) Per-category dot-plot comparisons showing improvements from +RAG+Agent+RL across all question types on OmniVideoBench.

Figure 3: Per-model radar chart comparison on OmniVideoBench. Radar charts for (a) Gemini 2.0-Flash, (b) Gemini 2.5-Flash, (c) Qwen2.5-Omni-3B, (d) Qwen2.5-Omni-7B, and (e) Qwen3-Omni-30B, comparing Baseline vs. RAG+Agent (and +RL for open-source models) across all eight question categories.
