OmniRAG-Agent: Agentic Omnimodal Reasoning for
Low-Resource Long Audio-Video Question Answering


Yifan Zhu  ·  Xinyu Mu  ·  Tao Feng  ·  Zhonghong Ou  ·  Yuning Gong  ·  Haoran Luo

Figure 1: Overview of the OmniRAG-Agent framework. The agent processes long video and audio through downsampling, builds a Multi-Modal Retrieval Bank, and interacts with an OmniLLM over multiple turns — issuing Think and Query actions to retrieve relevant image/audio clips — before synthesizing evidence into a final answer. Training uses Format Reward and Answer Reward signals.

Abstract

Long-horizon omnimodal question answering requires reasoning jointly over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and the lack of clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image–audio retrieval-augmented generation (RAG) module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. It also uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Finally, we apply group relative policy optimization (GRPO) to jointly improve tool use and answer quality. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings, with ablations validating each component.

Key Contributions

Image–Audio RAG Module

Per-video FAISS indices built from CLIP-encoded keyframes and Whisper ASR transcripts, enabling fine-grained retrieval of short, relevant clips at inference time without dense video encoding.

Multi-Turn Agent Loop

An OmniLLM iteratively issues Think and Query actions, incorporating retrieved image/audio responses into an interaction history, and synthesizes evidence into a grounded final answer.

GRPO-Based RL Training

End-to-end optimization with Group Relative Policy Optimization using Format Reward and Answer Reward, jointly improving both retrieval tool-use quality and final answer correctness.

Strong Benchmark Results

Consistently outperforms closed- and open-source baselines across OmniVideoBench, WorldSense, and Daily-Omni under low-resource (3B / 7B parameter) settings, supported by a comprehensive ablation study.

Method Overview

OmniRAG-Agent combines a lightweight retrieval pipeline with an agentic reasoning loop trained via reinforcement learning. The three core components work together to handle long videos within a strict budget constraint.

Retrieval Bank Construction

Each video is downsampled into clips of at most 30 minutes. Keyframes and 5-second audio segments are then extracted, embedded with CLIP and Whisper respectively, and stored in per-video FAISS indices served through a FastAPI endpoint.
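
The bank construction and lookup described above can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' code: toy vectors stand in for CLIP or Whisper embeddings, and a brute-force cosine search stands in for a per-video FAISS index. The names `audio_windows`, `RetrievalBank`, and `search` are hypothetical, and treating the 30-minute limit as a simple truncation is an assumption.

```python
import math
from dataclasses import dataclass, field

MAX_CLIP_SECONDS = 30 * 60   # "at most 30 minutes" (from the text)
AUDIO_WINDOW_SECONDS = 5     # "5-second audio segments" (from the text)


def audio_windows(duration_s):
    """Split a clip into consecutive 5-second (start, end) windows.

    Assumption: anything beyond the 30-minute cap is simply truncated.
    """
    capped = min(duration_s, MAX_CLIP_SECONDS)
    windows = []
    start = 0.0
    while start < capped:
        windows.append((start, min(start + AUDIO_WINDOW_SECONDS, capped)))
        start += AUDIO_WINDOW_SECONDS
    return windows


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


@dataclass
class RetrievalBank:
    """Hypothetical per-video bank: unit vectors plus the clip each one describes."""
    vectors: list = field(default_factory=list)
    clips: list = field(default_factory=list)

    def add(self, vector, clip):
        self.vectors.append(normalize(vector))
        self.clips.append(clip)

    def search(self, query, k=3):
        """Return the k (score, clip) pairs with the highest cosine similarity."""
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), clip)
            for v, clip in zip(self.vectors, self.clips)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Toy usage: three "keyframe" vectors and one query vector.
bank = RetrievalBank()
bank.add([1.0, 0.0], "frame@12s")
bank.add([0.0, 1.0], "frame@47s")
bank.add([0.7, 0.7], "frame@90s")
top = bank.search([0.9, 0.1], k=2)
top_clips = [clip for _, clip in top]
```

In a real deployment the exhaustive scan would be replaced by an actual index; the sketch only shows that retrieval returns a handful of short clips instead of the densely encoded video.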

OmniLLM Agent Interaction

At each turn the agent emits structured <think> reasoning and optionally a <query> call (image or audio). Retrieved clips are appended to the interaction history before the next turn, up to 20 turns total.
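
The turn structure above can be illustrated with a small control loop. This is a sketch under stated assumptions, not the released implementation: the `<query type="image">` attribute syntax and the `<answer>` tag are assumed formats, and `model_fn` and `retrieve_fn` are hypothetical callables standing in for the OmniLLM and the retrieval endpoint.

```python
import re

MAX_TURNS = 20  # turn cap stated in the text

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
# Assumed syntax: <query type="image">...</query> or <query type="audio">...</query>
QUERY_RE = re.compile(r'<query type="(image|audio)">(.*?)</query>', re.DOTALL)
# Assumed: the final answer is wrapped in <answer>...</answer>
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_turn(text):
    """Extract the reasoning, optional query, and optional answer from one turn."""
    think = THINK_RE.search(text)
    query = QUERY_RE.search(text)
    answer = ANSWER_RE.search(text)
    return {
        "think": think.group(1).strip() if think else None,
        "query": (query.group(1), query.group(2).strip()) if query else None,
        "answer": answer.group(1).strip() if answer else None,
    }


def run_agent(question, model_fn, retrieve_fn, max_turns=MAX_TURNS):
    """Alternate model turns and retrieval until an answer appears or turns run out."""
    history = [("user", question)]
    for _ in range(max_turns):
        output = model_fn(history)
        history.append(("assistant", output))
        turn = parse_turn(output)
        if turn["answer"] is not None:
            return turn["answer"], history
        if turn["query"] is None:
            break  # no query and no answer: nothing more to do
        modality, text = turn["query"]
        # Retrieved evidence is appended to the history before the next turn.
        history.append(("tool", retrieve_fn(modality, text)))
    return None, history


# Toy usage with a scripted "model" that queries once and then answers.
scripted = iter([
    '<think>Need the scene with the dog.</think><query type="image">dog on beach</query>',
    "<think>The clip shows a beach.</think><answer>B</answer>",
])
answer, history = run_agent(
    "Where is the dog? (A) park (B) beach",
    model_fn=lambda h: next(scripted),
    retrieve_fn=lambda modality, text: f"[{modality} clip for: {text}]",
)
```

Keeping the history as a flat list of (role, content) pairs makes the appended retrieval results visible to every later turn, which is the behavior the paragraph describes.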

GRPO Reward Signals

Two rewards guide training. The Format Reward checks that the structured tags (<think>…</think>) are present, and the Answer Reward checks the final answer against the ground-truth option. GRPO then estimates each rollout's advantage relative to the other rollouts sampled for the same question.
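
A minimal sketch of the two rewards and the group-relative advantage follows. It is an illustration under assumptions, not the paper's training code: the `<answer>` tag, the binary 0/1 reward values, and the weights `w_format` and `w_answer` are hypothetical. The advantage uses the usual GRPO form, in which each reward is standardized against the mean and standard deviation of its own rollout group.

```python
import re
import statistics

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
# Assumed: the chosen option letter is wrapped in <answer>...</answer>
ANSWER_RE = re.compile(r"<answer>\s*([A-Za-z])\s*</answer>")


def format_reward(response):
    """1.0 if a <think>...</think> block is present, else 0.0."""
    return 1.0 if THINK_RE.search(response) else 0.0


def answer_reward(response, gold_option):
    """1.0 if the extracted option letter matches the ground truth, else 0.0."""
    match = ANSWER_RE.search(response)
    if match and match.group(1).upper() == gold_option.upper():
        return 1.0
    return 0.0


def total_reward(response, gold_option, w_format=0.5, w_answer=1.0):
    # The weights are illustrative assumptions, not values from the paper.
    return (w_format * format_reward(response)
            + w_answer * answer_reward(response, gold_option))


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same question.

    Population standard deviation is used here; implementations differ on this.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Toy usage: four rollouts for one question whose correct option is B.
group = [
    "<think>beach scene</think><answer>B</answer>",
    "<think>maybe the park</think><answer>A</answer>",
    "<answer>B</answer>",
    "no tags at all",
]
rewards = [total_reward(r, "B") for r in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are centered within the group, a rollout is rewarded only for doing better than its siblings, so no separate value model is needed.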

Experimental Results

Evaluated on three benchmarks: OmniVideoBench (audio-visual QA), WorldSense (domain-knowledge video QA), and Daily-Omni (daily-life audio-visual reasoning). Bold indicates the best average performance among open-source methods.

Benchmarks: 3
OmniVideoBench (7B): 29.69%
WorldSense (7B): 38.28%
Daily-Omni (7B): 44.92%

Table 1: OmniVideoBench
Comparison of baselines on OmniVideoBench. Bold = best average performance.
Method | Mod. | Compare Attr Audio Bg Reasoning Logic Ref Ego Spatial Perception Time Temp Text Sense (eight category columns) | Avg
Closed-source Models
GPT-5.1 | V | 22.22 | 27.27 | 20.83 | 26.09 | 29.17 | 19.67 | 4.00 | 40.62 | 24.51
Gemini 2.0-Flash | A+V | 22.22 | 36.36 | 25.00 | 30.43 | 26.76 | 29.51 | 16.00 | 31.25 | 27.34
+ RAG | A+V | 44.44 | 36.36 | 12.50 | 43.48 | 28.17 | 24.59 | 20.00 | 46.88 | 29.69
+ RAG + Agent | A+V | 22.22 | 36.36 | 16.67 | 34.78 | 29.58 | 29.51 | 24.00 | 46.88 | 30.47
Gemini 2.5-Flash | A+V | 22.22 | 18.18 | 25.00 | 34.78 | 35.21 | 31.15 | 4.00 | 31.25 | 28.52
+ RAG | A+V | 11.11 | 36.36 | 20.83 | 30.43 | 33.80 | 36.07 | 12.00 | 53.12 | 32.42
+ RAG + Agent | A+V | 44.44 | 27.27 | 29.17 | 30.43 | 33.80 | 34.43 | 28.00 | 53.12 | 35.16
Open-source Models
Qwen2.5-Omni-3B | A+V | 22.22 | 9.09 | 25.00 | 17.39 | 25.35 | 18.03 | 28.00 | 31.25 | 23.05
+ RAG | A+V | 33.33 | 27.27 | 33.33 | 26.09 | 26.76 | 16.39 | 24.00 | 25.00 | 24.61
+ RAG + Agent | A+V | 22.22 | 26.36 | 37.50 | 26.09 | 25.35 | 24.59 | 24.00 | 28.12 | 26.95
+ RAG + Agent + RL | A+V | 33.33 | 27.27 | 33.33 | 21.74 | 26.76 | 21.31 | 36.00 | 31.25 | 27.34
Qwen2.5-Omni-7B | A+V | 22.22 | 18.18 | 25.00 | 26.09 | 23.94 | 27.87 | 16.00 | 31.25 | 25.00
+ RAG | A+V | 22.22 | 27.27 | 12.50 | 39.13 | 22.54 | 24.59 | 40.00 | 40.62 | 27.73
+ RAG + Agent | A+V | 44.44 | 45.45 | 29.17 | 26.09 | 28.17 | 22.95 | 24.00 | 34.52 | 28.52
+ RAG + Agent + RL | A+V | 33.33 | 27.27 | 12.50 | 26.09 | 32.39 | 34.43 | 20.00 | 37.50 | 29.69
Qwen3-Omni-30B | A+V | 33.33 | 18.18 | 12.50 | 26.09 | 32.39 | 32.79 | 12.00 | 34.38 | 27.73
+ RAG | A+V | 11.11 | 36.36 | 8.33 | 26.09 | 38.03 | 26.23 | 20.00 | 34.38 | 28.12
+ RAG + Agent | A+V | 33.33 | 36.36 | 16.67 | 21.74 | 38.03 | 27.87 | 28.00 | 25.00 | 29.30

Table 2: WorldSense
Comparison of baselines on WorldSense. Bold = best average performance.
Method | Mod. | Tech & Science | Culture & Politics | Daily Life | Film & TV | Performance | Games | Sports | Music | Avg
Closed-source Models
Gemini 2.5-Flash | A+V | 47.37 | 23.08 | 27.78 | 40.00 | 27.27 | 29.41 | 25.00 | 43.24 | 33.59
+ RAG | A+V | 55.26 | 38.46 | 29.63 | 36.67 | 31.82 | 41.18 | 21.88 | 45.95 | 37.50
+ RAG + Agent | A+V | 57.89 | 30.77 | 42.59 | 33.33 | 31.82 | 41.18 | 25.00 | 45.95 | 39.84
Open-source Models
Qwen2.5-Omni-3B | A+V | 34.21 | 42.32 | 27.78 | 30.00 | 13.64 | 29.41 | 15.62 | 40.54 | 29.69
+ RAG | A+V | 34.21 | 26.92 | 25.93 | 30.00 | 31.82 | 41.18 | 31.25 | 35.14 | 31.25
+ RAG + Agent | A+V | 36.84 | 26.92 | 31.48 | 26.67 | 27.27 | 29.41 | 40.62 | 45.95 | 33.98
+ RAG + Agent + RL | A+V | 36.84 | 30.77 | 40.74 | 40.00 | 36.36 | 35.29 | 15.62 | 51.35 | 36.71
Qwen2.5-Omni-7B | A+V | 34.21 | 30.77 | 29.63 | 33.33 | 22.73 | 29.41 | 18.75 | 40.54 | 30.47
+ RAG | A+V | 21.05 | 42.31 | 37.04 | 40.00 | 13.64 | 29.41 | 25.00 | 45.95 | 32.81
+ RAG + Agent | A+V | 31.58 | 50.00 | 25.93 | 30.00 | 50.00 | 35.29 | 21.88 | 43.24 | 34.38
+ RAG + Agent + RL | A+V | 44.74 | 23.08 | 40.74 | 40.00 | 40.91 | 41.18 | 25.00 | 45.95 | 38.28

Table 3: Daily-Omni
Comparison of baselines on Daily-Omni. Bold = best average performance.
Method | Mod. | AV Event Alignment | Comparative | Context Understanding | Event Sequence | Inference | Reasoning | Avg
Closed-source Models
Gemini 2.5-Flash | A+V | 34.63 | 34.38 | 50.00 | 37.10 | 56.82 | 51.11 | 44.53
+ RAG | A+V | 38.46 | 39.39 | 45.65 | 41.94 | 54.55 | 55.56 | 46.48
+ RAG + Agent | A+V | 26.92 | 34.38 | 46.65 | 53.23 | 59.09 | 60.00 | 48.83
Open-source Models
Qwen2.5-Omni-3B | A+V | 39.22 | 21.43 | 41.46 | 25.76 | 33.33 | 29.73 | 32.03
+ RAG | A+V | 27.45 | 28.57 | 60.98 | 22.73 | 48.48 | 37.84 | 35.94
+ RAG + Agent | A+V | 37.25 | 32.14 | 41.46 | 25.76 | 42.42 | 51.35 | 37.11
+ RAG + Agent + RL | A+V | 39.22 | 28.57 | 41.46 | 40.54 | 36.36 | 51.35 | 40.09
Qwen2.5-Omni-7B | A+V | 29.41 | 28.57 | 41.46 | 51.52 | 21.21 | 32.43 | 36.33
+ RAG | A+V | 37.25 | 35.71 | 46.34 | 33.33 | 42.42 | 48.65 | 39.84
+ RAG + Agent | A+V | 45.10 | 39.29 | 56.10 | 33.33 | 48.48 | 35.14 | 42.19
+ RAG + Agent + RL | A+V | 27.45 | 32.14 | 51.22 | 59.09 | 45.45 | 45.95 | 44.92
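
As a reading aid, the Avg columns of Tables 1 to 3 can be turned into absolute gains of the full Qwen2.5-Omni-7B pipeline (+ RAG + Agent + RL) over its own baseline. The snippet below only redoes that subtraction with the reported numbers; the dictionary layout and the function name `absolute_gain` are illustrative choices.

```python
# Avg accuracy (%) copied from Tables 1-3 for Qwen2.5-Omni-7B:
# "baseline" is the plain model, "full" is + RAG + Agent + RL.
AVG = {
    "OmniVideoBench": {"baseline": 25.00, "full": 29.69},
    "WorldSense": {"baseline": 30.47, "full": 38.28},
    "Daily-Omni": {"baseline": 36.33, "full": 44.92},
}


def absolute_gain(scores):
    """Difference in percentage points, rounded to two decimals."""
    return round(scores["full"] - scores["baseline"], 2)


gains = {name: absolute_gain(scores) for name, scores in AVG.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} points")
```

The differences come to 4.69, 7.81, and 8.59 percentage points on OmniVideoBench, WorldSense, and Daily-Omni respectively.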

Generalization & Efficiency Analysis

Ablation studies examining the effect of retrieval budget (number of retrieved clips) and per-category performance across different model backbones.

Figure 2: Retrieval budget analysis. (a–b) Accuracy vs. number of retrieved image/audio clips for Qwen2.5-Omni-7B and Qwen2.5-Omni-7B+GRPO. (c–e) Per-category dot-plot comparisons of the baseline and +RAG+Agent+RL variants across question types on OmniVideoBench.

Figure 3: Per-model radar chart comparison on OmniVideoBench. Radar charts for (a) Gemini 2.0-Flash, (b) Gemini 2.5-Flash, (c) Qwen2.5-Omni-3B, (d) Qwen2.5-Omni-7B, and (e) Qwen3-Omni-30B, comparing Baseline vs. RAG+Agent (and +RL for open-source models) across all eight question categories.

Citation

If you find this work helpful, please cite:

@article{zhu2026omnirag,
  title={OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering},
  author={Zhu, Yifan and Mu, Xinyu and Feng, Tao and Ou, Zhonghong and Gong, Yuning and Luo, Haoran},
  journal={arXiv preprint arXiv:2602.03707},
  year={2026}
}