Motivation
Information retrieval in hour-long videos remains a significant challenge, even for state-of-the-art Vision-Language Models (VLMs), particularly when the desired information is localized within a small subset of frames. Given a question and a one-hour-long video, Video Answer Search (VAS) aims to accurately pinpoint the answer and the short temporal clip that contains it.

FALCONEye: A Novel Video Agent
We propose FALCONEye, a novel video agent that combines a VLM and a Large Language Model (LLM) to reason over and search for relevant information throughout the video and locate the frames containing the answer.
FALCONEye's novelty lies in 1) the proposed meta-architecture, which is better suited to hour-long videos than state-of-the-art short-video approaches; 2) a new, efficient exploration algorithm that locates the information using short clips, captions, and answer confidence; and 3) our calibration analysis of the answer confidence of state-of-the-art VLMs.

Exploration Algorithm
We propose a novel search algorithm that emulates human-like VAS behavior, iteratively focusing on the video clips most likely to contain the answer. It leverages clip captions, question-answer semantics, and confidence scores to optimize localization, avoiding irrelevant clips and concentrating resources on temporally relevant segments until the answer is found.
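
The snippet below is a minimal sketch of such a confidence-guided search loop. The helpers `split_into_clips`, `caption_clip`, `rank_clips_by_relevance`, and `answer_on_clip` stand in for the VLM/LLM calls described above; they (and the default parameters) are illustrative assumptions, not the released FALCONEye implementation.

```python
# Hypothetical sketch of a confidence-guided clip search loop.
# `split_into_clips`, `caption_clip`, `rank_clips_by_relevance`, and
# `answer_on_clip` are placeholders for the VLM/LLM calls, not the actual API.

def video_answer_search(video, question, clip_len=60.0,
                        conf_threshold=0.8, max_iters=20):
    # 1) Split the hour-long video into short clips and caption each one once.
    clips = split_into_clips(video, clip_len)              # [(start_s, end_s), ...]
    captions = {c: caption_clip(video, c) for c in clips}  # VLM clip captions

    explored = set()
    best = None  # (confidence, answer, clip)

    for _ in range(max_iters):
        # 2) Ask the LLM which unexplored clips look most relevant,
        #    given the question and the clip captions.
        candidates = [c for c in clips if c not in explored]
        if not candidates:
            break
        ranked = rank_clips_by_relevance(question, candidates, captions)

        # 3) Inspect the most promising clip with the VLM and read back
        #    both a candidate answer and its confidence.
        clip = ranked[0]
        explored.add(clip)
        answer, confidence = answer_on_clip(video, clip, question)

        if best is None or confidence > best[0]:
            best = (confidence, answer, clip)

        # 4) Stop as soon as the answer confidence is high enough, so resources
        #    concentrate on temporally relevant segments.
        if confidence >= conf_threshold:
            break

    if best is None:
        return None, None, 0.0
    confidence, answer, clip = best
    return answer, clip, confidence
```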

VLMs Calibration
Since our exploration algorithm relies on the VLM's answer confidence, we must evaluate whether these confidence values are well calibrated. For MCQs, confidence computation is straightforward, as only a single token is output. For OQs, however, token probabilities must be aggregated across the entire generated sequence. We investigate various aggregation metrics and identify the geometric average as the most suitable for VLMs. Finally, we adopt reliability diagrams, which group predictions into bins by confidence and measure the gap (calibration error) between confidence and accuracy within each bin. Calibration plots for LLaVA-Video 7B, LLaVA-OneVision 7B, and Qwen2.5-VL 7B reveal surprisingly high calibration performance.
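
As a concrete illustration of the two ingredients above (not the paper's exact evaluation code), the sketch below aggregates per-token probabilities into a geometric-mean confidence and computes the per-bin confidence-accuracy gaps behind a reliability diagram, together with the resulting expected calibration error (ECE).

```python
# Sketch of geometric-mean confidence aggregation and reliability-diagram
# binning with ECE. Illustrative only; not the paper's evaluation code.
import numpy as np


def geometric_confidence(token_probs):
    """Aggregate per-token probabilities of an open-ended answer into a
    single confidence score via the geometric average."""
    token_probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(np.mean(np.log(token_probs))))


def reliability_bins(confidences, correct, n_bins=10):
    """Group predictions into equal-width confidence bins and report per-bin
    accuracy, mean confidence, and the overall expected calibration error."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, bins = 0.0, []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        acc = correct[mask].mean()
        conf = confidences[mask].mean()
        weight = mask.mean()                  # fraction of samples in this bin
        ece += weight * abs(acc - conf)       # gap between confidence and accuracy
        bins.append((lo, hi, acc, conf))
    return bins, ece


# An MCQ answer is a single token, so its confidence is that token's
# probability; an OQ answer aggregates the whole generated sequence.
print(geometric_confidence([0.9, 0.8, 0.95]))  # ~0.88
```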


FALCON-Bench
The FALCON-Bench test set comprises 506 questions built over 80 one-hour-long videos sourced from three recognized datasets: SoccerNet, MovieChat-1K, and Walking_Tours.
While our research focuses on open-ended questions, we provide four choices per question to enable a multiple-choice evaluation mode. In addition, we provide a ground-truth temporal clip annotation (gt_time_interval) and a ground-truth frame (gt_frame_idx) in which the answer is contained.
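
To make the annotation format concrete, here is a hypothetical example of what a single benchmark entry could look like. Only gt_time_interval and gt_frame_idx are field names taken from the description above; every other key and all values are illustrative placeholders, not actual FALCON-Bench data.

```python
# Hypothetical FALCON-Bench entry. Only `gt_time_interval` and `gt_frame_idx`
# are field names mentioned above; all other keys and values are placeholders.
example_entry = {
    "video_id": "walking_tours_0001",        # illustrative identifier
    "question": "What color is the umbrella the street vendor is holding?",
    "options": ["Red", "Blue", "Green", "Yellow"],  # 4 choices for MCQ mode
    "answer": "Red",                                # open-ended ground truth
    "gt_time_interval": [1832.0, 1845.5],    # seconds containing the answer
    "gt_frame_idx": 45900,                   # a frame where the answer is visible
}
```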

Results
We evaluate our meta-architecture using Qwen2.5-VL 7B as the VLM and GPT-4o-mini as the LLM. Most VLMs limited to a small number of sampled frames fail to outperform the LLM-blind baselines on MCQs, suggesting that their advantage over random guessing comes from discarding certain options rather than from actually retrieving the answer from visual content. Only LLaVA-Video and Apollo show clear improvements. This trend is not observed in OQs, where models must generate answers independently, making the OQ evaluation more reliable. Among meta-architectures, only FALCONEye demonstrates robustness for long-form VAS, achieving 70.0% accuracy on MCQs and 44.7% on OQs with its top-performing configuration, and 64.7% and 41.1%, respectively, with the cost-efficient variant.

BibTeX
@article{plou2025falconeyefindinganswerslocalizing,
  title={FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs},
  author={Carlos Plou and Cesar Borja and Ruben Martinez-Cantin and Ana C. Murillo},
  year={2025},
  eprint={2503.19850},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2503.19850},
}