Under review

BayesianVSLNet

Temporal Video Segmentation with Natural Language using Text-Video Cross Attention and Bayesian Order-priors

Carlos Plou* Lorenzo Mur-Labadia* Jose J. Guerrero Ruben Martinez-Cantin Ana C. Murillo
*Joint first authors
DIIS-I3A, Universidad de Zaragoza, Spain

Motivation

The objective of the step grounding task is to localize the temporal boundaries of activities, described in free-form natural language, within long and untrimmed videos.


BayesianVSLNet

Our method, Bayesian-VSLNet, extends the VSLNet architecture, which fails to address the two key challenges of the step grounding task: cyclic actions and long videos. First, Bayesian-VSLNet directly predicts, for each video segment, the probability of being associated with the text query, mitigating the adverse effects of repeated actions. Second, we introduce a novel test-time refinement strategy that leverages Bayes' rule, incorporating temporal-order priors into our predictions. Additionally, we enhance our video representations by aggregating features from several video encoders: Omnivore-L, EgoVideo, and EgoVLP-v2.
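As a rough illustration of the test-time refinement, the sketch below combines the per-segment probabilities predicted for one step query with a Gaussian temporal-order prior via Bayes' rule. The function name, the Gaussian parameterization, and the way the prior is centred on the step's position in the ordered query list are illustrative assumptions; the exact formulation follows the paper.

import numpy as np

def bayesian_refinement(step_probs, step_idx, num_steps, beta=0.1):
    """Refine per-segment probabilities for one step query with a temporal-order prior.

    step_probs: (T,) array, probability of each video segment matching the step query.
    step_idx, num_steps: position of the query among the ordered step descriptions,
        used to centre a Gaussian prior over normalized video time (assumed placement).
    beta: variance of the Gaussian order-prior (see Hyperparameters below).
    """
    T = len(step_probs)
    t = np.linspace(0.0, 1.0, T)                   # normalized temporal position of each segment
    mu = (step_idx + 0.5) / num_steps              # expected position of this step in the video
    prior = np.exp(-0.5 * (t - mu) ** 2 / beta)    # unnormalized Gaussian order-prior
    posterior = step_probs * prior                 # Bayes' rule: posterior ∝ likelihood × prior
    return posterior / (posterior.sum() + 1e-8)    # renormalize over segments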


Hyperparameters

During the Bayesian-based test-time refinement, two hyperparameters play a crucial role. β determines the variance of the prior, which controls the smoothness of the posterior. Once we have the posterior, α sets the threshold (the α-percentile of the posterior probability values) that controls the length of the predicted clip.
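For completeness, a minimal sketch of how α could be applied, assuming the posterior is a 1-D array over video segments and the predicted clip spans the segments whose posterior exceeds the α-percentile threshold (the helper name and return convention are assumptions):

import numpy as np

def predict_clip(posterior, alpha=0.7):
    """Threshold the posterior at its alpha-percentile and return clip boundaries.

    Higher alpha keeps only the most confident segments, so the predicted clip is shorter.
    Returns (start_idx, end_idx) spanning the first to the last segment above the threshold,
    or None if no segment passes.
    """
    threshold = np.quantile(posterior, alpha)      # alpha-percentile of the posterior values
    above = np.flatnonzero(posterior >= threshold)
    if above.size == 0:
        return None
    return int(above[0]), int(above[-1])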


Metrics

We propose two novel temporal grounding metrics, IoU-IndAll and IoU-Grouped, which measure different aspects of the step grounding task.
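Both metrics build on the standard temporal Intersection-over-Union between a predicted and a ground-truth interval; how predictions are aggregated into IoU-IndAll and IoU-Grouped is detailed in the paper. A minimal sketch of the underlying IoU computation, with interval endpoints in seconds:

def temporal_iou(pred, gt):
    """Temporal IoU between two intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0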


Results: Ego4D Goal-Step dataset


Quantitative results

The EgoVis Workshop organized a challenge based on the Ego4D Goal-Step dataset and code. You can find the test-set results for our BayesianVSLNet-v0 in the leaderboard. This initial version currently holds first place :rocket::fire:.

On the validation set, our proposed training and inference strategy enables Bayesian-VSLNet to significantly surpass the baseline VSLNet across all evaluated metrics.


Qualitative results

Bayesian-VSLNet predicts the probability of each step description corresponding to a given video segment. Notably, similar descriptions (e.g., "Adds minced cocoa into milk" vs. "Mix minced cocoa and milk together" in the first example) yield nearly identical predictions. Our temporal ordering mechanism refines Bayesian-VSLNet's probability predictions, aligning them with the correct temporal sequence to enhance segmentation accuracy.


Case study: Robotics

We present qualitative results in a real-world assistive robotics scenario to demonstrate the potential of our approach for enhancing human-robot interaction in practical applications. We select this Mobile ALOHA video.


BibTeX

@misc{plou2024carlorego4dstep,
      title={CARLOR @ Ego4D Step Grounding Challenge: Bayesian temporal-order priors for test time refinement}, 
      author={Carlos Plou and Lorenzo Mur-Labadia and Ruben Martinez-Cantin and Ana C. Murillo},
      year={2024},
      eprint={2406.09575},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2406.09575}, 
}