VERA: Explainable Video Anomaly Detection via Verbalized Learning of
Vision-Language Models
Muchao Ye1 Weiyang Liu2 Pan He3
1
The University of Iowa 2 Max Planck Institute for Intelligent Systems – Tübingen
Project Page: https://vera-framework.github.io
3
Auburn University
Existing pipeline 1 uses external LLMs to assist frozen VLMs in reasoning
arXiv:2412.01095v1 [cs.AI] 2 Dec 2024
Abstract
…
VLM
…
The rapid advancement of vision-language models
(VLMs) has established a new paradigm in video anomaly
detection (VAD): leveraging VLMs to simultaneously detect anomalies and provide comprehendible explanations
for the decisions. Existing work in this direction often
assumes the complex reasoning required for VAD exceeds
the capabilities of pretrained VLMs. Consequently, these
approaches either incorporate specialized reasoning modules during inference or rely on instruction tuning datasets
through additional training to adapt VLMs for VAD. However, such strategies often incur substantial computational
costs or data annotation overhead. To address these challenges in explainable VAD, we introduce a verbalized learning framework named VERA that enables VLMs to perform VAD without model parameter modifications. Specifically, VERA automatically decomposes the complex reasoning required for VAD into reflections on simpler, more
focused guiding questions capturing distinct abnormal patterns. It treats these reflective questions as learnable parameters and optimizes them through data-driven verbal
interactions between learner and optimizer VLMs, using
coarsely labeled training data. During inference, VERA
embeds the learned questions into model prompts to guide
VLMs in generating segment-level anomaly scores, which
are then refined into frame-level scores via the fusion of
scene and temporal contexts. Experimental results on challenging benchmarks demonstrate that the learned questions
of VERA are highly adaptable, significantly improving both
detection performance and explainability of VLMs for VAD.
❄
Describe
LLM
Reason
…
❄ or 🔥
…
…
Abnormal Video
Training Sets with Coarse Labels
Test Video
Existing pipeline 2 utilizes instruction tuning to make VLMs integrated VAD systems
…
VLM
🔥
…
…
🔥
…
…
Normal Abnormal Abnormal Normal
Describe & Reason
🧑💻
Test Video
Instruction Tuning Datasets with
Frame-Level Annotations
VERA employs learnable guiding questions to elicit the reasoning of frozen VLMs
VAD Guiding Questions
…
VLM
…
Test Video
Inference
❄
Describe & Reason
Training
❄ Frozen Module
🔥
…
…
…
Abnormal Video
Training Sets with Coarse Labels
🔥 Learnable Module
🧑💻 Extra Annotation
Figure 1. VERA renders frozen VLMs to describe and reason with learnable guiding questions learned from coarsely labeled data.
derstanding and the ability to generate human-interpretable
predictions. The rapid advancement of vision language
models (VLMs) [8, 20, 23, 61] enables us to address both
requirements through their strong visual reasoning and language interaction capabilities. As multi-modal architectures
that effectively combine the reasoning capabilities from
large language models (LLMs) [4] and the visual understanding capabilities from pretrained vision encoders [9],
VLMs are particularly well-suited for VAD for they can
offer explainable predictions that clearly illustrate the rationale behind specific anomalies, making the results more
interpretable to users. Recent research on VAD has consequently focused on how to effectively leverage the power of
pretrained VLM. As shown in Fig. 1, existing approaches
aim to address the misalignment problem between VLMs’
pretraining tasks and the VAD requirements through either
additional reasoning modules or instruction tuning (IT):
• One line of research introduces external LLMs to assist
frozen VLMs to reason in VAD [49, 55]. It uses VLMs to
caption what they see given a video, and the descriptions
are then passed to an external LLM, e.g., GPT-4 [1], to
reason whether an anomaly occurs.
• Another line of research, instead, expands VLMs to generate explainable prediction via IT [29, 58]. This research line creates additional VAD datasets with framelevel annotations and leverages exemplary instructions to
fine-tune the VLM, enabling it to detect anomalies and
generate human-interpretable explanations.
1. Introduction
Video anomaly detection (VAD) aims to automatically identify unexpected and abnormal events in video sequences,
with broad applications ranging from autonomous driving [2] to industrial manufacturing [34]. While achieving
good performance in VAD is essential, providing clear explanations for detected anomalies is even more crucial.
To this end, our work primarily focuses on explainable VAD, which requires both comprehensive visual un1
VAD Question for InternVL2-8B
AUC (%)
“Describe the video and is there any anomaly?” [29] 53.05
“Are there any abnormal events in the video?” [58] 65.03
Key Observations and Research Question. While prior
research demonstrates the potential of applying VLMs to
VAD, we identify that this new paradigm is hindered by a
shared critical issue: the use of additional reasoning modules or fine-grained labeled datasets incurs significant computational cost either in the inference or training phases.
First, decoupling a VAD system into a frozen VLM and
an extra LLM introduces more overhead in inference, because it separates the description generation and reasoning processes. Secondly, although IT-based methods enable
VLMs to effectively integrate description and reasoning for
VAD, they require additional manpower and computational
resources for annotating and finetuning on fine-grained labeled instruction datasets, which is time-consuming and not
scalable for large-scale datasets. In light of this, we investigate the following unexplored yet important question:
Table 1. Instructing a frozen VLM (InternVL2-8B [8]) with simple questions to perform VAD yields poor AUC on UCF-Crime [35] dataset.
for video-specific properties like temporality while relying
solely on provided coarse video-level labels. After that,
considering the large scale of video frames, VERA assigns
a fine-grained anomaly score for each frame in a coarse-tofine manner in the inference phase. First, VERA generates
segment-level anomaly scores by querying VLMs with the
learned guiding questions. Next, VERA improves the initial score by incorporating scene context into each segment
score via ensembling. Finally, VERA outputs frame-level
scores by fusing temporal context via Gaussian smoothing
and frame-level position weighting.
Contributions. To sum up, our contributions are:
• To our knowledge, we present the first approach, that is,
VERA, to adapt frozen VLMs as an integrated system for
VAD by learning detailed anomaly-characterization questions in prompts that decompose anomalies into concrete
and recognizable patterns. VERA learns them directly
from coarsely labeled datasets, eliminating the need for
IT or external reasoning modules.
• We introduce an effective verbalized learning-based algorithm for VLMs in VAD, allowing direct adaptation without modifying model parameters. With coarse labeled
VAD datasets only, our approach obtains good guiding
questions in VAD by relying on the verbal interaction between learner and optimizer VLMs in verbalized training.
Additionally, we design a coarse-to-fine strategy to derive
frame-level anomaly scores from verbally learned guiding
questions in VAD, integrating both scene and temporal
contexts for better VAD performance and reasoning.
• The learned guiding questions from VERA are expressed
in natural languages, providing a unified method to encode and transfer prior VAD knowledge seamlessly to
other datasets or VLMs. In challenging VAD datasets like
UCF-Crime [35] and XD-Violence [45], VERA achieves
state-of-the-art explainable VAD performance and enjoys
good generalization ability across models and datasets.
Can we enable a frozen VLM to integrate description and
reasoning for VAD without instruction tuning?
Our Approach. This research question is nontrivial because the reasoning ability of a frozen VLM is limited in
general visual tasks, and it struggles to handle complex reasoning tasks like VAD, which requires the understanding
of subtle, context-dependent outliers. To illustrate, Table 1
shows that prompting frozen VLMs with simple VAD questions used in existing works leads to unsatisfactory results.
Thus, instruction-tuning a VLM seems necessary to make it
responsive to specific instructional cues and capture delicate
visual variations. In this paper, we question the necessity of
such an operation and propose a principled approach to tailor frozen VLMs for VAD.
Specifically, our solution is guided by the intuition that
the reasoning ability of VLMs for VAD will improve if we
find questions with suitable and concrete description of abnormal patterns rather than with abstract and general words
like “anomaly” to prompt them. Our idea is to iteratively refine anomaly descriptions from abstract ones (e.g., “is there
any anomaly?”) to detailed, specific characterizations.
Driven by such insight, we propose a framework, termed
VERA, to explore verbalized learning for VAD. This framework considers the practical constraint that it is suboptimal to manually write down VAD guiding questions across
VLMs, so it introduces a data-driven learning task to identify suitable anomaly-characterization questions containing concrete abnormal patterns for the frozen VLM using coarsely labeled datasets, eliminating the need for IT.
Specifically, in the training phase, VERA treats the questions guiding the reasoning of VLMs in VAD as learnable
parameters, improving them based on the verbal feedback
from an optimizer VLM on the performance of a learner
VLM on an intermediate VAD subtask—binary video classification for each video in the VAD training set. This design is both efficient and appropriate for VAD, as it accounts
2. Related Work
Video Anomaly Detection. VAD is the task of localizing
frames that contain abnormal events in a given video. This
task is challenging for anomalies cover a broad scope of
events like accidents and criminal activities while training
sets only offer coarse annotations. Modern VAD methods
are based on deep neural networks (DNNs) for their superiority and are going through a paradigm shift in using
VLMs: (1) Early DNNs for VAD are task-specific, which
often employ unsupervised (including one-class) or weakly
2
3. The VERA Framework
supervised (WS) learning techniques for training. Most unsupervised learning methods [25, 28, 40, 41, 51, 59] train
DNNs on frame reconstruction/prediction tasks to establish representation spaces for normal/abnormal videos. WS
learning methods [6, 30, 35, 50, 56] leverage both normal
and abnormal videos to train a feature extractor that distinguishes anomalies from normalcy, typically using multiple
instance learning [35] objectives. (2) Recent VAD methods
adopt VLMs due to their remarkable success across core
vision tasks [13, 23, 31]. Early research [29, 49, 55, 58]
has leveraged VLMs to generate textual descriptions of detected anomalies to enhance prediction explainability for
VAD. However, current approaches incur high processing
demands from external LLMs or require substantial effort
and cost for fine-tuning on additional datasets, which are
computationally inefficient in training or inference. Our
work reduces the processing overhead by adapting frozen
VLMs for VAD without model parameter modification or
extra reasoning modules via learnable guiding questions,
which elicit superior reasoning from frozen VLMs and significantly boost their performance in VAD.
Our approach adapts VLMs to detect video anomalies without additional reasoning modules or instruction tuning. We
now formulate the VAD task and detail the design of VERA.
3.1. Problem Formulation
Video Anomaly Detection. Let V be a video with F
frames, represented as V = {Ii }F
i=1 , where Ii is the i-th
frame (1 ≤ i ≤ F ). Our objective is to locate and detect
the start and end of anomalous events within V . In standard
labeling, any frame associated with an anomaly is labeled as
1, and normal frames are labeled as 0. Therefore, the ground
truth label sequence for V is Y = [y1 , . . . , yF ], where
yi ∈ {0, 1} represents the fine-grained label for Ii . We
aim to use a frozen VLM, fVLM , to generate anomaly score
predictions across all frames, Ŷ = [ŷ1 , . . . , ŷF ], where
ŷi ∈ [0, 1] is a continuous anomaly score for Ii .
Available Training Data for VAD. Typically, VAD
datasets only provide coarsely labeled training sets [25,
28, 35, 45]. We denote a VAD training set as D =
{(V (j) , Y (j) )}N
j=1 , where N is the total number of train(j)
ing videos, V
represents the j-th video (1 ≤ j ≤ N )
and Y (j) is the corresponding video-level label. Y (j) = 1
if V (j) contains any anomaly defined by the dataset annotators, e.g., abuse or arson activities, and Y (j) = 0 if V (j) has
no anomalies. For V (j) , we suppose it contains Fj frames
(j) Fj
and denote the frames sequence as V (j) = {Ii }i=1
, where
(j)
(j)
Ii is the i-th frame (1 ≤ i ≤ Fj ) in V .
Verbalized Learning for VLMs. The designed verbalized
learning framework is inspired by a recent technique called
verbalized machine learning (VML) [47]. The main idea of
VML is to use LLMs to approximate functions and learn
the verbal rules and descriptions of performing specific
tasks, which casts traditional machine learning tasks such
as regression and classification as language-based learning tasks. This approach regards the language expressions
that define classification rules and other task-specific criteria as learned parameters, and optimize them in a datadriven fashion through interactions between a learner and
an optimizer modeled by LLMs or VLMs. However, the
VML framework is limited to tasks involving regression on
scalar values or classification for static images. Later, another concurrent method, TextGrad [52], is proposed under a similar idea, which integrates the process of incorporating textual feedback from LLMs for improving prompts
in PyTorch and further proves its effectiveness in coding, question answering, and optimization in chemistry and
medicine. Compared to existing works, our work pioneers
verbalized learning for the VAD task and video data, which
remains unsolved for previous verbalized learning frameworks focus on tasks with static input data and cannot handle the challenges of temporality and scene dynamics in the
input for a complex visual reasoning task like VAD. Specifically, VERA introduces a new learning paradigm for VAD:
generating effective questions that encapsulate key abnormal patterns in videos to elicit the reasoning ability from
VLMs for explainable VAD. Additionally, VERA works for
any VAD dataset and supports WS learning. Unlike previous WS methods, VERA only needs to learn concise text
but not millions of parameters, so the training is lightweight.
3.2. Training in VERA: Finding Guiding Questions
for VAD via Verbalized Learning
Training Objective. We aim to learn guiding questions that
break down a complex and ambiguous concept (i.e., what
is an “anomaly”) into a set of identifiable anomalous patterns to unlock reasoning capabilities within frozen VLMs
for VAD tasks. Those patterns vary among datasets, making manually designed descriptions ineffective for generalization. To address this, we propose a general verbalized
learning framework shown in Fig. 2 to generate the desired guiding questions. We denote the guiding question
set as Q = {q1 , . . . , qm }, where qi is the i-th question
(1 ≤ i ≤ m) and m is the number of questions. The training
framework considers Q as the learnable parameters, which
are optimized through verbal interaction between a learner
and an optimizer, modeled by VLMs through leveraging
their ability to follow instructions with given prompts.
Training Data. The training data for learning Q consist of
paired sampled video frames and video-level labels. Sampling is necessary because the amount of video frames is
so huge that we cannot compute with every frame. We explore three types of sampling strategies and find that uniform sampling [57] yields the best results. We will use it for
3
Guiding Questions 𝑸𝒕 (Learnable Parameters)
Input Video Frames 𝑉! (")
Forward Pass for Prediction
Backward Pass for Optimization
1. Are there any unusual movements of the vehicles in the video frames?
2. Are there any unusual behaviors of the people in the video frames?
3. Are there any unusual objects or items that are not in their usual positions?
4. Are there any objects or people that appear out of place in the scene?
5. Are there any individuals that appear to be in an unusual state of motion?
Ground Truth : 𝑌 (")
𝜃
())
Learner 𝑓,-"./-.
Update
Batched Inputs, Predictions,
and Ground Truths
𝑉!"#$% , 𝑌%!"#$% , 𝑌!"#$%
Prediction 𝑌% (0)
())
Optimizer 𝑓&'#
𝜓
Learner Prompt Template 𝜃
Optimizer Prompt Template 𝜓
You are the model.
Model Description: You are designed to do binary
classification. The input is a sequence of video frames for
identifying whether there is an anomaly in the video; you
need to output the class label, i.e., an integer in the set {0,
1}. 0 represents normal video, and 1 represents abnormal
video. Please answer the prompt questions.
Prompt Questions: Answer the following questions based
on what you see from the video frames and provide an
explanation in one sentence.
[Guiding Questions 𝑸𝒕]
Based on the analysis above, please conclude your answer
to “Is there any anomaly in the video?” in “Yes, there is an
anomaly” or “No, there is no anomaly”.
Input:
[Video Frames 𝑉! (") ]
Please give your output strictly in the following format:
• Answers to Prompt Questions: [Provide your analysis
by answering the questions listed in Prompt Questions.]
• Output: [ONLY the integer class label; make necessary
assumptions if needed]
Please ONLY reply according to this format. Don't give
me any other words.
You are the optimizer for a model. Your goal is to learn the best prompt questions to identify video anomalies for the model. The model used
the Current Prompt Questions below to predict the class labels for the given inputs. You are given the target labels. Please optimize the
Current Prompt Questions for better prediction.
Inputs:
[A batch of input video frames 𝑉%&'() ]
Model Description: You are designed to do binary classification. The input is a sequence of video frames for identifying whether there is an
anomaly in the video; you need to output the class label, i.e., an integer in the set {0, 1}. 0 represents normal video, and 1 represents abnormal
video. Please answer the prompt questions.
Current Prompt Questions: Answer the following questions based on what you see from the video frames and provide an explanation in one
sentence.
[Guiding Questions 𝑸𝒕]
Based on the analysis above, please conclude your answer to “Is there any anomaly in the video?” in “Yes, there is an anomaly” or “No, there
is no anomaly”.
The Model Predictions:
[A batch of predictions 𝑌&%&'() made by the learner]
[A batch of video-level ground truths 𝑌%&'() for the input video frames]
The Targets:
If the model is doing well, you can keep using the current prompt questions. However, if the model is not performing well, please update the
model by improving upon the “Current Prompt Questions”, which should result in lower classification error both on the current and the next
batch of i.i.d. data. Limit your “New Prompt Questions” to be no more than five questions! Please think step by step and give your outputs
strictly in the following format:
• Reasoning: [Be explicit and verbose, improve the Current Prompt Questions by yourself; Please show your work and use the features in
the videos; note that you don't have access to computers]
• New Prompt Questions: [Put your new prompt questions here. The questions MUST be based on the features in the input videos. Please
limit the number of questions to be at most five!]. Please ONLY reply according to this format. Don't give me any other words.
Figure 2. The overall training pipeline in VERA aims to optimize VAD guiding questions iteratively. In each iteration, the optimization is verbalized by
providing verbal instructions for the learner and optimizer to follow. They will generate predictions and new guiding questions, respectively.
the input in a learning task, and Qt , the learnable guiding questions applied in each iteration t, constitutes the
core parameters that distinguish the learner between iter(t)
ations. Meanwhile, we introduce an optimizer fopt to assess the quality of the predictions of the learner and to optimize Qt . W.l.o.g., we use the same frozen VLM fVLM
to model the optimizer. As demonstrated in Fig. 2, we provide another specific prompt template ψ for the learner to
follow to optimize Qt , so we denote the optimizer agent as
(t)
fopt (z) = fVLM (z; (ψ, Qt )), where z is its input and ψ is
the instruction to improve Qt . It is important to note that
(t)
(t)
(t)
flearner ̸= fopt because flearner follows (θ, Qt ) to conduct a
illustration here, and please refer to the experiment section
for details on other sampling methods. To illustrate, with
any video V (j) ∈ D, we first calculate the interval between
sampled frames as l = floor(Fj /S), where S is the number
of sampled frames, and floor denotes rounding down to the
nearest integer. Given l, the uniformly sampled frames from
(j) (j)
(j)
V (j) are represented by Ṽ (j) = [I1 , Il+1 , . . . , I(S−1)·l+1 ].
The label used for training is Y (j) only, resulting in training
data pairs {(Ṽ (j) , Y (j) )}N
j=1 for VERA.
Updating Q via Learner and Optimizer. Since Q are verbal expressions for specific anomaly patterns, VERA inherits the idea of VML [47] in training: optimizing languagebased parameters by verbal communication between a
learner agent flearner and an optimizer agent fopt , rather
than by numerical optimization algorithms like Adam [18].
We take an arbitrary iteration t for illustration in this section. Please refer to Algorithm 1 in Sec. A for the complete
iterative training in VERA.
(t)
learning task, while fopt follows (ψ, Qt ) to refine Qt .
Learning Task for flearner . The learner executes the “forward pass” and outputs a prediction. Recall that we only use
the original coarsely labeled information for training. Thus,
we design a binary classification task for flearner , which accounts for the temporal nature of video data, the sparsity
of anomalies, and the weak supervision in VAD datasets. In
this task, the job of the learner flearner is to produce a binary
classification prediction Ŷ (j) to determine whether there is
an anomaly in the video based on the sampled frames Ṽ (j) .
As shown in Fig. 2, we explain the task in natural language
in the “Model Description” section in θ. Guiding questions Qt are inserted in the “Prompt Questions” section in
θ to elicit reasoning of the VLM. This template design is
based on the prompt structures used in VML, with targeted
modifications to help the learner effectively address this WS
learning task. Due to the space limit, please refer to the Ap-
Learner and Optimizer. We denote any LLM-based
model as f (x; ϕ) where x represents the input data, and ϕ
denotes the natural language instructions for f to follow,
which is considered as learnable parameters in our verbalized learning framework. Specifically, Q contains parameters to be learned in VERA. As depicted in Fig. 2, in each
(t)
iteration t, the learner agent flearner is modeled by the frozen
VLM fVLM (·) used for VAD with a specific prompt template θ that guide fVLM (·) to conduct a learning task by
pondering on current guiding questions Qt . We denote the
(t)
learner agent as flearner (x) = fVLM (x; (θ, Qt )), where x is
4
Step 1: Initial Anomaly Scores via Learned Guiding Questions
pendix for detailed information on θ. Given θ and a sampled
frame set Ṽ (j) , the learner will output a prediction as
(t)
Ŷ (j) = flearner (Ṽ (j) ),
(1)
{
(j)
(t)
Prompt with Guiding
Questions (𝜃, 𝑸∗)
𝑦"%
Initial Anomaly Score
for the 𝑢-th Segment
Step 2: Ensemble Segment-Level Anomaly Scores with Scene Context
Retrieve
𝑉(
{
}
...
All Segments in 𝑉
By Eq. (4)
{𝑦"! & , ⋯ , 𝑦"! ' }
𝑦(%
Relevant Segment
Anomaly Scores
Ensemble
Score
%
%
Step 3: Frame-Level Anomaly Scoring with Temporal Context
[𝑦(" , ⋯ , 𝑦(# ]
Gaussian
Smoothing
Segment-Level
Anomaly Scores
Flattening &
Position Weighting
[𝛾"̅ , ⋯ , 𝛾#̅ ]
[𝑦,", ⋯ , 𝑦,$ ]
Smoothed Segment-Level
Anomaly Scores
Frame-Level
Anomaly Scores
Figure 3. VERA computes anomaly scores with Q∗ in three steps.
any context. To resolve it, we refine the initial segment-level
score by incorporating scene context—defined as preceding
and following segments that contain similar elements, such
as actors and background, to those in the current segment.
We measure the relevance between different video segments by the cosine similarity of their feature representations [24], extracted by a pretrained vision feature extractor
g, e.g., ImageBind [11]. For the u-th segment Vu , its similarity
with any segment Vw (1 ≤ w ≤ h) is sim(u, w) =
·ew
, where cos denotes the cosine function,
cos ||eueu||·||e
w ||
(2)
where Qt+1 is a new set of guiding questions constructed
(t)
from fopt owing to its text generation and instruction following abilities after reading ψ. Due to space constraints,
please refer to the Appendix for information about ψ. As
shown in Algorithm 1 in the Appendix, we will repeat
Eq. (1) and Eq. (2) for P iterations to optimize Q. We denote the one with the largest validation accuracy as Q∗ .
and eu = g(Vu ) and ew = g(Vw ) represent their features.
(K)
(1)
Let κu = [κu , . . . , κu ] denote the indices of the top-K
segments similar to Vu . We refine the anomaly score by
3.3. Inference in VERA: Coarse-to-Fine Anomaly
Scoring by Guiding Questions and Contexts
Given Q∗ , VERA yields fine-grained anomaly score Ŷ for
a test video V via a coarse-to-fine process shown in Fig. 3.
Step 1: Initial Anomaly Scores via Learned Guiding
Questions. We divide the video into segments and analyze each segment independently first. Following [55],
we perform equidistant frame sampling within V to obtain the set of each segment center C, resulting in C =
{I1 , Id+1 , · · · , I(h−1)·d+1 }, where d is the interval between
centers and h = floor(F/d) is the total number of segments. For each center frame I(u−1)·d+1 (1 ≤ u ≤ h), we
define a 10-second window around it as the u-th segment,
within which we uniformly sample 8 frames. We denote the
sampled frame set in the u-th segment as Vu . Next, we input
Vu in fVLM with the prompt (θ, Q∗ ) to get the initial score
ỹu = fVLM (Vu ; (θ, Q∗ )),
}
...
All Segments in 𝑉
where Ŷ
= 1 if the learner thinks there is an anomaly after skimming across the sampled frames Ṽ (j) and reasoning
through the guiding questions Qt , and otherwise, Ŷi = 0.
The optimizer executes
Optimization Step in fopt .
the “backward pass” to update the questions Qt via
a mini-batch (batch size is n).
Suppose the visual
(1)
(n)
input in a batch is Vbatch = [Ṽbatch , · · · , Ṽbatch ]
and the corresponding ground truths are Ybatch =
(1)
(n)
[Ybatch , · · · , Ybatch ]. The learner generates prediction as
(1)
(n)
Ŷbatch = [Ŷbatch , · · · , Ŷbatch ] with the current questions
Qt by Eq. (1). The optimizer will output a new set of questions Qt+1 by following the prompt ψ with batched data.
We denote the optimization step as
Qt+1 = fopt (Vbatch , Ŷbatch , Ybatch ),
Frozen VLM 𝑓!"#
𝑉( in the 𝑢-th Segment
ȳu =
K
X
(i)
exp(sim(u, κu )/τ )
ỹκ(i) · PK
,
(j)
u
i=1
j=1 exp(sim(u, κu )/τ )
(4)
where ȳu is an ensemble of initial scores of top-K video
segments relevant to Vu . Here, the initial score of each retrieved segment is weighted by a factor derived from the
cosine similarity and normalized by the Softmax function
(with τ as the temperature hyperparameter). Accordingly,
scenes with greater similarity are assigned higher weights,
making the ensemble score a more comprehensive reflection of anomalies with the video context. By applying
Eq. (4) for all segments, we obtain Ȳ = [ȳ1 , . . . , ȳh ].
Step 3: Frame-level Anomaly Scoring with Temporal
Context. Given Ȳ , we aim to incorporate temporal context to capture how events evolve over time when computing
frame-level anomaly scores, for the abnormality of an event
often depends on the timing and progression of observed
activities. To detail, we first apply Gaussian smoothing [12]
to aggregate local temporal context into the segment-level
anomaly scores. We denote the Gaussian kernel (suppose
2
the filter size is ω) as G(p) = exp( −p
) where p is the
2σ12
distance from the kernel center and σ1 is the variance. We
update segment-level scores as Γ̄ = Ȳ ∗ G = [γ̄1 , · · · , γ̄h ],
(3)
where ỹu = 1 if fVLM thinks the segment contains an
anomaly after reasoning via Q∗ with Vu , and otherwise,
ỹu = 0. By repeating Eq. (3) for each segment, we have
a segment-level initial anomaly score set Ỹ = [ỹ1 , · · · , ỹh ].
Step 2: Ensemble Segment-Level Anomaly Scores with
Scene Context. Note that the scores derived above only examine a short moment in a long video without considering
5
where ∗ is the convolution operation. Next, we integrate
global temporal context by position weighting. With Γ̄, we
flatten it into frame-level scores by assigning the score γ̄u to
each frame in the u-th segment, i.e., [I(u−1)·d+1 , · · · , Iu·d ].
We denote the frame-level score sequence after flattening
as [ρ1 , · · · , ρF ]. We then apply the Gaussian
function
to
encode position weights as w(i) = exp
−(i−c)2
2σ22
ones related to violence. This dataset is collected from
movies and YouTube videos. It has 3954 training videos
and 800 test videos (500 abnormal ones and 300 normal
ones). The total number of test frames is over 2 million
(2,335,801), and abnormal frames account for 23.07%.
The average duration of a test video is 1.62 minutes.
, where
Metrics. Following approaches in [55, 58], we evaluate
VAD performance using the Area Under the Curve (AUC)
of the frame-level Receiver Operating Characteristic (ROC)
curve, as it provides a comprehensive measure of model
performance across all thresholds. It is a comprehensive
representation for evaluating the ability of a method to distinguish between anomaly and normality across different
thresholds in VAD. As for average precision (AP), the area
under the frame-level precision-recall curve, it is another
VAD performance metric mostly used for the XD-Violence
dataset. Compared to AUC, this metric mainly focuses on
the performance of VAD method in identifying anomalous
events. In other words, AP pays attention to classifying the
anomaly correctly rather than the overall separation. We
report AP results for XD-Violence in the Appendix.
Baselines. We categorize baselines into non-explainable
approaches and explainable ones as [58] does. Nonexplainable ones are obtained by WS learning [7, 10, 17,
19, 21, 35, 38, 44–46, 53, 54, 60] and unsupervised learning [14, 28, 36, 37, 40, 41]. These non-explainable approaches cannot provide language-based explanations for
VAD and have following characteristics:
i (1 ≤ i ≤ F ) is any frame index, c = floor(F/2) is the
center frame index, and σ2 is the variance. The anomaly
score for the i-th frame is:
ŷi = w(i) · ρi .
(5)
This operation scales the score ρi , diminishing the anomaly
score for frames near the beginning and end of the event.
This helps better capture the temporal progression of
anomalies: the score gradually increases as the anomaly
reaches its peak and decreases afterward. The final scores
is denoted as Ŷ = [ŷ1 , . . . , ŷF ] after applying Eq. (5).
Explainable VAD by VERA. When using template θ embedded with Q∗ to compute Ŷ , we ask the VLM to “provide
an explanation in one sentence” when reasoning, and VLM
will explain the anomaly score it assigns afterward based on
Q∗ . Please refer to Sec. 4.4 and Sec. B.4 in the Appendix
for the demonstration of explainable VAD by VERA.
4. Experiments and Results
In this section, we present an evaluation of VERA as follows, addressing key questions of interest including: (Q1)
Does it enhance the effectiveness of frozen VLMs in VAD?
(Q2) Is its design reasonable and well-structured? (Q3)
How well does it generalize across different scenarios?
• WS learning methods [7, 10, 17, 19, 21, 35, 38, 44–
46, 53, 54, 60] usually use task-specific learning models with pretrained weights such as C3D [39], I3D [5],
VideoSwin [27], ResNet [15], and ResNext [48] to extract feature for each video segment. Based on that, they
form the training of classifiers, which output predictions
after the feature extractors, as a multiple instance learning
task, regarding the segments containing anomaly scenes
as positive bags and the others as negative bags to handle
the lack of frame-level annotations and the uncertainty of
the anomaly locations in the video. Such learning objectives can fully use the only available video-level label information and effectively improve the discriminative ability of the classifiers in the network. However, the trained
neural networks from these methods operate on highly abstract features that are hard for humans to interpret.
• Unsupervised learning methods [14, 28, 36, 37, 40, 41]
improve the discriminative ability of the models regarding anomalies and normality without any knowledge of
the video label. Note that we include one-class learning [14, 28, 40, 41] methods in this category. Unsupervised methods mostly learn reconstruction models from
unlabeled data and use reconstruction errors to distinguish normal and abnormal video frames. Another common strategy [36, 37] is introducing pseudo-labels for un-
4.1. Experimental Settings
Datasets. We conduct experiments on two large-scale VAD
datasets: (1) UCF-Crime [35] and (2) XD-Violence [45].
The details are as follows:
• UCF-Crime dataset is collected from real-world surveillance videos (128-hour long in total), covering crimerelated anomalies including abuse, arrest, arson, assault,
burglary, explosion, fighting, road accident, robbery,
shoplifting, shooting, stealing, and vandalism. The training set has 1610 videos (810 abnormal ones and 800 normal ones), while the test set has 290 videos (140 abnormal ones and 150 normal ones). The total number of
test frames is over 1 million (1,111,808), and abnormal
frames account for 7.92%. The average duration of a test
video is 2.13 minutes, which is relatively long compared
to common video datasets and serves as a benchmark.
• XD-Violence is another representative large-scale (217hour long in total) VAD dataset with 6 anomaly categories, i.e., abuse, car accident, explosion, fighting, riot,
and shooting, which defines anomalous events as the
6
Similar advantages are
also observed in Table 3 for
XD-Violence. Considering
multiple factors, including
performance, training efficiency, system integration,
and explainability, VERA
stands out as a promising
pipeline for VLMs in VAD.
labeled data and using this information to train discriminative models for VAD. Still, these methods cannot produce explainable results for VAD due to the structure gap.
For explainable approaches, we use LAVAD [55],
Holmes-VAD [58], and VADor [29] as representatives of
Pipeline 1 and Pipeline 2 shown in Fig. 1. It should be
noted that [49] does not report performance on UCF-Crime
and XD-Violence. Additionally, we include zero-shot (ZS)
VAD by frozen VLMs designed by [55] as baselines.
Implementation of VERA. In our experiments, we choose
a small VLM, InternVL2-8B [8], as the backbone fVLM
for building VERA by default, if not otherwise specified.
With this choice, we implement VERA on an NVIDIA
RTX A6000 GPU. We also explore other backbones, such
as Qwen2-VL-7B [43] and larger model variants of InternVL2 [8] for ablation. In principle, VERA works well
with different backbones. We train Q for no more than 10
epochs, with a validation accuracy calculated every 100 iterations to determine the optimal Q∗ . The used Q∗ is given
in Fig. 5. We set n as 2, S as 8, and m as 5 for training and
include the discussion in Sec. 4.3. Refer to the Appendix
for more details on the hyperparameters in inference.
4.3. Ablation Studies
AUC
Method
Non-Explainable VAD Methods
Hasan et al. [14]
50.32
Lu et al. [28]
53.56
BODS [41]
57.32
61.56
GODS [41]
RareAnom [36]
68.33
Explainable VAD Methods
LAVAD [55]
85.36
ZS CLIP [55]
38.21
58.81
ZS IMAGEBIND-I [55]
ZS IMAGEBIND-V [55]
55.06
LLAVA-1.5 [22]
79.62
VERA
88.26
We perform necessary abla- Table 3. AUC (%) on XD-Violence.
tion studies on UCF-Crime
to answer both Q2 and Q3 for a comprehensive evaluation.
Training Frame Sampling
AUC (%)
Strategy
Strategy. We compare three
Random [3]
83.63
frame sampling strategies for
TSN [42]
82.63
obtaining each Ṽ (j) in training:
Uniform [57]
86.55
uniform sampling,
random Table 4. Sampling strategies
sampling, and TSN sampling explored in VERA training.
(random sampling from equally
divided segments). Table 4 shows that uniform sampling
performs the best (with batch size n = 2 and S = 8). This is
because uniform sampling preserves the temporal structure
and maintains consistent motion patterns throughout the
long video, making it easier for VLMs to understand the
video and update Q.
Batch Size and Sampled Frame Number. Key hyperparameters that need to be set in training are the batch size n
and the number of sampled frames S for each video V (j) in
the verbalized learning framework. The selection of S and
n are correlated because they determine the total number of
frames for the optimizer to skim and provide feedback as
S · n. In implementation, we will face memory constraints
when implementing VLMs on GPUs. In our training, we
find in the general case fVLM used for training can handle at most 16 frames when we implement VERA on an
NVIDIA RTX A6000 GPU, so we set S · n = 16 in training. We further explore the trade-off between S and n given
the constraints for input frames to decide S and n.
The results are shown
Batch Size Sampled Frames AUC (%)
in Table 5. If the batch
n=1
S = 16
81.53
size n is 1 with S =
n=2
S=8
86.55
16, the learned quesn=4
S=4
83.19
n=8
S=2
79.91
tions cannot be generalized due to the limited Table 5. The choice of batch size and
video sample in the batch sampling frames affects the effectivewhich leads to a subop- ness of the learned guiding questions
in VAD. The results are obtained by
timal AUC, and it takes InternVL2-8B as VERA’s backbone.
longer to train for VERA.
Meanwhile, if we set n as large numbers like 4 or 8 (with
S = 4 or S = 2), the learned questions are suboptimal too
because relatively few sampled frames generally lack the
4.2. Comparison to State-of-the-art Methods
AUC
We address Q1 by empiri- Method
cally comparing VERA to Non-explainable VAD Methods
82.44
existing VAD methods. First, Wu et al. [45]
OVVAD [46]
86.40
in Table 2, VERA achieves
S3R [44]
85.99
the highest AUC among RTFM [38]
84.30
explainable VAD methods on MSL [21]
85.62
UCF-Crime, outperforming MGFN [7]
86.98
87.43
Holmes-VAD and VADor SSRL [19]
87.58
(without instruction tuning, CLIP-TSA [17]
Sultani et al. [35]
77.92
as reported in their papers)
GCL [54]
79.84
in a fair comparison. Impor- GCN [60]
82.12
tantly, unlike these methods, MIST [10]
82.30
VERA does not need to CLAWS [53]
83.03
84.50
modify the model parameters, DYANNET [37]
66.85
demonstrating its suitability Tur el al. [40]
GODS [41]
70.46
to directly adapt VLM to the
Explainable VAD Methods
VAD task with minimal trainLAVAD [55]
80.28
ing requirements. Moreover, Holmes-VAD [58]
84.61
VERA surpasses LAVAD by VADor [29]
85.90
6% in AUC on UCF-Crime, ZS CLIP [55]
53.16
uniquely integrating both ZS IMAGEBIND-I [55] 53.65
description and reasoning ca- ZS IMAGEBIND-V [55] 55.78
72.84
pabilities in VAD. Compared LLAVA-1.5 [22]
VERA
86.55
to non-explainable methods,
VERA achieves AUC per- Table 2. AUC (%) on UCF-Crime.
No instruction tuning is used for
formance that is comparable Holmes-VAD and VADor.
to one of the top-performing
methods, CLIP-TSA, on UCF-Crime, while offering the
additional advantage of explainable predictions.
7
temporality for the optimizer to look into the details and
conceive good questions. Thus, setting n to 2 and S to 8 is
in default in this paper, which strikes the balance between
training efficiency and effectiveness.
VLM architectures, and datasets to address Q3.
First, we apply VERA
Source of Q
fVLM
InternVL2-8B InternVL2-40B
to InternVL2-40B, a
86.55
80.43
larger model in the InternVL2-8B
86.72
InternVL2 family com- InternVL2-40B 85.24
Table
8.
AUC
(%)
across
model
sizes .
pared to InternVL2-8B.
Source of Q
As shown in TafVLM
InternVL2-8B Qwen2-VL-7B
ble 8, InternVL2-40B
86.55
81.37
achieves
effective InternVL2-8B
Qwen2-VL-7B
79.60
82.64
AUC
performance,
Table 9. AUC (%) across architectures.
slightly exceeding that
Source of Q
of InternVL2-8B, indiDataset
UCF-Crime XD-Violence
cating that verbalized
UCF-Crime
86.55
80.42
learning in VERA en- XD-Violence 86.26
88.26
ables models of various
Table 10. AUC (%) across datasets.
scales to identify a Q
suitable for their reasoning capabilities. Additionally, We
also evaluate the transferability of Q across different scales
and and observe an interesting phenomenon: the Q learned
by InternVL2-8B remains effective for InternVL2-40B,
but not vice versa. This is likely because the Q learned
by the smaller model is readily interpretable by the larger
model, whereas the Q derived from the larger model is
more complex in syntactic structure and does not align
well with the reasoning framework of the smaller model.
Secondly, we select a different VLM, Qwen2-VL-7B [43],
as the backbone for VERA. As shown in Table 9, while the
AUC achieved with Qwen2-VL-7B is lower than that with
InternVL2-8B, the verbalized learning in VERA remains
effective, allowing it to outperform notable baselines such
as LAVAD [55]. However, a notable gap exists when
transferring Q across different model architectures in
Table 9. Developing a universal Q that can effectively
elicit reasoning capabilities across various VLM structures
would be an promising direction for future research.
Lastly, we observe that the transferability of Q depends
on the training dataset. From Table 10, we observe that
transferring Q learned from UCF-Crime to XD-Violence
results in a smaller performance drop compared to the
reverse case. This suggests the source dataset is crucial to
the transferability of Q across datasets.
AUC (%)
Question Type
No questions
78.81
Manually written questions by human
81.15
Learned questions w/o iteratively inputting Vbatch in Eq. (2) 78.06
Iteratively learned questions (used in VERA)
86.55
Table 6. The way we obtain guiding questions affects AUC substantially.
AUC (%)
How to Obtain Guiding Questions Q for VLM. As seen
in Table 6, if the guiding questions are not incorporated into
the VLM prompt, the AUC will drop largely to 78.81%,
confirming the need to use simpler and more focused questions to provoke reasoning in the VLMs for VAD. Meanwhile, if we use manually written questions (detailed in the
Appendix), the performance is suboptimal with an 81.15%
AUC, which shows the need to use verbalized learning to
find guiding questions. Lastly, if we only input batched predictions Ŷbatch and ground truths Ybatch without inputting
Vbatch in the optimizer, the Q updated in this way will dumb
the VLMs and make it have a low AUC. Thus, inputting
video frames as Eq. (2) does is necessary to learn good Q.
Number of Questions m.
86.55
85.79
86
0 question
As shown in Fig. 4, when
83.76
84
m is set to 1, the rea82
soning is limited to a sin80
gle perspective, resulting
78.81
78 77.31
in a lower AUC. As m increases up to 5, the model
76
1
3
5
7
captures more compreNumber of Questions
hensive anomaly patterns, Figure 4. Effect of the number of guidleading to improved AUC. ing questions on AUC.
However, increasing m
beyond 5 yields no significant gains. Therefore, we set m
to 5 by default in VERA, if not otherwise specified.
Operation
Initial (Step 1)
Initial + Retrieval (Step 2)
Initial + Retrieval + Smoothing (Step 3)
Initial + Retrieval + Smoothing + Weighting (Step 3)
AUC (%)
76.10
84.53 (+8.43)
85.48 (+0.95)
86.55 (+1.07)
Table 7. Ablation study of each step in VERA inference.
4.4. Qualitative Results and Case Studies
Coarse-to-Fine Anomaly Score Computation. We also
validate the anomaly score computation by VERA. Table 7
shows the AUC is 76.10% when using the flattened initial
score obtained in Step 1, and leveraging retrieved segments
in Step 2 significantly boosts the AUC to 84.53%, highlighting the effectiveness of incorporating ensemble scores
based on scene context. Meanwhile, smoothing and weighting in Step 3 further improves the AUC by around 1% each,
verifying the benefit of integrating temporal context.
Generalizability Test. We further examine the generalizability of VERA across different model sizes,
To illustrate how VERA performs video anomaly detection,
we take one video for a qualitative demonstration of the explainability brought by the learned Q, as shown in Fig. 5.
Please refer to the Appendix for more qualitative examples if interested. The main anomaly in this video is that
a man tries to steal money from the washing machines in
a laundromat and is arrested after being found by the police. In Fig. 5, we provide the guiding questions learned
by VERA and take 6 main video segments (each with 2
sampled frames and their time indices are given in Fig. 6)
8
Learned Guiding Questions 𝑸∗ in VERA
1. Are there any people in the video who are ①
... ②
... ③
not in their typical positions or engaging in
activities that are not consistent with their
usual behavior?
🤖: The video shows a person
🤖: The man is not in his
🤖: The person in the video is
2. Are there any vehicles in the video that are
not in their typical positions or being used
in a way that is not consistent with their
usual function?
walking in a hallway and
entering a room. There are no
people in unusual positions or
engaging in activities that are
not consistent with their usual
behavior.
3. Are there any objects in the video that are
not in their typical positions or being used
in a way that is not consistent with their ④
usual function?
4. Is there any visible damage or unusual
movement in the video that indicates an
anomaly?
5. Are there any unusual sounds or noises
in the video that suggest an anomaly?
typical position or engaging
in activities consistent with his
usual behavior. He is bending
over and interacting with the
machines in a way that is not
typical.
... ⑥
... ⑤
🤖: The person in the video is
not in their typical position or
engaging in activities consistent
with their usual behavior. They
are lying on the floor, which is
not typical behavior for
someone in this setting.
not in a typical position or
engaging in activities that are
not consistent with their usual
behavior. They are bending
over and reaching into the
machine, which is not a
common action for them.
🤖: The video shows a person
lying on the floor and being
attended
to
by
two
individuals, which is not
consistent with their usual
behavior.
🤖: The video shows a person
being restrained by two
individuals in a room. This
activity is not consistent with
typical behavior in the
environment.
Figure 5. Given Q∗ by VERA, the frozen VLM (InternVL2-8B) will reason and explain the scene based on it. For illustration, we take as an example the
video “Arrest007 x264” from UCF-Crime and include 6 scenes here. The complete anomaly scores are shown in Fig. 6.
Anomaly Score
as examples to demonstrate the explanation produced by
VERA. From every given answer, the frozen VLM with
VERA-learned questions is able to explain the scene by
closely following the detailed anomaly characterization of
the five learned guiding questions. For example, in the second scene, one question in Q∗ states that “Are there any
people in the video who are not in their typical positions or
engaging in activities that are not consistent with their usual
behavior”, and it successfully triggers the reasoning abilities from the frozen VLM. The VLM then accurately describes the abnormal event and explains why it is regarded
as an anomaly under the cue from the question.
Moreover, owing to
1.0
③④ ⑤⑥
①②
the proposed coarse-to0.8
fine detection strategy
0.6
in testing, the anomaly
0.4
score dynamics shown
0.2
in Fig. 6 well repre0
sents the actual real-time
0
500 1000 1500 2000 2500 3000
Frame Index
anomaly level in this
video and gradually in- Figure 6. Anomaly scores generated
by VERA (with InternVL2-8B) in “Arcreases to nearly 1 when rest007 x264” from UCF-Crime.
the man is being arrested.
This result verifies that VERA allows VLMs to effectively
identify anomalies with a holistic model, reducing the manpower and computational overhead for explainable VAD.
More interestingly, we want to highlight one more advantage of VERA. That is, VERA allows humans to further interact with VLMs because it retains the general
question-answering ability of pretrained VLMs. This is because VERA does not require finetuning of the VLM backbone weights. Although finetuning VLMs with parameterefficient methods like [16, 26, 32] is easy and computationally tractable, instruction-tuned models still inevitably
lose the flexibility to handle general questions (due to catas-
🤖: The video shows a person walking in a
hallway and entering a room. There are no
people in unusual positions or engaging in
activities that are not consistent with their usual
behavior. Based on the analysis, there is no
anomaly in the video.
…
Segment ① in the example video
🙋: (Follow-up Question) This is a laundromat with washing machines. Why this
behavior is normal?
🤖: The behavior in the video is normal because it depicts a person walking in a
hallway and entering a room that appears to be a laundromat with washing machines.
The person's actions, such as walking and entering the room, are consistent with
typical behavior in a laundromat setting.
🙋: (Counterfactual Question) Please infer with this hypothetical scenario and
answer in a few sentences: what if the man is stealing from the machine?
🤖: If the man in the video is stealing from the washing machines, this behavior would
be considered an anomaly. Stealing is an illegal and unethical act that is not
consistent with typical behavior in a laundromat setting. The presence of such
behavior would indicate a violation of the law and a breach of trust, making it an
anomaly in the video.
Figure 7. Humans can further interact with VERA (Backbone: InternVL28B) and ask open-ended questions.
This part contains
a scene where the
suspect is taken
away by the police
trophic forgetting), as they are trained to respond to certain
queries with fixed answer styles. In contrast, as shown in
Fig. 7, the learned Q∗ can steer reasoning in a frozen VLM
while still being able to allow the VLM to answer openended (like follow-up or counterfactual) questions, which is
an important ability lost in instruction tuning-based models.
5. Concluding Remarks and Limitations
We propose a novel pipeline, VERA, which can effectively
elicit the reasoning ability from VLMs to perform explainable VAD without additional computation overhead. This
is done through an effective and novel application of verbalized machine learning [47] to VLM. In training, VERA
obtains the guiding questions detailing anomaly patterns
through the verbal interaction between the learner and the
optimizer agents. In inference, VERA uses them to enhance
VLMs for identifying anomalies and compute frame-level
anomaly scores in a coarse-to-fine process. Experimental
results validate the effectiveness of the VERA framework
in achieving state-of-the-art explainable VAD performance.
9
Like existing VLM-based VAD methods, VERA’s performance relies heavily on the visual perception capabilities of VLMs. Most VLMs employ the CLIP vision encoder [33], which has limitations in capturing fine-grained
visual details. This limitation can impair precise anomaly
detection. If important visual features are missing during
the visual encoding process, then it is unlikely for VERA to
perform meaningful verbalized learning. Therefore, a fundamental challenge for VLM-based VAD is to ensure sufficient visual and temporal features are encoded. Having verified this capability, VERA can perform verbalized learning
to extract crucial cues that guide video anomaly reasoning.
worth 16x16 words: Transformers for image recognition at
scale. In ICLR, 2021. 1
[10] Jia-Chang Feng, Fa-Ting Hong, and Wei-Shi Zheng. Mist:
Multiple instance self-training framework for video anomaly
detection. In CVPR, 2021. 6, 7
[11] Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat
Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan
Misra. Imagebind: One embedding space to bind them all.
In CVPR, 2023. 5, 18
[12] Rafael C Gonzalez. Digital image processing. Pearson education india, 2009. 5
[13] Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin
Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei
Liu. Regiongpt: Towards region understanding vision language model. In CVPR, 2024. 3
[14] Mahmudul Hasan, Jonghyun Choi, Jan Neumann, Amit K
Roy-Chowdhury, and Larry S Davis. Learning temporal regularity in video sequences. In CVPR, 2016. 6, 7, 14
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Deep residual learning for image recognition. In CVPR,
2016. 6
[16] Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li,
Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank
adaptation of large language models. In ICLR, 2021. 9
[17] Hyekang Kevin Joo, Khoa Vo, Kashu Yamazaki, and Ngan
Le. Clip-tsa: Clip-assisted temporal self-attention for
weakly-supervised video anomaly detection. In ICIP, 2023.
6, 7, 17
[18] Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 4
[19] Guoqiu Li, Guanxiong Cai, Xingyu Zeng, and Rui Zhao.
Scale-aware spatio-temporal relation learning for video
anomaly detection. In ECCV, 2022. 6, 7
[20] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi.
Blip-2: Bootstrapping language-image pre-training with
frozen image encoders and large language models. In ICML,
2023. 1
[21] Shuo Li, Fang Liu, and Licheng Jiao. Self-training multisequence learning with transformer for weakly supervised
video anomaly detection. In AAAI, 2022. 6, 7, 17
[22] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee.
Improved baselines with visual instruction tuning. In CVPR,
2024. 7, 17
[23] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee.
Visual instruction tuning. In NeurIPS, 2024. 1, 3
[24] Weiyang Liu, Yan-Ming Zhang, Xingguo Li, Zhiding Yu, Bo
Dai, Tuo Zhao, and Le Song. Deep hyperspherical learning.
In NeurIPS, 2017. 5
[25] Wen Liu, Weixin Luo, Dongze Lian, and Shenghua Gao. Future frame prediction for anomaly detection–a new baseline.
In CVPR, 2018. 3
[26] Weiyang Liu, Zeju Qiu, Yao Feng, Yuliang Xiu, Yuxuan
Xue, Longhui Yu, Haiwen Feng, Zhen Liu, Juyeon Heo,
Songyou Peng, et al. Parameter-efficient orthogonal finetuning via butterfly factorization. In ICLR, 2024. 9
[27] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang,
Stephen Lin, and Han Hu. Video swin transformer. In CVPR,
2022. 6
References
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida,
Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.
Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
2023. 1
[2] Daniel Bogdoll, Maximilian Nitsche, and J Marius Zöllner.
Anomaly detection in autonomous driving: A survey. In
CVPR Workshops, 2022. 1
[3] Meinardus Boris, Batra Anil, Rohrbach Anna, and Rohrbach
Marcus. The surprising effectiveness of multimodal large
language models for video moment retrieval. arXiv preprint
arXiv:2406.18113, 2024. 7
[4] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom
Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler,
Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen,
Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess,
Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models
are few-shot learners. In NeurIPS, 2020. 1
[5] Joao Carreira and Andrew Zisserman. Quo vadis, action
recognition? a new model and the kinetics dataset. In CVPR,
2017. 6
[6] Junxi Chen, Liang Li, Li Su, Zheng-jun Zha, and Qingming Huang. Prompt-enhanced multiple instance learning
for weakly supervised video anomaly detection. In CVPR,
2024. 3
[7] Yingxian Chen, Zhengzhe Liu, Baoheng Zhang, Wilton
Fok, Xiaojuan Qi, and Yik-Chung Wu. Mgfn: Magnitudecontrastive glance-and-focus network for weakly-supervised
video anomaly detection. In AAAI, 2023. 6, 7, 17
[8] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen,
Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu,
Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In CVPR,
2024. 1, 2, 7
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov,
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is
10
Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s
perception of the world at any resolution. arXiv preprint
arXiv:2409.12191, 2024. 7, 8
[44] Jhih-Ciang Wu, He-Yen Hsieh, Ding-Jie Chen, Chiou-Shann
Fuh, and Tyng-Luh Liu. Self-supervised sparse representation for video anomaly detection. In ECCV, 2022. 6, 7, 17
[45] Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao,
Zhaoyang Wu, and Zhiwei Yang. Not only look, but also
listen: Learning multimodal violence detection under weak
supervision. In ECCV, 2020. 2, 3, 6, 7, 17
[46] Peng Wu, Xuerong Zhou, Guansong Pang, Yujia Sun, Jing
Liu, Peng Wang, and Yanning Zhang. Open-vocabulary
video anomaly detection. In CVPR, pages 18297–18307,
2024. 6, 7, 14, 17
[47] Tim Z Xiao, Robert Bamler, Bernhard Schölkopf, and
Weiyang Liu. Verbalized machine learning: Revisiting
machine learning with language models. arXiv preprint
arXiv:2406.04344, 2024. 3, 4, 9
[48] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and
Kaiming He. Aggregated residual transformations for deep
neural networks. In CVPR, 2017. 6
[49] Yuchen Yang, Kwonjoon Lee, Behzad Dariush, Yinzhi Cao,
and Shao-Yuan Lo. Follow the rules: reasoning for video
anomaly detection with large language models. arXiv
preprint arXiv:2407.10299, 2024. 1, 3, 7
[50] Zhiwei Yang, Jing Liu, and Peng Wu. Text prompt with normality guidance for weakly supervised video anomaly detection. In CVPR, 2024. 3
[51] Muchao Ye, Xiaojiang Peng, Weihao Gan, Wei Wu, and Yu
Qiao. Anopcn: Video anomaly detection via deep predictive
coding network. In ACM international conference on multimedia, 2019. 3
[52] Mert Yuksekgonul, Federico Bianchi, Joseph Boen, Sheng
Liu, Zhi Huang, Carlos Guestrin, and James Zou. Textgrad:
Automatic “differentiation” via text.
arXiv preprint
arXiv:2406.07496, 2024. 3
[53] Muhammad Zaigham Zaheer, Arif Mahmood, Marcella
Astrid, and Seung-Ik Lee. Claws: Clustering assisted weakly
supervised learning with normalcy suppression for anomalous event detection. In ECCV, 2020. 6, 7
[54] M Zaigham Zaheer, Arif Mahmood, M Haris Khan, Mattia Segu, Fisher Yu, and Seung-Ik Lee. Generative cooperative learning for unsupervised video anomaly detection. In
CVPR, 2022. 6, 7
[55] Luca Zanella, Willi Menapace, Massimiliano Mancini, Yiming Wang, and Elisa Ricci. Harnessing large language models for training-free video anomaly detection. In CVPR,
2024. 1, 3, 5, 6, 7, 8, 17, 18
[56] Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun
Qing, Qingming Huang, and Ming-Hsuan Yang. Exploiting
completeness and uncertainty of pseudo labels for weakly
supervised video anomaly detection. In CVPR, 2023. 3
[57] Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An
instruction-tuned audio-visual language model for video understanding. In EMNLP, 2023. 3, 7
[28] Cewu Lu, Jianping Shi, and Jiaya Jia. Abnormal event detection at 150 fps in matlab. In ICCV, 2013. 3, 6, 7
[29] Hui Lv and Qianru Sun. Video anomaly detection and
explanation via large language models. arXiv preprint
arXiv:2401.05702, 2024. 1, 2, 3, 7
[30] Hui Lv, Zhongqi Yue, Qianru Sun, Bin Luo, Zhen Cui, and
Hanwang Zhang. Unbiased multiple instance learning for
weakly supervised video anomaly detection. In CVPR, 2023.
3
[31] Sarah Pratt, Ian Covert, Rosanne Liu, and Ali Farhadi. What
does a platypus look like? generating customized prompts
for zero-shot image classification. In ICCV, 2023. 3
[32] Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao
Feng, Zhen Liu, Dan Zhang, Adrian Weller, and Bernhard
Schölkopf. Controlling text-to-image diffusion by orthogonal finetuning. In NeurIPS, 2023. 9
[33] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021. 10
[34] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard
Schölkopf, Thomas Brox, and Peter Gehler. Towards total
recall in industrial anomaly detection. In CVPR, 2022. 1
[35] Waqas Sultani, Chen Chen, and Mubarak Shah. Real-world
anomaly detection in surveillance videos. In CVPR, 2018. 2,
3, 6, 7, 17
[36] Kamalakar Vijay Thakare, Debi Prosad Dogra, Heeseung
Choi, Haksub Kim, and Ig-Jae Kim. Rareanom: A benchmark video dataset for rare type anomalies. Pattern Recognition, 140:109567, 2023. 6, 7
[37] Kamalakar Vijay Thakare, Yash Raghuwanshi, Debi Prosad
Dogra, Heeseung Choi, and Ig-Jae Kim. Dyannet: A scene
dynamicity guided self-trained video anomaly detection network. In WACV, 2023. 6, 7
[38] Yu Tian, Guansong Pang, Yuanhong Chen, Rajvinder Singh,
Johan W Verjans, and Gustavo Carneiro. Weakly-supervised
video anomaly detection with robust temporal feature magnitude learning. In ICCV, 2021. 6, 7, 17
[39] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani,
and Manohar Paluri. Learning spatiotemporal features with
3d convolutional networks. In ICCV, 2015. 6
[40] Anil Osman Tur, Nicola Dall’Asen, Cigdem Beyan, and
Elisa Ricci. Unsupervised video anomaly detection with diffusion models conditioned on compact motion representations. In International Conference on Image Analysis and
Processing, 2023. 3, 6, 7
[41] Jue Wang and Anoop Cherian. Gods: Generalized one-class
discriminative subspaces for anomaly detection. In ICCV,
2019. 3, 6, 7
[42] Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua
Lin, Xiaoou Tang, and Luc Van Gool. Temporal segment
networks: Towards good practices for deep action recognition. In ECCV, 2016. 7
[43] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan,
Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin
Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui
11
[58] Huaxin Zhang, Xiaohao Xu, Xiang Wang, Jialong Zuo,
Chuchu Han, Xiaonan Huang, Changxin Gao, Yuehuan
Wang, and Nong Sang. Holmes-vad: Towards unbiased and
explainable video anomaly detection via multi-modal llm.
arXiv preprint arXiv:2406.12235, 2024. 1, 2, 3, 6, 7, 17
[59] Menghao Zhang, Jingyu Wang, Qi Qi, Haifeng Sun, Zirui
Zhuang, Pengfei Ren, Ruilong Ma, and Jianxin Liao. Multiscale video anomaly detection by multi-grained spatiotemporal representation learning. In CVPR, 2024. 3
[60] Jia-Xing Zhong, Nannan Li, Weijie Kong, Shan Liu,
Thomas H Li, and Ge Li. Graph convolutional label noise
cleaner: Train a plug-and-play action classifier for anomaly
detection. In CVPR, 2019. 6, 7
[61] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language
understanding with advanced large language models. In
ICLR, 2024. 1
12
Appendix
Table of Contents
A. Training in VERA
A.1. Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
A.2. Details for Iterative Update by the Optimizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
14
15
B. Additional Experiments and Results
B.1 . Comparison to the State-of-the-art Methods on XD-Violence Measured by AP . . . . . . . . . . . . . . . . .
B.2 . Discussion on Generalizability of Used Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
B.3 . Hyperparameters in Inference and Sensitivity Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
B.4 . Additional Qualitative Results & Case Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
17
17
18
19
13
We include more details on training in VERA (Sec. A) and additional experimental results (Sec. B). To specify:
• In Sec. A, we provide the pseudocodes and details on the initialization, the learner prompt template, and the optimizer
prompt template for the training process in Sec. A.1. After that, we discuss the optimization process of the learned questions
by the optimizer in Sec. A.2.
• In Sec. B, we first include comparison results with the state-of-the-art methods on XD-Violence measured by AP in
Sec. B.1. We also discuss other good properties of VERA, including the good generalizability of the learned questions
for different scenarios and the insensitivity of VERA regarding hyperparameters in Sec. B.2 and Sec. B.3, respectively.
Finally, we include additional case studies with normal and abnormal videos in Sec. B.4.
A. Training in VERA
A.1. Algorithm
We show the complete iterative training process of VERA in pseudocodes in Algorithm 1. It is an iterative process of using the
learner to output binary prediction for each sample in a mini-batch and asking the optimizer to update the guiding questions
after collecting the batched data. Meanwhile, we have a small validation set (10% samples randomly drawn from the original
training set) for deciding the Q∗ used for testing. We want to further detail on certain elements in Algorithm 1 as follows.
Algorithm 1: Optimizing Guiding Questions in VAD by VERA during Training
Inputs: Training data pairs Dtrain = {(Ṽ (j) , Y (j) )}N
j=1 , iteration number P , initial guiding questions Q0 , learner flearner ,
(j)
(j)
optimizer fopt , learner prompt template θ, optimizer prompt template ψ, validation set Dval = {(Ṽval , Yval )}ηj=1 , period for
validation µ, batch size n.
Output: Optimal guiding questions Q∗ .
Set iteration counter t ← 1;
Set Q∗ ← Q0 , test Q0 on validation set Dval and compute its validation accuracy as Acc∗ ;
while t ≤ P do
# Conduct the learning task with a mini-batch by the learner
(1)
(n)
Randomly sample a batch without repetition from Dtrain with a visual input batch Vbatch = [Ṽbatch , · · · , Ṽbatch ] and ground
(1)
(n)
truths Ybatch = [Ybatch , · · · , Ybatch ];
for 1 ≤ j ≤ n do
(j)
(j)
(j)
(t)
(j)
Obtain a prediction Ŷbatch for Ṽbatch from flearner with prompt (θ, Qt ) by Eq. (1) as Ŷbatch = flearner (Ṽbatch );
end
# Update the guiding questions with the batched data by the optimizer
(1)
(n)
Input the batched prediction Ŷbatch = [Ŷbatch , · · · , Ŷbatch ] with Vbatch and Ybatch into the optimizer for obtaining a new set of
(t)
guiding questions by Eq. (2) as: Qt+1 = fopt (Vbatch , Ŷbatch , Ybatch );
# Compute the validation accuracy with the learned guiding questions periodically
t ← t + 1;
if t mod µ = 0 then
Test Qt on the validation set Dval and compute the validation accuracy Acct ;
if Acct > Acc∗ then
Update Q∗ ← Qt ;
Update ACC∗ ← ACCt ;
end
end
end
Return Q∗ ;
Initial Q0 . The initial guiding questions Q0 are “1. Is there any suspicious person or object that looks unusual in this
scene? 2. Is there any behavior that looks unusual in this scene?”. These two questions are manually written and inspired
by previous VAD methods, which assume anomaly as something or somebody with unusual appearance or motions [14, 46].
This set of questions is also the “manually written questions by human” in Table 6, which is suboptimal in guiding frozen
VLMs to detect anomalies. The key idea of training is to use verbalized learning to iteratively update Q given a suboptimal
Q0 .
Learner Prompt Template θ. We detail the design of θ as follows. As shown in Fig. 2, the learner prompt template θ
includes four sections, i.e., Model Description, Prompt Questions, Input, and Output Formatting. To specify:
14
• Model Description: This section introduces the learning task, providing the learner with the necessary background knowledge to understand the objective. It clarifies what the learner is expected to predict based on the given visual input data.
• Prompt Questions: This section presents a general prompt to guide the learner’s reasoning process. Specific prompts,
denoted as Qt , will be inserted here to facilitate reasoning within a frozen VLM.
• Input: This section simply stores the visual tokens. When the VLM reads this, it will correlate the read text with the visual
inputs.
• Output Formatting: The last section in θ mainly provides information on output formats to ensure that VLMs think through
the given questions Qt and output a prediction in a format easy for post-processing in computers.
Optimizer Prompt Template ψ. As shown in Fig. 2, the optimizer prompt template includes seven sections, i.e., Instruction,
Inputs, Model Description, Current Prompt Questions, Model Predictions & Targets, and Optimization Instruction:
• Instruction: The prompt template begins with an introduction outlining the responsibilities of the optimizer, clearly stating
that its primary task is to optimize the guiding questions provided.
• Inputs: This section is used to attach the batched visual data for the reference of the optimizer.
• Model Description: The learning task of the learner is reiterated here for the information of the optimizer.
• Current Prompt Questions: The guiding questions used by the learner in the current iteration are shown here for the
reference of the optimizer.
• Model Predictions & Targets: The batched numerical predictions and the ground truths are shown here for fopt . These two
inputs can tell the optimizer how well the learner does in the learning task on the mini-batch data.
• Optimization Instruction: The final section includes the instruction to ask the optimizer to think step by step with all the
information above and output a new set of prompt questions with the required format.
A.2. Details for Iterative Update by the Optimizer
Validation Accuracy (%)
0.72
In training, we assess the quality of the learned guiding questions by the accuracy
of the validation set. We show the validation accuracy from different questions Qt
0.7
obtained every 100 iterations (mini-batches) in Fig. 8. In the duration of up to 5000
0.68
iterations in training, the observed plot in Fig. 8 contains three oscillations, each con0.66
sisting of an increase in validation accuracy followed by a decrease. The increase rep0.64
resents that the optimizer VLM gradually finds better questions for the binary classi0.62
fication learning task when it sees more batched data, which shows the optimizer can
0.6
0
10
20
30
40
50
understand its responsibility well and find better questions effectively. Meanwhile,
Iteration ( 100)
we note that verbal optimization may not always lead to an increase. This is probably Figure 8. The validation accuracy given difbecause the optimization is completely verbalized, and the VLM will have an inertial ferent learned guiding questions from each itthinking behavior like humans, which gets the optimizer stuck in the wrong direction eration. The graph is smoothed with moving
and makes it continue the optimization in a direction that is not beneficial. As a re- average (window size 5) for better readability.
sult, this causes the validation accuracy to decrease sometimes. Despite that, because
of the guidance provided by the optimizer prompt template ψ, the optimizer can overcome its pitfalls in thinking and find
good guiding questions in a new direction, which leads to an increase in validation accuracy afterward. This is an interesting
phenomenon due to the distinction between verbal learning and traditional numerical optimization algorithms, and it will be
a promising future direction to reduce the time in overcoming pitfalls in thinking for VLMs during verbalized learning.
Learned Questions after 100 Iterations
Learned Questions after 300 Iterations
1. Are there any vehicles in the video frames that are not
typically found on a street, such as a motorcycle in a store or a
backpack in a parking lot?
1. Are there any people in the video frame who are not
engaging in typical activities, such as loitering or not
interacting with the surroundings?
2. Are there any people in the video frames who are interacting
with objects in a manner that is not typical for a street setting,
such as handling a large object in a store or carrying a bag in a
parking lot?
2. Are there any objects in the video frame that are not
consistent with the typical environment, such as
unusual equipment or structures?
3. Are there any people in the video frames who are standing
or sitting in a manner that seems suspicious or out of place,
such as sitting on the ground in a store or standing in a parking
lot?
4. Are there any people in the video frames who are wearing
clothing that is not typical for a street setting, such as wearing
a uniform in a parking lot or wearing protective gear in a store?
5. Are there any people in the video frames who are in unusual
positions or postures, such as sitting on the ground or standing
in a way that is not typical for a store setting?
3. Are there any sudden or unusual movements in the
video frame that are not typical for the environment,
such as equipment malfunctions or unexpected actions?
4. Are there any people in the video frame who are not
following the normal flow of activities, such as
standing in unusual places or engaging in suspicious
behaviors?
5. Are there any objects in the video frame that are not
interacting with the environment in a typical manner,
such as being placed in unusual locations or not being
used for their intended purposes?
Learned Questions after 500 Iterations
Learned Questions after 700 Iterations
1. Are there any individuals in the video 1. Are there any frames where the
frames who are carrying unusual or presence of a specific object (e.g.,
unexpected items?
unusual item, unexpected tool) is not
typical for the scene?
2. Are there any objects in the video frames
that are not typically found in that 2. Are there any frames where the
environment and are being used in an arrangement of people or objects is
unusual or unexpected way?
unusual or unsafe?
3. Are there any individuals in the video 3. Are there any frames where an
frames who are interacting with objects in an individual is carrying an object that is
unusual or unexpected way?
not typical for the scene?
4. Are there any individuals in the video 4. Are there any frames where an
frames who are performing actions that are individual is interacting with an object
not typical for that environment?
in an unusual manner?
5. Are there any unusual or unexpected 5. Are there any frames where the
activities occurring in the video frames, such overall environment or setting is not
as interactions between objects or individuals consistent with normal conditions?
that are not typical for that environment?
Figure 9. We take the guiding questions Q learned from the 100th iteration to the 700th iteration for illustration purpose. During the updating process, the
optimizer gradually concretizes anomaly patterns that can be applied to different scenarios in a concise expression.
15
In addition, w.l.og., we take learned questions from the 100th iteration to the 700th iteration (which are within the first
epoch) for illustration to show the process of updating Q by the optimizer in Fig. 9. First, as the optimizer sees more videos,
it tries to make the questions focus on a more general setting. For example, the questions in the 100th iteration focus on
“street” and “store” scenes. After more iterations, the questions become more generalizable for a general environment and
focus on the elements that cause anomalies. Additionally, the anomalous pattern descriptions become more diverse as the
optimization continues. To illustrate, in the beginning, the questions mostly pay attention to the humans, objects, and their
interaction. In later iterations, the optimizers gradually summarize some previous questions into one and raise questions
considering the overall environment (Q5 from the 700th iteration). Therefore, the verbalized learning framework proposed
in this paper is effective in finding a diverse set of guiding questions for VAD that apply to general cases, which can elicit the
reasoning of a frozen VLM in VAD.
16
B. Additional Experiments and Results
B.1. Comparison to the State-of-the-art Methods on XD-Violence Measured by AP
The comparison results regrading average precision (AP), i.e., the area under the frame-level Method
AP
precision-recall curve, on XD-Violence are shown in Table 11. Compared to AUC, AP focuses Non-Explainable VAD Methods
78.64
on measuring the ability to identify the positive class (anomaly), while AUC measures how well Wu et al.† [45]
[46]
66.53
a method separates anomaly and normalcy in general. We provide the analysis of the results as OVVAD
80.26
S3R† [44]
follows.
RTFM† [38]
77.81
Firstly, under such a distinct property of AP, as pointed out by [46], methods trained on MSL† [21]
78.58
the whole training set and utilizing all frames will enjoy advantages when measuring VAD MGFN† [7]
80.11
performance by AP. As a result, CLIP-TSA and Holmes-VAD, two methods using the whole CLIP-TSA† [17]
82.19
Explainable VAD Methods
training frames, attain the highest AP in the category of non-explainable and explainable VAD,
84.96
respectively. We acknowledge there is a gap between VERA and these two methods under Holmes-VAD† [58]
62.01
AP on XD-Violence, which is understandable because they use the whole training frames to LAVAD [55]
ZS CLIP [55]
17.83
improve the ability to find anomalies of classifiers. To illustrate, in training VERA only sam27.25
ZS IMAGEBIND-I [55]
ples 8 frames for each video and only uses 0.19% total frames (31,632 out of 16,378,527) for ZS IMAGEBIND-V [55] 25.36
training on XD-Violence. Thus, our training is dramatically light compared to the methods LLAVA-1.5 [22]
50.26
like CLIP-TSA and Holmes-VAD in Table 11. With fewer frames used for training, VERA VERA
70.54
unavoidably achieve lower AP (which only considers positive cases) compared to those that Table 11. AP (%) on XDhave more, for it relies on fewer training data. In addition, we want to point out that judging Violence. † indicates VAD meththe VAD performance solely by AP on XD-Violence can be biased. This is because the ratio of ods are trained on entire training
frames. No instruction tuning is
positive frames in XD-Violence (23.07%) in test videos is overly higher than other datasets like used for Holmes-VAD.
UCF-Crime (7.92%), which is unrealistic because the anomaly is sparse in the real world [35].
Given that, only focusing on the comparison in AP on XD-Violence would amplify the bias in VAD performance evaluation,
and we recommend taking into consideration other factors like training costs and the comprehensive ability of distinguishing
anomaly and normality by the methods in evaluation.
Secondly, among the methods (OVVAD, LAVAD, ZS CLIP, ZS IMAGEBIND, and LLAVA-1.5) that does not use full
frames for training, VERA achieves the best AP in this fair comparison, surpassing the second best method in the Explainable
VAD category (LAVAD) over 8.53%, which showcases the effectiveness of using learned guiding question to prompt frozen
VLMs for VAD.
To conclude, it is unfair to only judge VAD performance by AP on XD-Violence without considering the training costs and
the relatively imbalanced frame distribution in test videos. Considering all factors into consideration, VERA is a favorable
method used for VAD in detecting anomalies.
B.2. Discussion on Generalizability of Used Questions
During the optimization of Q, because of the randomness involved in this process, the optimizer may output certain guiding
questions that only focus on one specific surrounding. We find an interesting phenomenon on VLMs in VAD that guiding
questions related to a specific scenario yield inferior VAD performance compared to the general questions in both general
cases and specific cases.
To illustrate, we take two sets of specific questions for analysis. The first example is a set of guiding questions Qtraffic
that only ask the VLM to consider anomalies related to the traffic as follows:
1. Are there any vehicles or people violating traffic rules?
2. Are there any accidents or near-accidents occurring?
3. Are there any objects or people obstructing the normal flow of traffic?
4. Are there any unusual or unexpected behaviors from pedestrians or drivers?
5. Are there any emergency vehicles or personnel present?
The second example is another set of guiding questions Qstore that only ask the VLM to identify anomalies in a store
setting, which includes questions like:
1. Are there any individuals loitering or behaving suspiciously inside the store?
2. Is there any unusual activity inside the store, such as tampering with items or attempting to enter restricted areas?
3. Are there any signs of forced entry or damage to the store’s entrance?
4. Are there any individuals present who seem to be watching or waiting for something specific inside the store?
5. Are there any interactions between individuals inside the store that appear suspicious or out of the ordinary?
17
Thus, Qtraffic and Qstore focuses on the specific anomalies of traffic accidents and shoplifting, respectively, while the Q∗
that we find focuses on general cases and includes the following questions:
1. Are there any people in the video who are not in their typical positions or engaging in activities that are not consistent
with their usual behavior?
2. Are there any vehicles in the video that are not in their typical positions or being used in a way that is not consistent with
their usual function?
3. Are there any objects in the video that are not in their typical positions or being used in a way that is not consistent with
their usual function?
4. Is there any visible damage or unusual movement in the video that indicates an anomaly?
5. Are there any unusual sounds or noises in the video that suggest an anomaly?
The comparison results of Q∗ , Qtraffic , and Qstore in detecting anomalies in general cases (all testing videos on UCFCrime), traffic scenes (testing videos from the Traffic Accident category on UCF-Crime), and the store scenes (testing videos
from the Shoplifting category on UCF-Crime) are shown in Table 12. It indicates that Q∗ performs the best in both general
cases and two specific cases like in traffic and store scenes. This is because the overly specific definition of anomalies
like Qtraffic and Qstore makes it harder for a VLM to classify one clip into an anomaly and leads to more false negatives
in its prediction given those specific questions, which degrades the performance. Therefore, we recommend using general
questions like the ones shown in Q∗ in frozen VLMs for VAD.
Questions
Scenario
Traffic Store
70.43 72.58
67.53
/
/
44.84
All
86.55
82.59
76.67
Q∗
Qtraffic
Qstore
Table 12. General guiding questions outperform specific ones measured by AUC (%) on UCF-Crime. Specific questions are not tested on other specific
scenarios, which is indicated by a slash (/).
B.3. Hyperparameters in Inference and Sensitivity Test
Hyperparameters in Inference During inference, in Step 1, following [55], the interval between each segment center d is
16 frames. In Step 2, we use ImageBind [11] as the feature extractor in computing segment similarity as [55] does, and the
number of retrieved segments K depends on the total number of segments h in each test video V . Setting K to (0.1 · h) to
(0.15 · h) is generally good. We set K to (0.1 · h) for UCF-Crime and to (0.15 · h) for XD-Violence. The temperature τ in the
Softmax function is set to 10 for both datasets in Eq. (4). In Step 3, due to the properties of datasets, we set the filter size ω of
G(p) to 15 and σ1 to 10 for UCF-Crime, while setting ω to 30 and σ1 to 30 for XD-Violence. For position weighting, we set
c = floor(F/2) and σ2 = floor(F/2) for both datasets to make sure the position weight covers the whole video sequence.
W.l.o.g, we test the sensitivity of the VAD performance of VERA regarding hyperparameters on UCF-Crime.
Sensitivity Test for K. As shown in Table 13, as the number of retrieved segments increases from 0 to 0.15 · h, the AUC
gradually increases from to 85.21% to 86.61%. Meanwhile, if we randomly select 0.1 · h segments for retrieval, the AUC
is even lower than the performance without retrieval. Thus, using Eq. (4) for retrieval is necessary. Meanwhile, having a
large K greater than 0.15 · h will introduce some noise in Eq. (4) and downgrade the AUC slightly. Thus, selecting 0.1 · h or
0.15 · h for K is generally good choice.
Ratio (%)
0
5
10
15
20
25
AUC (%)
85.21
86.48
86.55
86.61
86.42
86.19
Table 13. Influence of the number of retrieved segments on AUC. The AUC of not using retrieval (Ratio = 0%) and randomly selecting 10% segments for
Eq. (4) is 85.21% and 84.55%, respectively.
Sensitivity Test for ω. The filter size decides how many local segments are incorporated for the current segment for Gaussian
smoothing. From Table 14, we find that AUC converges when the filter size increases to 15. Meanwhile, the VAD performance measured AUC is insensitive to ω and does not fluctuate much. Thus, we can set the filter size with a medium number
like 15.
ω
5
10
15
20
25
AUC (%)
86.25
86.43
86.55
86.61
86.60
Table 14. Influence of filter size ω in Gaussian Smoothing on AUC.
18
Learned Guiding Questions !∗ in VERA
1. Are there any people in the video who are ①
not in their typical positions or engaging in
activities that are not consistent with their
usual behavior?
... ②
!: The people in the video are
in their typical positions and
engaging in activities consistent
with their usual behavior. There
are no vehicles or objects in
unusual positions or being used
in ways inconsistent with their
usual function. There is no
visible damage or unusual
movement in the video. There
are no unusual sounds or noises
in the video. Based on the
analysis, there is no anomaly in
the video.
2. Are there any vehicles in the video that are
not in their typical positions or being used
in a way that is not consistent with their
usual function?
3. Are there any objects in the video that are
not in their typical positions or being used
in a way that is not consistent with their
usual function?
4. Is there any visible damage or unusual
movement in the video that indicates an
anomaly?
5. Are there any unusual sounds or noises
in the video that suggest an anomaly?
... ③
!: There are no people in the
video who are not in their typical
positions or engaging in activities
that are not consistent with their
usual behavior. There are no
vehicles in the video that are not
in their typical positions or being
used in a way that is not consistent
with their usual function. There
are no objects in the video that are
not in their typical positions or
being used in a way that is not
consistent with their usual
function. There is no visible
damage or unusual movement in
the video that indicates an
anomaly. There are no unusual
sounds or noises in the video that
suggest an anomaly.
!: There are no people in the
video who are not in their typical
positions or engaging in activities
that are not consistent with their
usual behavior. There are no
vehicles in the video that are not
in their typical positions or being
used in a way that is not consistent
with their usual function. There
are no objects in the video that are
not in their typical positions or
being used in a way that is not
consistent with their usual
function. There is no visible
damage or unusual movement in
the video that indicates an
anomaly. There are no unusual
sounds or noises in the video that
suggest an anomaly.
Figure 10. Given the normal video “Normal Videos 018 x264”, the frozen VLM (InternVL2-8B) can conclude that no anomaly happens in the video under
the guidance of Q∗ , which is aligned with the ground truth. Since the anomaly scores for all scenes are zeros by VERA, we do not show the complete
anomaly scores with an additional figure.
Sensitivity Test for σ1 . The AUC performance is also robust on the choice of σ1 . As, shown in Fig. 15, when we set
σ1 greater than 1, the AUC generally remains around 86.50%, which again shows the robustness of the design of anomaly
scoring in VERA. We can set σ1 as 10 for VERA.
σ1
1
5
10
15
20
AUC (%)
86.17
86.49
86.55
86.49
86.54
Table 15. Influence of σ1 in Gaussian Smoothing on AUC.
Sensitivity Test for τ . The temperature hyperparameter τ in Eq. (4) controls the entropy of the distribution obtained from
the Softmax function while preserving the rank of each element. As demonstrated in Table 16, when τ is a small number
like 10e-8 that is close to 0, the distributions tend to become a trivial distribution with all mass concentrated on the highestprobability class (corresponding to the segment itself), and the result is the same as the one by not using retrieval. As we
gradually increase τ to a reasonably large number (from 0.01 to 1), the AUC value converges around 86.55% with no obvious
fluctuation, again proving the robustness of anomaly scoring in VERA regarding hyperparameter selection. Note that when
τ approaches +∞, the distribution tends to become a uniform distribution, which yields an AUC of 86.59%. From the
discussion above, we can generally choose τ to be an number in [0.01, 1] in implementation.
τ
10e-8
0.01
0.1
1
+∞
AUC (%)
85.21
86.31
86.55
86.58
86.59
Table 16. Influence of τ in Eq. (4) on AUC.
Sensitivity Test for σ2 . From Table 17, we find that setting σ2 = 0.5F encodes the position information best in the anomaly
score. A drop is noticeable if we choose σ2 less than 0.5F for it will not cover the whole sequence, which is reasonable,
while choosing a σ2 great than 0.5F does not change much. Thus, based on the physical meaning of σ2 , which controls the
width of the distribution, we should make σ2 equal to 0.5F in anomaly scoring.
σ2
w/o Weighting
0.25
0.5
0.75
AUC (%)
85.48
85.43
86.55
86.27
Table 17. Influence of σ2 in Position Weighting on AUC.
B.4. Additional Qualitative Results & Case Studies
W.l.o.g., we take one normal video (“Normal Videos 018 x264”) and another abnormal video (“RoadAccidents127 x264”)
from the UCF-Crime dataset to demonstrate the explanations provided by a frozen VLM (InternVL2-8B) achieved by using
the learned guiding questions Q∗ .
First, in Fig. 10 we showcase the explanation of anomaly scoring by VERA regarding a normal video “Normal Videos 018 x264” in UCF-Crime, which is taken in an airport hallway where no anomaly happens. For this video,
19
Learned Guiding Questions !∗ in VERA
1. Are there any people in the video who are ①
... ②
... ③
not in their typical positions or engaging in
activities that are not consistent with their
usual behavior?
!: The video shows a red truck
!: The video shows people
!: There are no people in the
2. Are there any vehicles in the video that are
not in their typical positions or being used
in a way that is not consistent with their
usual function?
driving on a road with other
vehicles and pedestrians. The
truck appears to be in its
typical position and engaging
in a consistent activity.
Therefore, there is no anomaly
in the video.
3. Are there any objects in the video that are
not in their typical positions or being used
in a way that is not consistent with their ④
usual function?
There is no visible damage or unusual
movement, and no unusual sounds or
noises are present. Therefore, there is
no anomaly in the video.
video, and the vehicles and
objects are in their typical
positions and functions. There is
no visible damage or unusual
movement, and no unusual sounds
or noises are present. Therefore,
there is no anomaly in the video.
... ⑥
... ⑤
4. Is there any visible damage or unusual !: There are no people in the video,
movement in the video that indicates an and the vehicles and objects are in
their typical positions and functions.
anomaly?
5. Are there any unusual sounds or noises
in the video that suggest an anomaly?
crossing the street, but one
person is not in a typical
position and is running,
which could be an anomaly.
Based on the analysis, there is
an anomaly in the video.
!: The video shows a red
truck colliding with a train at a
crossing. This is an unusual
and dangerous situation,
indicating an anomaly.
!: The video shows a train on
the tracks, and there are no people
or vehicles in unusual positions or
engaging in activities that are not
consistent with their usual
behavior. Based on the analysis,
there is no anomaly in the video.
Figure 11. Given the abnormal video “RoadAccidents127 x264”, the frozen VLM (InternVL2-8B) can generate reasonable explanations aligned with the
semantic change observed in each scene under the guidance of Q∗ . The complete anomaly scores are shown in Fig. 12.
Anomaly Score
VERA assigns a 0 score to each frame. As shown in Fig. 10, for the selected scenes in this video, VERA explains that this
is because there are no events that conform to the anomaly descriptions in Q∗ . Such explanations are consistent with the
recording and again manifest the effectiveness of eliciting the reasoning ability in a frozen VLM for VAD by using learned
guiding questions. Note that we do not have an additional figure illustrating the anomaly score dynamic for this video because
all scenes are assigned 0 scores by VERA.
Next, we select 6 representative scenes in the abnormal video (“RoadAccidents127 x264”) and show the corresponding explanation provided by the frozen
①
④
⑤ ⑥
③
②
0.4
VLM in Fig. 11. The main anomaly that happens in this video is a traffic accident
where a truck crashes into a train from Frame 2160 to Frame 2299, which corresponds
0.3
to the 5th scene in Fig. 11. In particular, the figure shows that the learned question
0.2
“Is there any visible damage or unusual movement in the video that indicates an
0.1
anomaly?” in Q∗ makes the frozen VLM find a good way to express what it sees
0.0
in the 5th scene and understand this is an anomaly because the crash is unusual and
∗
500
1000
1500
2000
2500
dangerous. The other scenes are also well explained by the frozen VLM under Q .
0
Frame Index
Thus, this again verifies that the learned guiding questions can successfully trigger
Figure 12. Anomaly scores generated by
reasonable explanations in the adopted frozen VLM for VAD.
VERA (with InternVL2-8B) in “RoadAcciMeanwhile, we also include the anomaly scores generated by VERA for the ab- dents127 x264” from UCF-Crime.
normal video in Fig. 12. Most frames are assigned to zero except the scenes when
someone crosses the road at an unusual speed (the 2nd scene in Fig. 11) and the truck-train crash happens (the 5th scene in
Fig. 11). This fluctuation is aligned with the ground truth annotation and common sense about an anomaly, which shows that
the anomaly scoring proposed in VERA is reasonable.
20
0
You can add this document to your study collection(s)
Sign in Available only to authorized usersYou can add this document to your saved list
Sign in Available only to authorized users(For complaints, use another form )