Name: FÁBIO RICARDO OLIVEIRA BENTO
Publication date: 20/10/2023
Supervisors:

| Name | Role |
| --- | --- |
| PATRICK MARQUES CIARELLI | Co-advisor |
| RAQUEL FRIZERA VASSALLO | Advisor |
Examining board:

| Name | Role |
| --- | --- |
| JUGURTA ROSA MONTALVÃO FILHO | External Examiner |
| MARIANA RAMPINELLI FERNANDES | External Examiner |
| PATRICK MARQUES CIARELLI | Internal Examiner |
| PLINIO MORENO LÓPEZ | External Examiner |
| RAQUEL FRIZERA VASSALLO | Advisor |
Abstract: The anomaly detection problem involves identifying events that do not follow an expected pattern of behavior. This work addresses the problem of automatically detecting abnormal activity in videos using only information from the frames themselves, which is especially useful when auxiliary data from object detection, tracking, or human pose estimation are unavailable or unreliable. The initial approach adopts convolutional neural networks to extract spatial features, followed by a time-series classifier composed of a one-dimensional convolution layer and a set of stacked recurrent neural networks. The proposed methodology selects a pre-trained convolutional architecture as a feature extractor and uses transfer learning to specialize another network with the same architecture for detecting anomalies in surveillance videos. Experiments were conducted on the UCSD Anomaly Detection and CUHK Avenue datasets to compare the proposed approach with other studies. The evaluation protocol uses the Area Under the Receiver Operating Characteristic Curve (AUC), the Area Under the Precision vs. Recall Curve (AUPRC), and the Equal Error Rate (EER) as metrics. In the experiments, the model achieved an AUC greater than 92% and an EER below 9%, results consistent with the current literature.
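As a rough illustration of this first pipeline (a CNN for spatial features, then a 1-D convolution over time feeding stacked recurrent layers), the sketch below uses PyTorch; the resnet18 backbone, GRU cells, and layer sizes are our assumptions for readability, not the thesis's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18  # stand-in for the pre-trained extractor

class FrameSequenceClassifier(nn.Module):
    """Spatial CNN features -> 1-D temporal convolution -> stacked RNNs.
    Backbone and dimensions are illustrative (torchvision >= 0.13 API)."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the FC head
        self.temporal_conv = nn.Conv1d(feat_dim, hidden, kernel_size=3, padding=1)
        self.rnn = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)  # stacked recurrent layers
        self.head = nn.Linear(hidden, 1)  # per-frame anomaly logit

    def forward(self, clips):  # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).flatten(1)   # (B*T, feat_dim)
        feats = feats.view(b, t, -1).transpose(1, 2)       # (B, feat_dim, T)
        feats = self.temporal_conv(feats).transpose(1, 2)  # (B, T, hidden)
        out, _ = self.rnn(feats)
        return self.head(out).squeeze(-1)                  # (B, T) anomaly scores
```

A sigmoid over the per-frame logits then yields anomaly probabilities that can be thresholded or fed to the ROC-based metrics above.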
We next propose a model that learns both global and local features of video frames. At the frame level, an FPN (Feature Pyramid Network)-based architecture extracts global features; at the patch level, a ViT (Vision Transformer)-based architecture extracts local features.
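A minimal sketch of such a two-branch extractor, assuming a ResNet-50 FPN from torchvision and a ViT from timm (both backbone choices are ours, and the exact APIs depend on library versions):

```python
import torch
import timm
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# Illustrative branches; the thesis's actual FPN and ViT variants may differ.
fpn = resnet_fpn_backbone(backbone_name="resnet50", weights=None)   # global, frame-level branch
vit = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)  # local, patch-level branch

frames = torch.randn(2, 3, 224, 224)          # toy batch of two RGB frames
global_maps = fpn(frames)                     # dict of pyramid levels ('0'..'3', 'pool')
patch_tokens = vit.forward_features(frames)   # (2, 197, 768): class token + 14x14 patch tokens
```

The pyramid maps summarize each frame as a whole, while the patch tokens describe localized regions; both can be projected and concatenated into the per-frame embedding consumed by the sequential classifier described next.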
We then employ a sequential classifier that combines Transformer and LSTM (Long Short-Term Memory) networks to generate an anomaly score for each frame from a sequence of position-encoded embeddings. During training, we use the Class-Balanced Focal Loss (CBFL) to handle the imbalance between classes: it assigns larger weights to classes with fewer samples, ensuring a
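For context, the standard Class-Balanced Focal Loss (Cui et al., 2019), which we assume is the formulation meant here, scales the focal loss of each class by the inverse of its effective number of samples:

```latex
\mathrm{CBFL}(p_y, y) = -\,\frac{1-\beta}{1-\beta^{\,n_y}}\,(1-p_y)^{\gamma}\,\log(p_y)
```

where \(p_y\) is the predicted probability of the true class \(y\), \(n_y\) is the number of training samples of class \(y\), \(\beta \in [0,1)\) sets the re-weighting strength, and \(\gamma\) is the focal parameter. The factor \((1-\beta)/(1-\beta^{\,n_y})\) grows as \(n_y\) shrinks, which is what gives rare classes the larger weights described above.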
balanced contribution of each class to the overall loss. CBFL improves model performance on imbalanced classification tasks, especially when dealing with underrepresented classes, such as the abnormal class in the context of video anomaly detection. We perform experiments on the UBnormal dataset to evaluate our approach and compare our results with existing work. In addition, we analyze frame-level anomaly scores over time and t-SNE plots for further insight. Our results, evaluated with the micro-averaged AUC and macro-averaged AUC metrics, are consistent with the current state of the art.
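In the usual UBnormal protocol, the micro-averaged AUC is computed over the frame scores of all videos concatenated, while the macro-averaged AUC is the mean of per-video AUCs. A small scikit-learn sketch under that assumption (the function name and the single-class skipping rule are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def micro_macro_auc(scores_per_video, labels_per_video):
    """Illustrative evaluation: micro = one ROC over all frames pooled,
    macro = mean of per-video ROC AUCs (all-normal videos are skipped,
    since ROC AUC is undefined with a single class)."""
    micro = roc_auc_score(np.concatenate(labels_per_video),
                          np.concatenate(scores_per_video))
    per_video = [roc_auc_score(y, s)
                 for y, s in zip(labels_per_video, scores_per_video)
                 if len(set(y)) > 1]
    return micro, float(np.mean(per_video))
```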
Keywords: smart cities, computer vision, deep learning, anomaly detection.