Name: CLEBESON CANUTO DOS SANTOS
Type: PhD thesis
Publication date: 17/12/2020
Advisor:
Name | Role |
---|---|
RAQUEL FRIZERA VASSALLO | Advisor * |
Examining board:
Name | Role |
---|---|
RAQUEL FRIZERA VASSALLO | Advisor * |
DOUGLAS ALMONFREY | External Examiner * |
PATRICK MARQUES CIARELLI | Internal Examiner * |
Summary: This thesis aims to investigate and propose mechanisms for recognizing and anticipating dynamic gestures and actions based only on computer vision. Three proposals focus on gesture recognition: Star RGB - a representation that condenses the motion contained in the frames of a video into a single RGB image; Star iRGB - an iterative version of Star RGB that can be used by learning models of a sequential nature; and Star iRGBhand - an
iterative model for recognizing gestures that uses the shape of the hands as context. For action anticipation, Bayesian models based on recurrent neural networks are presented, which use context information to reduce the ambiguity between similar movements, together with a threshold on the estimated epistemic uncertainty to decide when an action should be anticipated. In this context, two models have been proposed to recognize and
anticipate gestures online. All proposals were validated through several experiments whose results were compared to several baselines. Three main datasets were used: Montalbano, for gestures captured by only one camera; IS-Gesture, for gestures captured in a multi-camera environment; and Acticipate, for action anticipation. The results achieved with the gesture recognition models were the best for the Montalbano dataset when considering
works that use only RGB images. Even when compared to multimodal models based on 3D CNNs, the results are among the best, just slightly behind (less than 1%) two multimodal proposals. In the task of anticipating actions, the recognition and anticipation accuracies obtained on the Acticipate dataset were the best achieved so far. Finally, considering the models that aim to recognize and anticipate gestures online, the proposed model that works with only one camera has also achieved results among the best in the literature for the Montalbano dataset. For IS-Gesture, which represents the most complex challenge due to the multi-camera environment, the average accuracy of recognition and anticipation of gestures was considered satisfactory, with clear indications of where improvements should be made to achieve better results. Regarding execution time, the proposed models were all able to provide information for an application that requires a frame rate of up to 10 FPS. Thus, it is possible to use such models in an
interactive application in real time, in an environment with one or several cameras. In summary, all the proposals proved very promising, obtaining results that surpass those of the main related works on the previously mentioned datasets.
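
The summary mentions a threshold on the estimated epistemic uncertainty to decide when an action should be anticipated. The following is only a minimal sketch of how such a decision rule might look, not the thesis implementation. It assumes that the Bayesian model is sampled several times for the same partial observation (for example, with Monte Carlo dropout) and that epistemic uncertainty is measured as the mutual information between the prediction and the model parameters; the summary does not state which estimator is used. The names `entropy`, `mutual_information`, and `anticipate`, and the threshold value, are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mutual_information(samples):
    """Return (mean prediction, mutual information) for stochastic passes.

    `samples` is a list of class-probability vectors, one per stochastic
    forward pass of the (assumed) Bayesian model.
    Mutual information = entropy of the mean prediction minus the mean
    entropy of the individual predictions. It grows when the sampled
    models disagree with each other.
    """
    n_samples = len(samples)
    n_classes = len(samples[0])
    mean = [sum(s[c] for s in samples) / n_samples for c in range(n_classes)]
    expected_entropy = sum(entropy(s) for s in samples) / n_samples
    return mean, entropy(mean) - expected_entropy


def anticipate(samples, threshold):
    """Return the predicted class index, or None to keep observing.

    Assumption: the action is anticipated only when the estimated
    epistemic uncertainty does not exceed `threshold`.
    """
    mean, uncertainty = mutual_information(samples)
    if uncertainty <= threshold:
        return max(range(len(mean)), key=mean.__getitem__)
    return None


# Sampled models agree: low uncertainty, so a class is returned early.
agreeing = [[0.9, 0.1], [0.9, 0.1]]
# Sampled models disagree: high uncertainty, so the decision is deferred.
disagreeing = [[0.9, 0.1], [0.1, 0.9]]

early_decision = anticipate(agreeing, threshold=0.05)
deferred_decision = anticipate(disagreeing, threshold=0.05)
```

In an online setting, a rule of this kind would be evaluated at every new frame, with the model continuing to observe the sequence until the uncertainty drops below the threshold.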