Before training the visual detector, we first extract at most 1000 visual proposals using Edge Boxes, as implemented in OpenCV. We choose Edge Boxes over CNN-based alternatives because it is a generic, dataset-independent proposal generation technique, whereas supervised CNN-based proposal networks are trained end-to-end on a specific dataset. We use PyTorch as our training framework.
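For concreteness, the proposal extraction step could be sketched as follows with OpenCV's contrib `ximgproc` module (this is an illustrative sketch, not our exact script; the structured-edge model file and image path are placeholders):

```python
import cv2
import numpy as np

# Structured-edge model shipped with OpenCV's extra modules (placeholder path).
edge_detector = cv2.ximgproc.createStructuredEdgeDetection("model.yml.gz")

img = cv2.imread("frame.jpg")
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0

edges = edge_detector.detectEdges(rgb)
orientation = edge_detector.computeOrientation(edges)
edges = edge_detector.edgesNms(edges, orientation)

eb = cv2.ximgproc.createEdgeBoxes()
eb.setMaxBoxes(1000)                     # keep at most 1000 proposals per image
result = eb.getBoundingBoxes(edges, orientation)
# Recent OpenCV versions return (boxes, scores); older ones return boxes only.
boxes = result[0] if isinstance(result, tuple) else result   # (x, y, w, h) proposals
```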
A convolutional feature map is obtained from the convolutional layers of a VGG16 network pretrained on ImageNet; ROIAlign is applied to this map, and the first two fully-connected (FC) layers of VGG16 are then used as the box classifier to extract visual features.
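A minimal sketch of this feature path with torchvision is given below (the image and box tensors are dummies, torchvision >= 0.13 is assumed for the `weights` argument, and the spatial scale of 1/16 reflects the stride of the VGG16 conv5 block):

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Conv backbone: VGG16 pretrained on ImageNet, with the last max-pool dropped (stride 16).
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(vgg.features.children())[:-1]).eval()
# First two FC layers of the VGG16 classifier, used as the box feature extractor.
fc_head = torch.nn.Sequential(*list(vgg.classifier.children())[:5]).eval()

image = torch.rand(1, 3, 688, 920)                  # one resized image (dummy data)
boxes = [torch.tensor([[30., 40., 220., 300.],      # Edge Boxes proposals, (x1, y1, x2, y2)
                       [100., 80., 400., 500.]])]

with torch.no_grad():
    fmap = backbone(image)                          # [1, 512, H/16, W/16]
    pooled = roi_align(fmap, boxes, output_size=(7, 7),
                       spatial_scale=1.0 / 16, sampling_ratio=2)
    feats = fc_head(pooled.flatten(start_dim=1))    # [num_boxes, 4096] visual features
```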
We preserve the original aspect ratio of the images and resize them to five different scales {480, 576, 688, 864, 1200}, as described in prior work. As data augmentation, we apply random horizontal flips to the images and choose a scale at random during training. At test time, we average the outputs over 10 versions of each image (the 5 scales and their horizontal flips).
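The test-time averaging could be sketched as below, assuming the scale value is applied to the shorter image side and that `model` returns a tensor of scores; rescaling and flipping the proposal boxes consistently with the image is omitted for brevity:

```python
import torch
import torch.nn.functional as F

SCALES = (480, 576, 688, 864, 1200)

def resize_keep_ratio(image, target):
    # image: [3, H, W]; rescale so the shorter side equals `target`,
    # preserving the original aspect ratio.
    _, h, w = image.shape
    s = target / min(h, w)
    new_size = (int(round(h * s)), int(round(w * s)))
    return F.interpolate(image[None], size=new_size, mode="bilinear",
                         align_corners=False)[0]

def tta_average(model, image):
    # Average the outputs over the 5 scales and their horizontal flips (10 passes).
    outputs = []
    for target in SCALES:
        resized = resize_keep_ratio(image, target)
        for flip in (False, True):
            inp = torch.flip(resized, dims=[-1]) if flip else resized
            outputs.append(model(inp[None]))
    return torch.stack(outputs).mean(dim=0)
```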
The audio is separated into non-overlapping 960 ms frames. Each frame is decomposed with a short-time Fourier transform using 25 ms windows every 10 ms. The resulting spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log-transformed after adding a small offset to avoid numerical issues. This produces log-mel spectrogram patches of 96 x 64 bins for each 0.96-second audio region.
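These parameters match the standard VGGish front end; a sketch with torchaudio is shown below, where the 16 kHz sampling rate, the 125–7500 Hz mel range, and the 0.01 log offset are assumptions taken from the VGGish defaults:

```python
import torch
import torchaudio

SAMPLE_RATE = 16000                          # VGGish front end assumes 16 kHz mono audio
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400, win_length=400,               # 25 ms window
    hop_length=160,                          # 10 ms hop
    n_mels=64,
    f_min=125.0, f_max=7500.0,
)

waveform, sr = torchaudio.load("clip.wav")
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
waveform = waveform.mean(dim=0)              # convert to mono

spec = mel(waveform)                         # [64, T] mel spectrogram
log_mel = torch.log(spec + 0.01)             # small offset avoids log(0)

# Split into non-overlapping 0.96 s patches of 96 frames -> [num_patches, 96, 64]
frames = log_mel.t()                         # [T, 64]
num_patches = frames.shape[0] // 96
patches = frames[: num_patches * 96].reshape(num_patches, 96, 64)
```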
Before training the detector, we extract the audio convolutional feature maps for each region offline, using the CNN layers of a VGGish network pre-trained on a large YouTube dataset. These feature maps are fed to two FC layers during training.
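The offline extraction loop could look roughly like the sketch below; `load_pretrained_vggish_conv_layers` and `dataset` are hypothetical placeholders, since how the pretrained VGGish convolutional stack is loaded depends on the VGGish port being used:

```python
import torch

# Hypothetical loader for the pretrained VGGish convolutional stack; it is assumed
# to map a [N, 1, 96, 64] batch of log-mel patches to conv feature maps of shape
# [N, 512, 6, 4] (the output size of the VGGish conv layers).
vggish_conv = load_pretrained_vggish_conv_layers()   # placeholder, not a real API
vggish_conv.eval()

features = {}
with torch.no_grad():
    for clip_id, patches in dataset:                 # patches: [N, 96, 64] log-mel regions
        x = patches.unsqueeze(1).float()             # add a channel dim -> [N, 1, 96, 64]
        features[clip_id] = vggish_conv(x).cpu()     # cache conv feature maps offline

torch.save(features, "audio_conv_features.pt")       # loaded later when training the FC head
```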
The indirect path, the attention path, and the combination of paths are used only during inference; they are not part of training, but the audio-visual similarity they rely on is learned during training. The learnable temperature parameter ρ is set to 0.07, and the weighting parameters λ1, λ2, and λ3 are 0.6, 0.2, and 0.2, respectively. We observe that different λ values affect the speed of training, but once all losses converge the results are the same, so our approach is not sensitive to the choice of λ values.
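As an illustration, an NCE-style objective with a learnable temperature and the weighted loss combination could be written as below; the symmetric InfoNCE form, the log-temperature parameterisation, and the three placeholder loss terms are assumptions rather than the exact formulation used here:

```python
import torch
import torch.nn.functional as F

# Learnable temperature, initialised so that rho = exp(log_rho) = 0.07.
log_rho = torch.nn.Parameter(torch.log(torch.tensor(0.07)))

def nce_loss(audio_emb, visual_emb):
    # audio_emb, visual_emb: [B, D] embeddings of the B matching audio-visual pairs.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / log_rho.exp()               # [B, B] similarity matrix
    targets = torch.arange(a.size(0), device=logits.device)
    # Symmetric InfoNCE: diagonal entries are the positive pairs.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Weighted combination of the training losses, with the values reported above.
lam1, lam2, lam3 = 0.6, 0.2, 0.2
def total_loss(loss_1, loss_2, loss_3):              # placeholder component losses
    return lam1 * loss_1 + lam2 * loss_2 + lam3 * loss_3
```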
We use the Adam optimizer and experiment with different learning rates and weight decays; a learning rate of 1e-5 and a weight decay of 5e-4 perform best for both datasets. While a batch size of 1 is used for audio-only and video-only training, we use a batch size of B = 12 for joint training with the NCE contrastive loss. The models are jointly trained for 20K iterations.
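A sketch of the joint-training loop with these settings is given below; `model`, `train_loader` (batch size 12), and `compute_losses` are placeholders standing in for the actual detector, data pipeline, and weighted loss described above:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=5e-4)

step = 0
while step < 20_000:                      # 20K joint-training iterations
    for batch in train_loader:            # batch size B = 12
        loss = compute_losses(batch)      # weighted sum of the individual losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= 20_000:
            break
```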