Before training the visual detector, we first extract at most 1000 visual proposals using Edge Boxes, as implemented in OpenCV. We choose Edge Boxes over CNN-based alternatives because it is a generic, dataset-independent proposal generation technique, whereas supervised CNN-based proposal networks are trained end-to-end on a specific dataset. We use PyTorch as our training framework.
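For concreteness, the proposal extraction step could be sketched as follows with OpenCV's contrib `ximgproc` module (this is an illustrative sketch, not our exact script; the structured-edge model file and image path are placeholders):

```python
import cv2
import numpy as np

# Structured-edge model shipped with OpenCV's extra modules (placeholder path).
edge_detector = cv2.ximgproc.createStructuredEdgeDetection("model.yml.gz")

img = cv2.imread("frame.jpg")
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0

edges = edge_detector.detectEdges(rgb)
orientation = edge_detector.computeOrientation(edges)
edges = edge_detector.edgesNms(edges, orientation)

eb = cv2.ximgproc.createEdgeBoxes()
eb.setMaxBoxes(1000)                     # keep at most 1000 proposals per image
result = eb.getBoundingBoxes(edges, orientation)
# Recent OpenCV versions return (boxes, scores); older ones return boxes only.
boxes = result[0] if isinstance(result, tuple) else result   # (x, y, w, h) proposals
```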
A convolutional feature map is obtained from the convolutional layers of a VGG16 network pretrained on ImageNet; ROIAlign is applied to this map, and the first two fully-connected (FC) layers of VGG16 are then used as the box classifier to extract visual features.
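A minimal sketch of this feature path with torchvision is given below (the image and box tensors are dummies, torchvision >= 0.13 is assumed for the `weights` argument, and the spatial scale of 1/16 reflects the stride of the VGG16 conv5 block):

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Conv backbone: VGG16 pretrained on ImageNet, with the last max-pool dropped (stride 16).
vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(vgg.features.children())[:-1]).eval()
# First two FC layers of the VGG16 classifier, used as the box feature extractor.
fc_head = torch.nn.Sequential(*list(vgg.classifier.children())[:5]).eval()

image = torch.rand(1, 3, 688, 920)                  # one resized image (dummy data)
boxes = [torch.tensor([[30., 40., 220., 300.],      # Edge Boxes proposals, (x1, y1, x2, y2)
                       [100., 80., 400., 500.]])]

with torch.no_grad():
    fmap = backbone(image)                          # [1, 512, H/16, W/16]
    pooled = roi_align(fmap, boxes, output_size=(7, 7),
                       spatial_scale=1.0 / 16, sampling_ratio=2)
    feats = fc_head(pooled.flatten(start_dim=1))    # [num_boxes, 4096] visual features
```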
We preserve the original aspect ratio of the images and resize them to five different scales {480, 576, 688, 864, 1200}, as described in prior work. As data augmentation, we apply random horizontal flips to the images and choose a scale at random during training. At test time, we average the outputs over 10 versions of each image (the 5 scales and their horizontal flips).
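The test-time averaging could be sketched as below, assuming the scale value is applied to the shorter image side and that `model` returns a tensor of scores; rescaling and flipping the proposal boxes consistently with the image is omitted for brevity:

```python
import torch
import torch.nn.functional as F

SCALES = (480, 576, 688, 864, 1200)

def resize_keep_ratio(image, target):
    # image: [3, H, W]; rescale so the shorter side equals `target`,
    # preserving the original aspect ratio.
    _, h, w = image.shape
    s = target / min(h, w)
    new_size = (int(round(h * s)), int(round(w * s)))
    return F.interpolate(image[None], size=new_size, mode="bilinear",
                         align_corners=False)[0]

def tta_average(model, image):
    # Average the outputs over the 5 scales and their horizontal flips (10 passes).
    outputs = []
    for target in SCALES:
        resized = resize_keep_ratio(image, target)
        for flip in (False, True):
            inp = torch.flip(resized, dims=[-1]) if flip else resized
            outputs.append(model(inp[None]))
    return torch.stack(outputs).mean(dim=0)
```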
The audio is separated into non-overlapping 960 ms frames. Each frame is decomposed with a short-time Fourier transform using 25 ms windows every 10 ms. The resulting spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log-transformed after adding a small offset to avoid numerical issues. This produces log-mel spectrogram patches of 96 x 64 bins for each 0.96-second audio region.
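These parameters match the standard VGGish front end; a sketch with torchaudio is shown below, where the 16 kHz sampling rate, the 125–7500 Hz mel range, and the 0.01 log offset are assumptions taken from the VGGish defaults:

```python
import torch
import torchaudio

SAMPLE_RATE = 16000                          # VGGish front end assumes 16 kHz mono audio
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=400, win_length=400,               # 25 ms window
    hop_length=160,                          # 10 ms hop
    n_mels=64,
    f_min=125.0, f_max=7500.0,
)

waveform, sr = torchaudio.load("clip.wav")
if sr != SAMPLE_RATE:
    waveform = torchaudio.functional.resample(waveform, sr, SAMPLE_RATE)
waveform = waveform.mean(dim=0)              # convert to mono

spec = mel(waveform)                         # [64, T] mel spectrogram
log_mel = torch.log(spec + 0.01)             # small offset avoids log(0)

# Split into non-overlapping 0.96 s patches of 96 frames -> [num_patches, 96, 64]
frames = log_mel.t()                         # [T, 64]
num_patches = frames.shape[0] // 96
patches = frames[: num_patches * 96].reshape(num_patches, 96, 64)
```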
Before training the detector, we extract the audio convolutional feature maps for each region offline, using the CNN layers of a VGGish network pre-trained on a large YouTube dataset. These feature maps are fed to two FC layers during training.
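The offline extraction loop could look roughly like the sketch below; `load_pretrained_vggish_conv_layers` and `dataset` are hypothetical placeholders, since how the pretrained VGGish convolutional stack is loaded depends on the VGGish port being used:

```python
import torch

# Hypothetical loader for the pretrained VGGish convolutional stack; it is assumed
# to map a [N, 1, 96, 64] batch of log-mel patches to conv feature maps of shape
# [N, 512, 6, 4] (the output size of the VGGish conv layers).
vggish_conv = load_pretrained_vggish_conv_layers()   # placeholder, not a real API
vggish_conv.eval()

features = {}
with torch.no_grad():
    for clip_id, patches in dataset:                 # patches: [N, 96, 64] log-mel regions
        x = patches.unsqueeze(1).float()             # add a channel dim -> [N, 1, 96, 64]
        features[clip_id] = vggish_conv(x).cpu()     # cache conv feature maps offline

torch.save(features, "audio_conv_features.pt")       # loaded later when training the FC head
```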
The indirect path, the attention path, and the combination of paths are used only during inference; they are not part of training, but the audio-visual similarity they rely on is learned during training. The learnable temperature parameter ρ is set to 0.07, and the weighting parameters λ1, λ2, and λ3 are 0.6, 0.2, and 0.2, respectively. We observe that different λ values affect the speed of training, but once all losses converge the results are the same, so our approach is not sensitive to the choice of λ values.
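As an illustration, an NCE-style objective with a learnable temperature and the weighted loss combination could be written as below; the symmetric InfoNCE form, the log-temperature parameterisation, and the three placeholder loss terms are assumptions rather than the exact formulation used here:

```python
import torch
import torch.nn.functional as F

# Learnable temperature, initialised so that rho = exp(log_rho) = 0.07.
log_rho = torch.nn.Parameter(torch.log(torch.tensor(0.07)))

def nce_loss(audio_emb, visual_emb):
    # audio_emb, visual_emb: [B, D] embeddings of the B matching audio-visual pairs.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)
    logits = a @ v.t() / log_rho.exp()               # [B, B] similarity matrix
    targets = torch.arange(a.size(0), device=logits.device)
    # Symmetric InfoNCE: diagonal entries are the positive pairs.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Weighted combination of the training losses, with the values reported above.
lam1, lam2, lam3 = 0.6, 0.2, 0.2
def total_loss(loss_1, loss_2, loss_3):              # placeholder component losses
    return lam1 * loss_1 + lam2 * loss_2 + lam3 * loss_3
```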
We use the Adam optimizer and experiment with different learning rates and weight decays; a learning rate of 1e-5 and a weight decay of 5e-4 perform best for both datasets. While a batch size of 1 is used for audio-only and video-only training, we use a batch size of B = 12 for joint training with the NCE contrastive loss. The models are jointly trained for 20K iterations.
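A sketch of the joint-training loop with these settings is given below; `model`, `train_loader` (batch size 12), and `compute_losses` are placeholders standing in for the actual detector, data pipeline, and weighted loss described above:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=5e-4)

step = 0
while step < 20_000:                      # 20K joint-training iterations
    for batch in train_loader:            # batch size B = 12
        loss = compute_losses(batch)      # weighted sum of the individual losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= 20_000:
            break
```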