| Before training the visual detector, we first extract at most 1000 visual proposals using Edge Boxes from OpenCV. 
					We choose Edge Boxes over CNN-based alternatives because it is a generic, dataset-independent proposal generation technique, whereas supervised 
					CNN-based proposal generators are trained end-to-end on a specific dataset. We use PyTorch as our training framework. A convolutional 
					feature map is obtained, before ROIAlign, from the convolutional layers of a VGG16 network pretrained on ImageNet; the first 
					two fully-connected (FC) layers are then used as the box classifier to extract visual features. We preserve the original aspect ratio of the images and resize them to five 
					different scales {480, 576, 688, 864, 1200} as described here. As data augmentation, we apply random horizontal flips to the images 
					and choose a scale at random during training. At test time, we average the outputs over 10 images (the 5 scales and their horizontal flips). The audio is split into 
					960 ms non-overlapping frames. Each 960 ms frame is decomposed with a short-time Fourier transform using 25 ms windows taken every 10 ms. 
					The resulting spectrogram is integrated into 64 mel-spaced frequency bins, and the magnitude of each bin is log-transformed after a small offset is added to prevent numerical difficulties. This produces 
					log-mel spectrogram patches with 96 × 64 bins for each 960 ms audio region. Before training the detector, we extract the audio convolutional feature maps for each region offline 
					using the convolutional layers of the VGGish network pretrained on a large YouTube dataset. These are fed to two FC layers during training. 
					We use the indirect path, the attention path, and the combination of paths only during inference. They are not part of training, but the audio-visual similarity on which they rely 
					is learned during training. The learnable temperature parameter ρ is 0.07, and the weighting 
					parameters Λ1, Λ2, and Λ3 are 0.6, 0.2, and 0.2, respectively. We observe that different Λ values affect the speed of training. Once all losses converge, however, 
					the results are the same, so the final results of our approach are not sensitive to the choice of Λ values. We use the Adam optimizer by experimenting 
					with different learning rates and weight decays. A learning rate of 1e-5 and a weight decay of 5e-4 perform best on both datasets. A batch size of 1 is used for audio-only and 
					video-only training, whereas a batch size of B = 12 is used for joint training with the NCE contrastive loss. The models are jointly trained for 20K iterations. |
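
The multi-scale augmentation and test-time averaging described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which image side the scale refers to, so the sketch assumes each scale is the target length of the shorter side, and the helper names (`resized_shape`, `sample_train_view`, `inference_views`, `average_outputs`) are hypothetical rather than taken from the authors' code.

```python
import random

# The five scales listed in the text.
SCALES = (480, 576, 688, 864, 1200)


def resized_shape(height, width, scale):
    """Target (height, width) that preserves the aspect ratio.

    Assumption: `scale` is the length of the shorter image side.
    """
    ratio = scale / min(height, width)
    return round(height * ratio), round(width * ratio)


def sample_train_view(rng):
    """Training: pick one scale at random and flip with probability 0.5."""
    return rng.choice(SCALES), rng.random() < 0.5


def inference_views():
    """Testing: every scale, with and without a horizontal flip (10 views)."""
    return [(scale, flip) for scale in SCALES for flip in (False, True)]


def average_outputs(outputs):
    """Element-wise mean over the per-view score vectors."""
    n_views = len(outputs)
    return [sum(column) / n_views for column in zip(*outputs)]
```

For example, `sample_train_view(random.Random(0))` draws one (scale, flip) pair for a training step, while `inference_views()` enumerates the ten views whose detector outputs would be passed to `average_outputs`.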
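
The audio front end described above (25 ms windows every 10 ms, 64 mel bins, log with an offset, 96 × 64 patches per 960 ms) can be approximated with the NumPy sketch below. Several values are assumptions that the text does not state: the 16 kHz sample rate, the 512-point FFT, the Hann window, the 125 to 7500 Hz mel range, and the offset of 0.01. The sketch also computes the spectrogram over the whole waveform and then cuts it into non-overlapping groups of 96 frames, which is one way to obtain exactly 96 frames per 960 ms region.

```python
import numpy as np

SAMPLE_RATE = 16000                 # assumption: not stated in the text
WIN = SAMPLE_RATE * 25 // 1000      # 25 ms window in samples
HOP = SAMPLE_RATE * 10 // 1000      # 10 ms hop in samples
N_FFT = 512                         # assumption: next power of two above WIN
N_MELS = 64                         # from the text
PATCH_FRAMES = 96                   # 96 frames x 10 ms = 960 ms
LOG_OFFSET = 0.01                   # assumption: the "small offset" value


def hz_to_mel(freq_hz):
    """Convert frequencies in Hz to the mel scale."""
    return 1127.0 * np.log1p(np.asarray(freq_hz, dtype=float) / 700.0)


def mel_filterbank(f_min=125.0, f_max=7500.0):
    """Triangular mel filters, shape (N_FFT // 2 + 1, N_MELS)."""
    fft_mels = hz_to_mel(np.linspace(0.0, SAMPLE_RATE / 2, N_FFT // 2 + 1))
    edges = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), N_MELS + 2)
    bank = np.zeros((N_FFT // 2 + 1, N_MELS))
    for m in range(N_MELS):
        lower, center, upper = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_mels - lower) / (center - lower)
        falling = (upper - fft_mels) / (upper - center)
        bank[:, m] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def log_mel_patches(wave):
    """Return log-mel patches of shape (n_patches, 96, 64)."""
    wave = np.asarray(wave, dtype=float)
    n_frames = 1 + (len(wave) - WIN) // HOP
    # Index matrix that slices the waveform into overlapping windows.
    index = np.arange(WIN)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = wave[index] * np.hanning(WIN)
    magnitude = np.abs(np.fft.rfft(frames, n=N_FFT, axis=1))
    log_mel = np.log(magnitude @ mel_filterbank() + LOG_OFFSET)
    # Keep only complete, non-overlapping 960 ms groups of frames.
    n_patches = n_frames // PATCH_FRAMES
    usable = log_mel[: n_patches * PATCH_FRAMES]
    return usable.reshape(n_patches, PATCH_FRAMES, N_MELS)
```

Under these assumptions, a 10-second waveform yields ten patches, each of which would then be passed through the pretrained audio network to obtain the offline feature maps.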