Video Similarity with Self-supervised Transformer Network.

Do you ever wonder how many times your favorite movie exists on digital platforms?

My favorite animated video when I was a child was Happy Hippo. I spent innumerable hours watching videos of the plump hippopotamus on YouTube. One thing I remember clearly, however, is how many copies of the same clip were uploaded. Some were reaction videos; others had funny songs and photo themes in the background. Now that I am older, I wonder: can we figure out how many versions of the same video actually exist? Or, more scientifically, can we compute a probability score that a pair of videos match?


Content-based video retrieval is drawing more and more attention these days. Visual content is one of the most popular types of content on the internet, and it comes with an incredible degree of redundancy: we observe a high number of near or exact duplicates of all types of videos. Even casual users come across the same videos in their daily use of the web, from YouTube to TikTok. Retrieval plays an even more important role in video-related applications such as copyright protection, where movie trailers in particular are targets for re-uploads, general piracy or just reaction videos. Usually, techniques such as zoom, crop or slight distortion are applied to the duplicate so that it is not detected and taken down.

Photo by Kasra Askari on Unsplash


From Visual Spatio-Temporal Relation-Enhanced Network and Perceptual Hashing Algorithms to Fine-grained Spatio-Temporal Video Similarity Learning and Pose-Selective Max Pooling, video similarity is a popular task in the field of computer vision. While exact duplicates can be found in a variety of ways, near duplicates and modified videos still pose a challenge. Most video retrieval systems require a large amount of manually annotated training data, which makes them costly and inefficient. To keep pace with the current rate of video production, an efficient self-supervised technique is needed to address the storage and computation costs.

What is the Self-supervised Video Retrieval Transformer Network?

Building on the paper “Self-supervised Video Retrieval Transformer Network” (He, Xiangteng, Yulin Pan, Mingqian Tang and Yiliang Lv), we replicate the architecture with certain modifications.

To begin, we introduce the Self-Supervised Video Retrieval Transformer Network (SVRTN), which aims at effective retrieval by decreasing the costs of manual annotation, storage space, and similarity search. As indicated in the previous image, it consists primarily of two components: self-supervised video representation learning and a clip-level set transformer network. First, we use temporal and spatial transformations to construct video pairs automatically. Then, via contrastive learning, we use these video pairs as supervision to learn frame-level features. Finally, we use self-attention to aggregate frame-level features into clip-level features, using masked frame modeling to improve robustness. In short, SVRTN leverages self-supervised learning to learn video representations from unlabeled data, and exploits the transformer structure to aggregate frame-level features into clip-level features.

Self-supervised video representation learning learns the representation from pairs of videos and their transformations, which are generated automatically via temporal and spatial transformations, eliminating the significant cost of manual annotation. Because the training data is self-generated, SVRTN can learn from a huge number of unlabeled videos, which improves the generalization of the learned representation.

A clip-level set transformer network aggregates frame-level features into clip-level features, resulting in significant savings in storage space and search complexity. Through the self-attention mechanism it can learn complementary and variant information from the interactions among a clip's frames, and it is invariant to frame permutation and to missing frames, all of which improve the discriminative power and robustness of the clip-level feature. Furthermore, it allows more flexible retrieval modes, including clip-to-clip and frame-to-clip retrieval.

Self-supervised – Self-generation

Once a large number of videos have been collected, temporal and spatial transformations are applied sequentially to these clips to construct the training data.

Temporal Transformations: To create the anchor clip C, we evenly sample N frames at a fixed time interval r. Then a frame I_m is chosen at random from the anchor clip as the content shared by the anchor clip C and the positive clip C+. We treat the chosen frame as the median frame of C+, and sample (N-1)/2 frames backward and (N-1)/2 frames forward from it with a different sampling interval r+.
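The temporal sampling can be sketched in a few lines of Python. This is a minimal illustration, not the code we trained with: the function name, the default values of `n`, `r` and `r_pos`, and the choice to clamp out-of-range indices to the video bounds are all assumptions made for the example.

```python
import random

def sample_anchor_and_positive(num_frames, n=9, r=4, r_pos=2, rng=random):
    """Return frame indices for an anchor clip C and a positive clip C+.

    n is the number of frames per clip (odd, so the shared frame can be the
    median of C+), r is the anchor stride and r_pos the positive stride.
    All default values are illustrative assumptions.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd")
    span = (n - 1) * r
    if num_frames <= span:
        raise ValueError("video is too short for this n and r")

    # Anchor clip: n evenly spaced frames starting at a random offset.
    start = rng.randint(0, num_frames - 1 - span)
    anchor = [start + i * r for i in range(n)]

    # Shared frame I_m: picked at random from the anchor clip.
    shared = rng.choice(anchor)

    # Positive clip: (n - 1) / 2 frames on each side of I_m at stride r_pos.
    # Clamping to the video bounds is an assumption of this sketch.
    half = (n - 1) // 2
    positive = [min(max(shared + k * r_pos, 0), num_frames - 1)
                for k in range(-half, half + 1)]
    return anchor, positive
```

Because the two clips use different strides, they cover different time spans around the same frame, which is what makes the pair a non-trivial positive example.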

Spatial Transformation: We then apply spatial transformations to each frame. Three types of spatial transformation are explored: (a) Photometric transformations, which cover brightness, contrast, hue, saturation, and gamma adjustments, among others. (b) Geometric transformations, which cover horizontal flip, rotation, crop, resize, and translation. (c) Editing transformations, which include effects such as a blurred background, a logo, picture-in-picture, and so on. For the logos, we use the sample dataset of LLD-logo, which consists of 5000 logos (32×32 resolution, PNG). During the training stage, we pick one transformation from each type at random and apply it to the frames of positive clips in order to create new positive clips.
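A rough sketch of the "one transformation per type" selection is shown below, assuming frames are NumPy arrays of shape (H, W, 3) with values in [0, 1]. The individual transformations are deliberately simplified stand-ins (for example, a white square instead of a real LLD-logo image), and the function names and parameter values are hypothetical.

```python
import random
import numpy as np

# Photometric stand-ins
def brightness(frame, delta=0.2):
    return np.clip(frame + delta, 0.0, 1.0)

def gamma(frame, g=1.5):
    return frame ** g

# Geometric stand-ins
def hflip(frame):
    return frame[:, ::-1]

def center_crop(frame, keep=0.8):
    h, w = frame.shape[:2]
    ch, cw = int(h * keep), int(w * keep)
    top, left = (h - ch) // 2, (w - cw) // 2
    return frame[top:top + ch, left:left + cw]

# Editing stand-ins
def add_logo(frame, size=8):
    out = frame.copy()
    out[:size, :size] = 1.0  # white square stands in for a real logo image
    return out

def picture_in_picture(frame, factor=4):
    out = frame.copy()
    small = frame[::factor, ::factor]  # crude downsample of the same frame
    sh, sw = small.shape[:2]
    out[-sh:, -sw:] = small            # paste into the bottom-right corner
    return out

PHOTOMETRIC = [brightness, gamma]
GEOMETRIC = [hflip, center_crop]
EDITING = [add_logo, picture_in_picture]

def transform_clip(frames, rng=random):
    """Pick one transformation per type and apply all three to every frame."""
    chosen = [rng.choice(group) for group in (PHOTOMETRIC, GEOMETRIC, EDITING)]
    out = []
    for frame in frames:
        for transform in chosen:
            frame = transform(frame)
        out.append(frame)
    return out, [t.__name__ for t in chosen]
```

Here the three choices are drawn once per clip, so every frame of a positive clip is altered in the same way; drawing them per frame instead would be an equally valid reading of the description.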

Triplet Loss

Triplet loss is a loss function in which a reference input (the anchor) is compared to a matching (positive) input and a non-matching (negative) input. The distance from the anchor to the positive is minimized, and the distance from the anchor to the negative is maximized. In our project, we use triplet loss instead of contrastive loss at both the frame level and the clip level.
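As a concrete reference, here is the standard margin-based form of the triplet loss for a single triplet, written in plain Python. The Euclidean distance and the margin value of 0.2 are assumptions for this sketch; the text above does not fix either.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)

# An "easy" triplet: the negative is already far away, so the loss is zero.
easy = triplet_loss([0.0, 0.0], [0.0, 0.1], [1.0, 0.0])

# A "hard" triplet: the negative is closer than the positive, so the loss is positive.
hard = triplet_loss([0.0, 0.0], [1.0, 0.0], [0.0, 0.1])
```

The loss only becomes zero once the negative is at least `margin` farther from the anchor than the positive, which is what pushes matching and non-matching clips apart in feature space.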

Video Representation Learning – Frame-level Model

Once the video pairs have been generated, we use them as supervision to train the video representation with a frame-level triplet loss. To obtain the frame-level feature, a pretrained ResNet50 is used as the feature encoder, followed by a convolutional layer that reduces the number of channels of the feature map, and finally average pooling and L2 normalization.
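The part of the model after the backbone can be sketched with NumPy. This is only an illustration of the channel reduction, pooling and normalization steps: the random array stands in for a real ResNet50 feature map, the convolution is assumed to be 1×1 (the text only says it lowers the channel number), and the output size of 512 is a made-up value.

```python
import numpy as np

def frame_feature(feature_map, weight):
    """Turn a backbone feature map into one L2-normalized frame-level vector.

    feature_map: (C_in, H, W) array, e.g. the output of a CNN backbone.
    weight:      (C_out, C_in) weights of an assumed 1x1 convolution.
    """
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    reduced = np.einsum("oc,chw->ohw", weight, feature_map)
    # Global average pooling over the spatial dimensions.
    pooled = reduced.mean(axis=(1, 2))
    # L2 normalization (small epsilon guards against division by zero).
    return pooled / (np.linalg.norm(pooled) + 1e-12)

rng = np.random.default_rng(0)
fake_backbone_output = rng.standard_normal((2048, 7, 7))  # stand-in for ResNet50 features
reduce_weight = rng.standard_normal((512, 2048)) * 0.01   # hypothetical 2048 -> 512 reduction
feature = frame_feature(fake_backbone_output, reduce_weight)
```

Because every frame-level feature has unit length, distances between features depend only on their direction, which keeps the triplet margin comparable across frames.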

By minimizing the distance between features of the anchor clip frames and positive clip frames, as well as maximizing the distance between features of the anchor/positive clip frames and negative clip frames, video representation learning aims to capture spatial structure from individual frames while ignoring the effects of various transformations.

Clip-level Set Transformer Network

Because consecutive frames of the same clip show similar content, frame-level features are highly redundant, and their complementary information is not fully exploited. To address this, the frame-level model from self-supervised video representation learning extracts a series of frame-level features from a clip, which are then aggregated into a single clip-level feature x.

We present a modified Transformer, the clip-level set transformer network, to encode the clip-level feature. Instead of using a Transformer to encode the clip-level feature directly, we apply the set retrieval concept to the clip-level encoding. We use just one encoder layer with eight attention heads and no position embedding. This gives our SVRTN method the following capabilities:

  1. More robust: We increase the robustness of the learned clip-level features through invariance to frame permutation and to missing frames.
  2. More flexible: We support more retrieval modes, including clip-to-clip retrieval and frame-to-clip retrieval.
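To see why dropping the position embedding makes the encoder treat a clip as a set, consider the simplified NumPy sketch below. It uses a single attention head with a residual connection and mean pooling, whereas our network uses one full encoder layer with eight heads; the mean pooling, the weight shapes and the function names are assumptions made for the illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clip_feature(frames, wq, wk, wv):
    """Aggregate (num_frames, d) frame features into one (d,) clip feature.

    No position embedding is added, so the frames are treated as a set.
    """
    q, k, v = frames @ wq, frames @ wk, frames @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # frame-to-frame attention
    attended = frames + attn @ v                    # residual connection
    pooled = attended.mean(axis=0)                  # order-independent pooling
    return pooled / np.linalg.norm(pooled)

rng = np.random.default_rng(0)
d = 16
frames = rng.standard_normal((8, d))
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))

original = clip_feature(frames, wq, wk, wv)
shuffled = clip_feature(frames[rng.permutation(8)], wq, wk, wv)
single = clip_feature(frames[:1], wq, wk, wv)  # a single frame is also a valid input
```

Shuffling the frames permutes the rows of the attention output in the same way, and the mean over frames ignores row order, so `original` and `shuffled` agree up to floating-point error. The same function accepts any number of frames, which is what makes frame-to-clip retrieval possible.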

We treat the frames of one clip as a set and randomly mask some frames during clip-level encoding to improve the robustness of the learned clip-level features. Concretely, we drop some frames at random from a clip C to create a new clip C’. The purpose is to reduce the influence of frame blur or clip cuts, and to enable the model to retrieve the corresponding clip from any combination of its frames. The resulting clips are then used to calculate the triplet loss.
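The frame dropping itself is simple. The sketch below assumes an independent drop probability per frame and keeps at least one frame so the clip never becomes empty; both the probability of 0.3 and that safeguard are assumptions of this example.

```python
import random

def drop_frames(clip, drop_prob=0.3, rng=random):
    """Create C' by removing each frame of C independently with drop_prob."""
    kept = [frame for frame in clip if rng.random() >= drop_prob]
    if not kept:
        # Never return an empty clip; fall back to one random frame.
        kept = [rng.choice(clip)]
    return kept

clip = list(range(10))  # frame indices stand in for frame features
masked_clip = drop_frames(clip, rng=random.Random(0))
```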

Video Similarity Calculation

First, we perform shot boundary detection on each video to segment it into shots, and then divide the shots into clips at a fixed time interval, i.e. N seconds. Second, to generate the clip-level feature, the sequence of consecutive frames in each clip is passed through the clip-level set transformer network. Finally, IsoHash binarizes the clip-level feature to further reduce storage and search costs. At retrieval time, we use Hamming distance to measure clip-to-clip similarity.
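The step that divides shots into fixed-length clips can be sketched as follows, assuming the shot detector has already returned (start, end) times in seconds. Keeping a shorter final clip at the end of each shot is an assumption of this example, as is the function name.

```python
def split_shots_into_clips(shots, clip_seconds):
    """Cut each (start, end) shot into consecutive clips of clip_seconds.

    The last clip of a shot may be shorter than clip_seconds; clips never
    cross a shot boundary.
    """
    clips = []
    for start, end in shots:
        t = start
        while t < end:
            clips.append((t, min(t + clip_seconds, end)))
            t += clip_seconds
    return clips

shots = [(0.0, 5.0), (5.0, 8.0)]  # hypothetical shot boundaries in seconds
clips = split_shots_into_clips(shots, clip_seconds=2.0)
```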

Shots are extracted by shot boundary/transition detection using TransNetV2. The lift and projection version of IsoHash is used for the binarization.
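To illustrate how binary codes make the search cheap, the sketch below packs a feature vector into an integer and compares codes with the Hamming distance. Note that the binarization here is a plain sign threshold used as a stand-in, not IsoHash, and the clip names and feature values are made up.

```python
def binarize(feature):
    """Pack the signs of a real-valued feature into one integer code.

    Sign thresholding is a simple stand-in for IsoHash in this sketch.
    """
    code = 0
    for value in feature:
        code = (code << 1) | (1 if value > 0 else 0)
    return code

def hamming(a, b):
    """Number of bit positions in which two codes differ."""
    return bin(a ^ b).count("1")

def nearest_clip(query_code, database):
    """Return the name of the stored clip whose code is closest to the query."""
    return min(database, key=lambda name: hamming(query_code, database[name]))

database = {
    "clip_a": binarize([0.3, -0.1, 0.7, -0.5]),
    "clip_b": binarize([-0.2, 0.4, -0.6, 0.9]),
}
query = binarize([0.2, -0.3, 0.5, 0.1])  # differs from clip_a in the last bit only
match = nearest_clip(query, database)
```

Each comparison is a single XOR followed by a bit count, and each clip needs only as many bits as the feature has dimensions, which is where the storage and search savings come from.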


We used a variety of modifications to evaluate our model with videos of sports, news, animation and movies.

  • 53 Transformations:
    • Size : Crop
    • Time : Fast Forward
    • Quality : Black & White
    • Others : Reaction
  • Most efficient categories: fast, intro-outro, watermark, contrast, slow, B&W effect
  • Less efficient categories: extras, black-white, color yellow, frame insertion, color blue, resize
  • The model performs well and is tolerant under zoom/crop. There is no direct relation between these attributes and similarity but it seems that medium levels are the most efficient.
  • There seems to be a relation between the number of shots and similarity.
  • Reduced space and calculation cost

Useful Links

Future work: Video Similarity and Alignment Learning on Partial Video Copy Detection

Possible extension:

SVD Dataset: SVD – Short Video Dataset