DINOv2 PCA visualization code

DINOv2

Foundation models such as CLIP and DINO have garnered significant attention from researchers. To keep up with recent technological advances, I have studied DINOv2. Briefly, DINOv2 excels in extracting semantic features due to its training with a Self-Supervised Learning (SSL) method. This ability is evident in Figure 1 of the DINOv2 paper, which visualizes the first three components of PCA. In this figure, similar parts of objects are colored the same. For instance, the wings of an eagle and an airplane are green, while their heads are red. These results demonstrate DINOv2’s proficiency in extracting semantic features from images. I have implemented code to reproduce this PCA visualization.

PCA visualization

For my analysis, I used four images of different dog breeds (a Corgi, a Shiba Inu, and a Retriever), each in various poses and shapes.

dog1 dog2 dog3 dog4

However, the PCA values (RGB) indicated similar parts of the dogs. For example, feet are colored purple and heads are colored orange.

Result

results

These results confirm DINOv2’s capability to extract useful and semantic features from an image.


Code

For a more detailed view of the implementation, please visit my GitHub repository

code1 Firstly, various libraries such as ‘sklearn’ for PCA and ‘torch’ for DINOv2 are loaded.


code2 The input image size should be exactly divisible by 14.
DINOv2 can be loaded with torch.hub.
You can choose the DINOv2 version like ‘dinov2_vits14’.

DINOv2 version:

import torch

# DINOv2
dinov2_vits14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
dinov2_vitb14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
dinov2_vitl14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
dinov2_vitg14 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14')

# DINOv2 with registers
dinov2_vits14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14_reg')
dinov2_vitb14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14_reg')
dinov2_vitl14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14_reg')
dinov2_vitg14_reg = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitg14_reg')


code3 This PCA model is utilized for background removal.


code4 Masks for each image are created using the PCA model. The value 0.6 is determined empirically


code5 This PCA model is used for visualizing semantic features. The PCA model inputs foreground pixels x_norm_patchtokens[i,masks[i],:].


code6 This PCA model is also used for feature visualization. However, unlike the previous model, this one processes all pixels, including the background, x_norm_patchtokens.

The results can be shown below. results


For more detailed code, please click on this link. If you have any questions, feel free to contact me via email at jucha@unist.ac.kr.