On The Effects of Learning Views on Neural Representations in Self-Supervised Learning

Update: A small section of LaTeX in the post is not rendering correctly. Will fix it soon.

A blog post on Viewmaker Networks: Learning Views for Unsupervised Representation Learning. ICLR, 2021

About this Blog Post:

Contrastive self-supervised learning (SSL) methods require domain expertise to select a set of data transformations that work well on a specific downstream task, modality, and domain. This creates a bottleneck for the utilization of SSL strategies on new datasets, particularly those sampled from different domains. This blog post details Viewmaker Networks, a method that attempts to overcome this issue by using constrained perturbations and an adversarial training objective to synthesize novel views. This method can be broadly extended to many domains, as it learns views directly from the training data. We cover the details of the methodology and related concepts as they apply to contrastive visual representation learning. This blog post also develops further insights into the visual representations learned using the Viewmaker’s augmented views by conducting additional experiments. Specifically, we investigate dimensional collapse and the representational similarity between differently trained models (handcrafted views and Viewmaker views). We observe that models trained using Viewmaker views not only make better use of the embedding space but also learn a different function of the data when compared to models trained using handcrafted views, something one should carefully consider when choosing one over the other. We feel these observations will encourage further discussions on learned views in self-supervised learning.

Fig.1: The Viewmaker Network Setup

The amount of data available to us is exploding. Despite the availability of such large amounts of data, we are currently unable to exploit it because the majority of existing deep learning research relies on annotated data. We require a learning method which can use training data with few or no labels and yet perform well when identifying patterns in unseen data. This learning paradigm, which doesn’t assume the availability of labelled training data, is called Unsupervised Learning. We specifically explore a subset of unsupervised methods which utilize pretext tasks, i.e. training objectives that exploit the structure in the data itself. This subset of Unsupervised Learning is called Self-Supervised Learning (SSL).

Self-supervised learning (SSL) requires a human expert to design a pretext task. Recently, contrastive learning has emerged as an effective and popular pretext task. It works by extracting a weak supervisory signal from carefully chosen data transformations, typically specified by a domain expert. This creates a barrier to the use of such models by non-domain experts, as well as issues with usability across various domains and modalities. Viewmaker Network attempts to remove this barrier by proposing a network which is domain agnostic and yet generates useful learned views (i.e. augmented images) which can be used for model training and show significant success on a variety of classification tasks across various datasets and modalities. This blog post also develops further insights into the visual representations learned using the Viewmaker’s augmented views by conducting additional experiments. Specifically, we investigate dimensional collapse and the representational similarity between differently trained models (handcrafted views and Viewmaker views). We observe that models trained using Viewmaker views not only make better use of the embedding space but also learn a different function of the data when compared to models trained using handcrafted views, something one should carefully consider when choosing one over the other. We feel these observations will encourage further discussions on learned views in self-supervised learning.

1. Representation Learning

The task of self-supervised representation learning requires discovering ways to learn abstract representations from the data, which capture relevant high-level properties, without any explicit annotations. Recently, representation learning has seen extensive successful usage in deep learning, with the goal of finding more abstract, and ultimately more useful, representations.

For visual representations to be useful, they must satisfy a set of criteria. Firstly, they must encode useful semantic properties that explain the image contents. Secondly, the learned representations should be useful for performing downstream image analysis tasks (such as classification). And thirdly, they must be robust to various changes to the image, such as natural variations in lighting or viewpoint.

There are multiple strategies for learning visual representations. When learning visual representations in a supervised manner, the resulting representations are often specific to the supervised task and annotations. This is undesirable, as these representations may not generalize to different tasks, i.e. they are not task and label agnostic. One strategy for mitigating these issues and learning better visual representations is to shift to a self-supervised learning approach. SSL approaches do not rely on annotations and are agnostic to downstream tasks, and so these representations might be more general.


2.1 What Are Views?

Fig.2: Different views of a Labrador Retriever. The leftmost images are natural views sampled by taking different photographs of a Labrador Retriever. The rightmost images are synthetic views sampled by modifying a single natural view.

Many natural images contain visual representations of semantically distinct objects such as animals, plants, furniture, etc. Given an image containing an instance of some object class, this image is only one potential view of that object. For example, in Figure 2, there are two unique images of a Labrador Retriever photographed at different times and from different locations. These are examples of natural views of an object, which are essentially just the set of all possible photographs of that object as it might exist in the real world. Figure 2 also portrays two synthetic views of the Labrador Retriever. These are synthetically augmented images, which modify the image while still retaining the relevant semantic information. Similar to natural views, these are valid visual representations of the object. However, they are derived by modifying a single image from the set of natural views using various augmentation strategies such as geometric transformations of an image or color modification.

Fig.3: SimCLR data augmentation options. The authors included geometric transformations like cropping, rotation, resizing, and flipping; and color transformations like blurring, and color jitter.

2.2 Which Views Are Useful?

Many previous works have achieved significantly improved model performance by proposing a variety of image augmentation strategies. Historically, it was the development of the AlexNet architecture (ImageNet Classification with Deep Convolutional Neural Networks) which popularized the use of augmentations in deep learning. The transformations they considered involved randomly cropping 224×224 patches from the original images, flipping them horizontally, and changing the intensity of the RGB channels using PCA color augmentation. Since then, various novel augmentation strategies have demonstrated effectiveness in various domains. Random Erasing (Random Erasing Data Augmentation) is an augmentation strategy that attempts to mimic the natural object occlusion which occurs in many image modalities. This strategy, demonstrated in Fig. 4, involves randomly selecting a rectangular image region and replacing the pixel values with random values or a single pixel value. Related to this approach is CutOut (Improved Regularization of Convolutional Neural Networks with Cutout), a procedure that involves randomly masking out square regions of the input during training. The authors find that CutOut improves the robustness and overall performance of convolutional neural networks. Another augmentation strategy, known as Mixup (mixup: Beyond Empirical Risk Minimization), involves augmenting both the input image and the image label. Mixup proceeds by sampling two images from the dataset and interpolating between them by some factor. Then, the one-hot encoded image labels are interpolated by the same factor. Yun et al. proposed CutMix (CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features), an augmentation strategy that combines CutOut and Mixup. CutMix proceeds by applying CutOut to an input image and then replacing the cutout region with a cropped section from another image. Then, the label is augmented such that it represents the percentage of the image occupied by each object class. DAGAN (Data Augmentation Generative Adversarial Network) utilizes an adversarial setup to learn image augmentations, and the authors showed that adversarial training can be used to expand the pool of available data and even apply novel augmentations to unseen classes of data (e.g. few-shot learning). The field of optimizing augmentation strategies is constantly expanding.

Fig.4: Random Erasing Augmentation Strategy (columns: Black, White, Random)
Fig.5: (left to right) Mixup, Cutout and CutMix data augmentations
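
To make the Mixup and CutMix descriptions above concrete, here is a minimal NumPy sketch. The function names, the fixed patch location, and the toy 4×4 images are our own illustrative assumptions and are not taken from the original papers' code (which, for example, sample the mixing factor and patch position randomly).

```python
import numpy as np

def mixup(x1, y1, x2, y2, lam):
    """Interpolate two images and their one-hot labels by the same factor lam."""
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

def cutmix(x1, y1, x2, y2, top, left, height, width):
    """Paste a rectangular patch of x2 into x1 and mix the labels by area."""
    x = x1.copy()
    x[top:top + height, left:left + width] = x2[top:top + height, left:left + width]
    # Fraction of the output image that still comes from x1
    lam = 1.0 - (height * width) / (x1.shape[0] * x1.shape[1])
    y = lam * y1 + (1.0 - lam) * y2
    return x, y

# Toy data (hypothetical): two 4x4 single-channel images, two classes
img_a, img_b = np.zeros((4, 4)), np.ones((4, 4))
lab_a, lab_b = np.array([1.0, 0.0]), np.array([0.0, 1.0])

mixed_img, mixed_lab = mixup(img_a, lab_a, img_b, lab_b, lam=0.75)
cut_img, cut_lab = cutmix(img_a, lab_a, img_b, lab_b, top=0, left=0, height=2, width=2)
```

In both cases the label becomes a soft distribution over classes: for Mixup it follows the interpolation factor, and for CutMix it follows the fraction of pixels contributed by each source image.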

The intuition behind using many of the common image transformations is that these augmented views attempt to simulate samples from the true underlying data distribution. This improves the diversity of the available data and can help improve model convergence. This rationale, however, does not explain why unrealistic distortions such as cutout (Improved Regularization of Convolutional Neural Networks with Cutout) and mixup (mixup: Beyond Empirical Risk Minimization) significantly improve generalization. Furthermore, many augmentation strategies do not always transfer across datasets. Cutout, for example, improves performance on CIFAR-10 but not on ImageNet (Improving Robustness Without Sacrificing Accuracy with Patch Gaussian Augmentation). There are also no standard metrics that quantify whether an augmentation strategy is good and how well it generalizes, because these properties are model-dependent. Viewmaker Network aims to remove this dependency on handcrafted views by learning image transformations from the data.

3. Where Does Viewmaker Network Fit In?

Recently, contrastive self-supervised representation learning methods have explored training models which are invariant to different “views,” or augmented versions of an input. However, designing these views requires considerable trial and error by human experts, hindering their widespread adoption across domains and modalities. This leads to the question: how can we better transform images with minimal human expertise?

Images can be transformed in a vast number of ways. However, only a subset of these transformations provide a useful training signal which can be exploited by contrastive methods. Within this subset of informative transformations, only a few are ever selected by experts. As seen in Fig. 6, there is a vast untapped space of image transformations which can be used for contrastive methods. The Viewmaker Network attempts to explore this transformation space, denoted as B, using an adversarial setup without any input from a human expert.

Fig.6: Possible sets of image transformations where A = The set of all image transformations; B = the set of all transformations that provide a useful training signal for contrastive methods; C = the set of handcrafted transformations.

4. Learning to Generate Views

Using neural networks to learn image transformations for generating augmented views is not a simple task, as we have no annotated training data that explicitly indicates whether or not a network-augmented image is within the set of valid views. An alternative strategy is to utilize an adversarial setup such that one model can guide the other into learning transformations that produce valid views. In an adversarial training setup, we have two neural networks: the generator $G$ and the discriminator $D$. The discriminator is typically tasked with receiving inputs from a real distribution and a synthetic distribution, and then classifying which distribution each input originated from. The generator is typically optimized to generate an input that can fool the discriminator into believing the input originates from the real distribution. More formally, the discriminator is optimized to maximize the following objective function:

$$\mathbb{E}_x[\texttt{log}D(x)] + \mathbb{E}_z[\texttt{log}(1-D(G(z)))]$$

And the generator is optimized to minimize the following objective function:

$$\mathbb{E}_z[\texttt{log}(1-D(G(z)))]$$

where $x$ is an input sampled from the real distribution, $z$ is a stochastic input that the generator modifies to produce samples that match the real distribution, $D$ is the discriminator and $G$ is the generator.

The adversarial setup used for training the Viewmaker Networks differs from traditional adversarial training setups in a few distinct ways. First, there is no real distribution. Second, the discriminator is optimized using an instance discrimination objective, such as InfoNCE (Representation Learning with Contrastive Predictive Coding), which involves the task of re-identifying a transformed version of some instance of the image $x_i$ within a set of many other different images. Third, the generator is tasked with modifying an image as follows:

$$x_{view} = G(x_i) + x_i$$

such that the new image $x_{view}$ maximizes the instance discrimination loss. This setup encourages $D$ to learn representations of images that allow it to re-identify modified images, and it encourages $G$ to learn transformations of $x_i$ such that it becomes more difficult to distinguish from unrelated images. However, this method has a major failure case: if the generator is allowed to arbitrarily modify $x_i$, then it can learn a solution where it simply destroys all relevant semantic information in $x_i$. Thus, this method requires that a constraint is applied to the transformations learned by $G$ such that it cannot converge to this undesired solution.

Fig.6. Sample Views from Viewmaker Network at 200th epoch
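
As a rough illustration of the instance discrimination objective mentioned above, the following NumPy sketch computes a simplified, one-directional InfoNCE loss over a batch of embeddings. The function name, the toy embeddings, and the noise scales are hypothetical; this is not the exact loss used in SimCLR or in the Viewmaker code.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """Simplified InfoNCE: row i of `positives` is the positive for row i of
    `anchors`; all other rows in the batch act as negatives."""
    # L2-normalize so that dot products are cosine similarities
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (batch, batch) similarities
    logits -= logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Average negative log-probability of picking the matching instance
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(8, 16))
# A "faithful" view stays close to its source; an unrelated view does not
faithful_views = embeddings + 0.01 * rng.normal(size=(8, 16))
unrelated_views = rng.normal(size=(8, 16))

loss_faithful = info_nce_loss(embeddings, faithful_views)
loss_unrelated = info_nce_loss(embeddings, unrelated_views)
```

The loss is small when each view is easy to re-identify and grows when views look like unrelated images, which is why an unconstrained generator that maximizes it would be tempted to destroy the input entirely.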

5. Making Novel Views with Adversarial Perturbations

The method proposed by the authors is inspired by the adversarial setup, in particular by the use of $l_p$ norms for adversarial robustness. But what is adversarial robustness and how is it related to samples of our data?

Szegedy et al. and Biggio et al. showed that specifically crafted small perturbations of benign inputs can lead machine-learning models to misclassify them. These perturbed inputs are referred to as adversarial samples. Given a machine-learning model, a sample $\hat{x}$ is considered an adversarial example if it is similar to a benign sample $x$ (drawn from the data distribution), such that $x$ is correctly classified and $\hat{x}$ is classified differently than $x$. (On the Suitability of Lp-norms for Creating and Preventing Adversarial Examples)

5.1 $l_p$ Constrained Perturbations

To explain the $l_p$ constrained perturbations proposed by the authors, we first need to define the $l_p$ norm.

For a real number $p \geq 1$, we can define the $l_p$ norm as $||\textbf{w}||_p = (\sum_{i=1}^{n}|w_i|^p)^{1/p}$. Note that we use bold to denote vectors. Based on the value of $p$, we get the $l_1$, $l_2$, ..., $l_\infty$ norms.
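
As a quick numerical check of the definition above, here is a small NumPy sketch. The helper name `lp_norm` and the example vector are ours, chosen only for illustration.

```python
import numpy as np

def lp_norm(w, p):
    """l_p norm of a vector for a real p >= 1: (sum_i |w_i|^p)^(1/p)."""
    return float(np.sum(np.abs(w) ** p) ** (1.0 / p))

w = np.array([3.0, -4.0, 0.0])
l1 = lp_norm(w, 1)   # |3| + |-4| + |0|
l2 = lp_norm(w, 2)   # sqrt(3^2 + 4^2 + 0^2)
```

For the same vector, the $l_1$ norm is never smaller than the $l_2$ norm, so fixing the same budget under $l_1$ is the tighter of the two constraints.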

But how do $l_p$ constraints generate perturbations which are helpful for the Viewmaker network? To explain this, we will refer to Fig.7 (pseudo code for the Viewmaker network) and Fig.8, which is its representation in an (assumed) 3D space.

Fig.7: Pseudo Code for Viewmaker Network
Fig.8: Representation of adding perturbation to input data in a sample 3D space

Let us assume a 3D space with its origin at point 4 (see Fig.8 above). We are working with 3D data points in this example scenario. Let 5 be an input data point in this space and 1 be a generated perturbation in this space. If we were to add points 1 and 5 to perturb the input, there is a very high chance that in such a blind addition, the features of the perturbation would completely overshadow the features of the input data point. To avoid this, we need a constraint on the perturbation before adding it to the input.

If we constrain 1 using the $l_p$ norm, the perturbation is confined to a bounded region around the origin.

By definition, the $l_1$ norm is $||\textbf{w}||_1 = \sum_{i=1}^{n}|w_i|$, and hence it can be represented by the greenish pyramid structures centred at points 4 and 5 respectively. The $l_2$ norm satisfies $||\textbf{w}||_2^2 = \sum_{i=1}^{n}w_i^2$ and can be represented by the pink and yellow spheres centred around points 4 and 5 respectively.

To explain how the Viewmaker Network constrains a perturbation to create an image transformation, we will, for simplicity, assume our input data lies at point 5 in some 3D space. The points in Fig.8 are described below, followed by their use in the Viewmaker network.

  • Point 1: Unconstrained perturbation
  • Point 2: Perturbation constrained by $l_2$ norm
  • Point 3: Perturbation constrained by $l_1$ norm
  • Point 4: Origin
  • Point 5: Input data point
  • Point 6: Input data point perturbed by an $l_1$ constrained perturbation
  • Point 7: Input data point perturbed by an $l_2$ constrained perturbation

We denote a random unconstrained perturbation by point 1. From our knowledge of $l_p$ norms and the pseudo code shown in Fig. 7, we know that these perturbations can be constrained by $l_1$ or $l_2$ norms, i.e. kept within the $l_1$ ball (green pyramid structure around the origin) or the $l_2$ ball (pink sphere around the origin) respectively, as shown in Fig. 8. These regions represent (intuitively) the space to which we constrain the perturbations we want to add to the original input data (images). Hence, in Viewmaker Network, when a perturbation is constrained by an $l_2$ norm, it intuitively implies that the perturbation cannot lie beyond the pink sphere.

From Step 3 of the pseudo code, we also see that perturbations are added to the original images, and the result is eventually used as an input to the encoder, i.e.

$$ X = X + P $$

where $X$ is the original input data and $P$ is the perturbation constrained by the $l_1$ or $l_2$ norm. In Fig. 8, this is computed using simple vector additions. Hence, $P_5$ + $P_3$ = $P_6$ and $P_5$ + $P_2$ = $P_7$, where $P_5$ is the original input data, $P_3$ and $P_2$ are $P_1$ (the perturbation) constrained by the $l_1$ and $l_2$ norm respectively, and $P_6$ and $P_7$ are the resultant adversarially perturbed representations of the original data $P_5$ constrained by the $l_p$ norm.

So far we have covered perturbations constrained by $l_p$ norms, but how do we know they would be useful? To ensure that perturbations are useful, there are three things to keep in mind:

  • Challenging: The perturbations should be complex and strong enough so that the encoder learns useful representations.
  • Faithful: Perturbations shouldn’t make the encoder task impossible by destroying all features.
  • Stochastic: The method should generate a variety of perturbations. This is achieved by injecting random noise into the Viewmaker (so that the model learns a stochastic function that produces a different perturbation each time).

5.2. Visual Representation of Pseudo Code

Figure 9 shows how the respective steps of the pseudo code align with the block diagram of the Viewmaker network as a whole. First, an input is sampled and a random noise/perturbation is generated. This random perturbation is then scaled according to an $l_p$ norm to constrain it. It is important to constrain these perturbations so that they do not destroy the input data when added to the original image. Finally, this output is clamped so that each pixel has a value between 0 and 1 for image representation.

Fig.9: Pseudo Code: Generate perturbations
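
The steps above can be sketched in a few lines of NumPy. This is only an illustration under our own assumptions: random noise stands in for the output of the Viewmaker network, the function name and the `budget` parameter are hypothetical, and the exact scaling rule used by the authors may differ.

```python
import numpy as np

def constrained_view(image, raw_perturbation, budget, p=1):
    """Rescale a raw perturbation to a fixed l_p norm, add it to the image,
    and clamp the result to the valid pixel range [0, 1]."""
    norm = np.sum(np.abs(raw_perturbation) ** p) ** (1.0 / p)
    # Rescale so the perturbation's l_p norm equals `budget`
    # (the max() guards against division by a zero norm)
    delta = budget * raw_perturbation / max(norm, 1e-12)
    view = np.clip(image + delta, 0.0, 1.0)
    return view, delta

rng = np.random.default_rng(0)
image = rng.uniform(0.0, 1.0, size=(3, 8, 8))   # toy 3-channel 8x8 image
raw = rng.normal(size=image.shape)              # stand-in for the Viewmaker output
view, delta = constrained_view(image, raw, budget=5.0, p=1)
```

Because the perturbation is rescaled before it is added, its total strength is fixed by `budget` no matter how large the raw output is, which is what keeps the view faithful to the input.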

7. Intro to Contrastive Learning

Contrastive self-supervised visual representation learning algorithms learn representations by maximizing the agreement between different views of the same image and minimizing the agreement between views of different images via a contrastive loss in the latent space. SimCLR and InstDisc are two popular contrastive learning algorithms.

One of the major components of contrastive learning frameworks is the data augmentation module, which generates the various views of images. Traditionally, this module generates views from a predefined pool of augmentations which are designed by humans. Viewmaker Networks ([1]) proposes an alternate data augmentation strategy which can be used for contrastive learning algorithms.

Fig.14: Contrastive Learning (representational image)

8. Viewmaker Architecture

The Viewmaker training setup requires the use of two neural networks. The first network is the Encoder network, which is the network that learns the image representations and solves the self-supervised task. For the experiments within this work, ResNet18 is used as the default architecture for the encoder. The second network is the Viewmaker network, which modifies input images before sending them to the Encoder. The Viewmaker architecture is based on a network established in previous works in the style transfer literature. The network is fully convolutional, contains 5 residual blocks, and includes no pooling. Uniform random noise is concatenated to the input and activations before each residual block.

During training, the Encoder and Viewmaker networks are trained back-and-forth, similar to other adversarial approaches. The Viewmaker network requires the experimenter to select a self-supervised learning strategy, which can be arbitrarily selected from a variety of available self-supervised methods. The authors opt to use the SimCLR training objective and the InstDisc training objective.

The following hyperparams were selected for each of the training objectives:
  1. SimCLR Specific Hyperparams
    • temperature = 0.07
  2. InstDisc
    • negatives = 4096
    • update rate = 0.5
  3. Other Hyperparams:
    • optimizer = SGD
    • Batch size = 256
    • Learning rate = 0.03
    • Momentum = 0.9
    • Weight decay = 1e-4
    • Epochs = 200


To evaluate the resultant encoder models, the authors use linear evaluation. Linear evaluation involves first removing the final layer of the encoder and freezing the remaining parameters, and then training a logistic regression model using the frozen features. The goal of this strategy is to determine if the learned representations are linearly separable with respect to the true underlying classes, despite the fact that the encoder is not trained using those class labels. The experimental setup for training the logistic regression model is selected as follows:

Hyperparams for training the logistic regression model:
  1. Optimizer = SGD
  2. Learning rate = 0.01
  3. Momentum = 0.9
  4. Weight decay = 0
  5. Batch size = 128
  6. Epochs = 100
  7. Learning rate decay schedule:
    • epoch 60, decay by factor of 10
    • epoch 80, decay by factor of 10
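
The learning rate decay schedule listed above can be written as a small step function. This is a sketch under our own assumptions: the function name is hypothetical, epochs are counted from 0, and the decay is applied from the start of each listed epoch.

```python
def linear_eval_lr(epoch, base_lr=0.01, milestones=(60, 80), factor=10.0):
    """Step decay: divide the learning rate by `factor` at each milestone epoch."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr /= factor
    return lr

# Learning rate at the first epoch, after the first decay, and after the second decay
lr_start, lr_mid, lr_end = linear_eval_lr(0), linear_eval_lr(60), linear_eval_lr(99)
```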

Comparative Results

The authors compare their results against the SimCLR and InstDisc training objectives, which use expertly selected image augmentations.

Fig.10: SimCLR expert views transfer performance as compared to viewmaker learned views on CIFAR-10

8. Experiments

8.1 Collapse Problem in Self-Supervised Learning and


Since the beginning of the field of self-supervised visual representation learning, researchers have proposed various methods to overcome the problem of representational collapse (also known as complete collapse) i.e. the problem where all embedding vectors collapse to a trivial constant vector. Contrastive learning methods like SimCLR solve this problem by incorporating negative pairs of images.

Although complete collapse is avoided in contrastive learning methods (via the use of positive and negative pairs), a different form of collapse occurs and is often overlooked: dimensional collapse ([3]). In dimensional collapse, the embedding vectors, rather than occupying the entire available embedding space, occupy only a lower-dimensional manifold within it.

Fig.9 (a) Complete collapse and (b) Dimensional collapse

Implementation and Experiments

To observe and compare the dimensional collapse occurring in the SimCLR pre-training setup while using expert views and Viewmaker views, we followed the procedure proposed in [3].

We trained two ResNet-18 models, one with expert views and the other with Viewmaker views, on CIFAR-10, following the standard SimCLR training procedure given in [1]. After training, we passed the test set through both of the pretrained ResNet-18 models and obtained two sets of embedding vectors.

Let us denote the two embedding matrices as:

  • $\mathbf{Z}_e$ : Embedding matrix obtained from the expert views pretrained ResNet-18
  • $\mathbf{Z}_v$ : Embedding matrix obtained from the Viewmaker views pretrained ResNet-18

$$\mathbf{Z}_e , \mathbf{Z}_v \in \mathbb{R} ^{10000 \times 128}$$ [10000 is the number of images in the test set of CIFAR-10 and 128 is the output embedding dimension of ResNet-18]

We then compute the covariance matrix of the embedding layers of both the networks using the following formula:

$$\mathbf{C} = \frac{1}{N} \sum_{i=1}^{N} (\mathbf{z}_i - \mathbf{\bar z})^T(\mathbf{z}_i - \mathbf{\bar z})$$


where:

  • $\mathbf{C}$ : Covariance matrix
  • $N$ : Number of images in the test set of CIFAR-10 (10000 images)
  • $\mathbf{z}_i$ : $i^{th}$ row vector of $\mathbf{Z}$ ($\mathbf{Z} \in \{\mathbf{Z}_e, \mathbf{Z}_v\}$)
  • $\mathbf{\bar z} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{z}_i$ : Mean embedding vector

Thus, we obtain:

  • $\mathbf{C}_e$ : Covariance matrix obtained from the embedding layer of the expert views pretrained ResNet-18
  • $\mathbf{C}_v$ : Covariance matrix obtained from the embedding layer of the Viewmaker views pretrained ResNet-18

$$\mathbf{C}_e, \mathbf{C}_v \in \mathbb{R} ^{128 \times 128}$$

We then perform the SVD (Singular Value Decomposition) on both $\mathbf{C}_e$ and $\mathbf{C}_v$:

$$\mathbf{C} = \mathbf{U} \mathbf{S} \mathbf{V}^T, \mathbf{S} = diag(\sigma^k)$$

We can then plot the singular values thus obtained from both $\mathbf{C}_e$ and $\mathbf{C}_v$ in sorted (descending) order and on a logarithmic scale to obtain the following plot:

Fig.17: Singular value spectrum of the embedding spaces


The figure shows that a large number of the singular values of both covariance matrices collapse to zero, which represents the collapsed dimensions in the respective embedding spaces. $\mathbf{S}_e$ collapses to 24 dimensions (out of the available 128) and $\mathbf{S}_v$ collapses to 28 dimensions (out of the available 128).
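
The covariance-and-SVD procedure described in this section can be sketched as follows. Since we cannot reproduce the trained encoders here, the embeddings are a synthetic stand-in that lies in a low-dimensional subspace by construction; the function name, the sizes, and the threshold used to count non-collapsed dimensions are all our own assumptions.

```python
import numpy as np

def singular_value_spectrum(Z):
    """Singular values (descending) of the covariance of embeddings Z of shape (N, d)."""
    Zc = Z - Z.mean(axis=0, keepdims=True)       # subtract the mean embedding
    C = Zc.T @ Zc / Z.shape[0]                   # (d, d) covariance matrix
    return np.linalg.svd(C, compute_uv=False)    # returned in descending order

# Synthetic stand-in: 128-d embeddings that only span a 24-d subspace
rng = np.random.default_rng(0)
N, d, r = 1000, 128, 24
Z = rng.normal(size=(N, r)) @ rng.normal(size=(r, d))

sigma = singular_value_spectrum(Z)
log_sigma = np.log(sigma + 1e-12)                # values one would plot on a log scale
# Count dimensions whose singular value is not negligible (threshold is an assumption)
surviving_dims = int(np.sum(sigma > 1e-8 * sigma[0]))
```

With real embeddings, `Z` would be replaced by $\mathbf{Z}_e$ or $\mathbf{Z}_v$, and the sharp drop in `log_sigma` marks where the embedding space stops being used.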

[3] shows that one mechanism which causes dimensional collapse is strong augmentation. Although both methods result in dimensional collapse, the Viewmaker views pretrained ResNet-18 had marginally fewer collapsed embedding dimensions. We believe this can be attributed to the constrained perturbations resulting in comparably less strong augmentations.

It is known that neural networks that generalize better to the test set are those that transform the data into low-dimensional manifolds. There is a proven tradeoff between generalization and class separability of representations with respect to the dimensionality of the representations. We believe that the marginally smaller number of collapsed dimensions in the embedding layer of the Viewmaker views pretrained ResNet-18 helps improve the class separability of the representations but hurts its generalization capability due to the above-mentioned tradeoff, which is evident from the marginally lower accuracy of the Viewmaker views pretrained ResNet-18 on transfer tasks (see the section on Viewmaker architecture).

8.2 Do Expert Views and Viewmaker Trained Networks Learn Similar Representations?


Contrastive self-supervised methods, such as those trained using “Expert Views”, exploit the fact that image semantics are invariant under a variety of image transformations. Given a set of transformations which are known to preserve image semantics, networks trained using this method can develop the weak underlying signal within a dataset and learn many relevant features for solving downstream image tasks. Networks trained with “Expert Views” typically use the following set of image transformations:

A = {Crop, Cutout, Noise, Rotation, Flipping, Resizing, Color Distortion, and Sobel Filtering}

In contrast to this, the Viewmaker method attempts to learn these image transformations. This creates an interesting situation, as both methods are attempting to learn features corresponding to image properties which are unchanged under various transformations. However, these methods rely on potentially disjoint sets of transformations. This situation naturally leads to the following question – do these views lead to models that learn similar representations of the data?

Developing similarity metrics for determining the similarity between models is a challenging problem. Recently, Centered Kernel Alignment (CKA) [^4] has emerged as a popular strategy for measuring the similarity between network architectures in the representation space. We use this metric to explore the question of whether or not Viewmaker and Expert Views trained models learn similar representations. Below, we develop the mathematical intuition behind this metric and then highlight a set of experiments which we conducted to answer our question.


CKA measures the similarity between two representations, $\mathbf{X} \in \mathbb{R}^{m\times p_1}$ and $\mathbf{Y} \in \mathbb{R}^{m\times p_2}$, where $m$ is the number of examples, $p_1$ is the dimensionality of the representations in $\mathbf{X}$ and $p_2$ is the dimensionality of the representations in $\mathbf{Y}$. This metric is bounded between 0 and 1, with 0 representing maximum dissimilarity and 1 representing maximum similarity. CKA satisfies two important properties:

  1. Invariance to Orthogonal Transformations
  2. Invariance to Isotropic Scaling: i.e., $s(\mathbf{X}, \mathbf{Y}) = s(\alpha\mathbf{X}, \beta\mathbf{Y})$

CKA itself is dependent on the Hilbert-Schmidt Independence Criterion (HSIC), which only satisfies the first property. HSIC is calculated as follows:

$$\texttt{HSIC}(\mathbf{K},\mathbf{L}) = \frac{1}{(m-1)^2}\texttt{tr}(\mathbf{K}\mathbf{H}\mathbf{L}\mathbf{H})$$

Where $\mathbf{K}, \mathbf{L} \in \mathbb{R}^{m \times m}$ are Gram matrices with entries $\mathbf{K}_{ij} = k(\mathbf{X}_i, \mathbf{X}_j)$ and $\mathbf{L}_{ij} = k(\mathbf{Y}_i, \mathbf{Y}_j)$ generated by a kernel function $k$, and $\mathbf{H} = \mathbf{I}_m - \frac{1}{m}\mathbf{J}_m$ is a centering matrix ($\mathbf{J}_m$ is the $m \times m$ matrix of ones). HSIC can be made invariant to isotropic scaling by normalizing the metric, which leads to the following definition for CKA:

$$\texttt{CKA}(\mathbf{K},\mathbf{L}) = \frac{\texttt{HSIC}(\mathbf{K},\mathbf{L})}{\sqrt{\texttt{HSIC}(\mathbf{K},\mathbf{K})\texttt{HSIC}(\mathbf{L},\mathbf{L})}}$$

For our experiments, we select the Radial Basis Function for our kernel function $k$, which is defined as: $$k(\mathbf{X}_i, \mathbf{X}_j) = \texttt{exp}(-\frac{||\mathbf{X}_i - \mathbf{X}_j ||^2}{2\sigma^2})$$
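
The HSIC, CKA, and RBF kernel definitions above translate fairly directly into NumPy. The sketch below is a minimal illustration rather than the code used for our experiments: the function names, the toy data, and the fixed bandwidth `sigma` are assumptions made for this example.

```python
import numpy as np

def rbf_gram(X, sigma):
    """Gram matrix K_ij = exp(-||X_i - X_j||^2 / (2 sigma^2)) for rows of X."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    # Clip tiny negative values caused by floating point round-off
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

def hsic(K, L):
    """HSIC(K, L) = tr(K H L H) / (m - 1)^2 with centering matrix H."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

def cka(K, L):
    """CKA: HSIC normalized by the self-similarities of K and L."""
    return hsic(K, L) / np.sqrt(hsic(K, K) * hsic(L, L))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 16))                     # toy representations, m = 50
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))    # random orthogonal matrix
Y = X @ Q                                         # rotated copy of X
Z = rng.normal(size=(50, 16))                     # unrelated representations

cka_rotated = cka(rbf_gram(X, 4.0), rbf_gram(Y, 4.0))
cka_unrelated = cka(rbf_gram(X, 4.0), rbf_gram(Z, 4.0))
```

Because an orthogonal transformation preserves pairwise distances, the rotated copy produces the same Gram matrix and a CKA of 1 (up to floating point error), which illustrates the first invariance property.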


Fig.17: CKA heatmaps comparing layers within and across the Viewmaker and Expert Views trained models

To determine the relationship between the functions learned by Viewmaker and Expert Views, we calculate the CKA between 6 layers at increasing network depths across both models and within each model. Figure 17 highlights the three CKA heatmaps for the following comparisons:

  1. Viewmaker layers vs. Viewmaker layers
  2. Expert View layers vs. Expert View layers
  3. Viewmaker layers vs. Expert View layers

For every layer considered, we calculate the CKA with respect to every other layer (including the layer being considered). In the CKA heatmaps shown above, the shallowest layers are represented in the bottom left corner of the plot, and the deepest layers are represented in the top right corner of the plot.


A few interesting relationships can be observed from these plots. First, we find that Viewmaker and Expert Views learn similar features across the shallower layers but learn considerably different features as the depth of the network increases. Second, the shallowest layers of Expert Views have a relatively high similarity with many of the deeper layers of Viewmaker. These two points suggest that Viewmaker networks learn a different function of the data when compared to Expert Views, which should be carefully considered when deciding to train with Viewmaker views over Expert views.

9. Conclusion

In this blog post we explore Viewmaker Networks, a strategy for learning image transformations which can be used by non-domain experts. We make observations about the learned representations, exploring whether there is dimensional collapse in the embedding space and how the representations compare to those of other strategies for selecting views.

Image Credits:
  1. [Fig.13](https://arxiv.org/abs/2110.09348)


Acknowledgment: This work was done with Adrian and Vaisakh as part of our collaboration at Machine Learning Collective. Huge thanks to Akshit for allowing me to use his GPU to run these experiments.

Srishti Yadav
ML Researcher

My research interests include applying computationally intensive machine learning algorithms to image- or text-based data.