## PointNet and SENet

I came across two network architectures that I think can be discussed together, since they share a similar core idea: extract and leverage some global information from the data with global pooling.

## PointNet

The objective of PointNet is to classify each point of a point cloud and to identify the object described by the point cloud. Consequently, each point can belong to a different semantic class, but all points share the same object class.

One important property of a point cloud is that its points are not in any particular order. Therefore, it does not make much sense to train a convolutional layer or a fully connected layer to mix the points. The simplest reasonable operation to combine all the information is just pooling (max or average). In PointNet, each point is first processed individually, and then a global feature is generated using max pooling. The object described by the point cloud is then classified from this global feature.
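A minimal NumPy sketch of this idea, under toy assumptions: the two-layer shared MLP, its random weights, and the layer sizes are illustrative and not PointNet's actual configuration. It only shows that a per-point network followed by max pooling gives the same global feature regardless of point order.

```python
import numpy as np

def pointnet_global_feature(points, w1, w2):
    """Shared per-point MLP followed by a max pool over the point axis."""
    hidden = np.maximum(points @ w1, 0.0)      # (N, 16): same weights for every point
    per_point = np.maximum(hidden @ w2, 0.0)   # (N, 32): per-point features
    return per_point.max(axis=0)               # (32,): order-independent global feature

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))             # toy cloud of 128 xyz points
w1 = rng.normal(size=(3, 16))                  # hypothetical, untrained weights
w2 = rng.normal(size=(16, 32))

feature = pointnet_global_feature(points, w1, w2)
shuffled = pointnet_global_feature(points[rng.permutation(128)], w1, w2)
print(np.allclose(feature, shuffled))          # shuffling the points changes nothing
```

Replacing `max` with `mean` keeps the same invariance; any operation that mixes points by position (for example a fully connected layer over the flattened cloud) would break it.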

The input transform and feature transform in the PointNet architecture figure are trainable and adapt to the input. They serve the purpose of aligning the point cloud before classification. For example, the T-Net in the input transform is detailed in the figure below.

To classify individual points, the global feature is combined with the local features generated earlier in the classification network. The combined feature of each point is then processed individually into a point feature, which is used to predict the semantic class of that point.
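The combination step can be sketched as a simple concatenation, which is how I read the description above. The feature sizes below are toy values chosen for illustration, and `combine_local_and_global` is a hypothetical helper name.

```python
import numpy as np

def combine_local_and_global(local_features, global_feature):
    """Append the same global feature vector to every point's local feature."""
    n_points = local_features.shape[0]
    tiled = np.tile(global_feature, (n_points, 1))            # (N, G)
    return np.concatenate([local_features, tiled], axis=1)    # (N, L + G)

rng = np.random.default_rng(0)
local_features = rng.normal(size=(6, 4))    # toy: 6 points, 4-dim local features
global_feature = rng.normal(size=8)         # toy: 8-dim pooled global feature
combined = combine_local_and_global(local_features, global_feature)
print(combined.shape)
```

Each row of `combined` can then be fed to a shared per-point classifier, so every point sees both its own feature and a summary of the whole cloud.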

## SENet

SENet was the winner of the ImageNet classification competition in 2017. It reduced the top-5 error rate to 2.251% from the previous best of 2.991%. The key contribution of SENet is the introduction of the SE module as follows

The basic idea is very simple. For a feature tensor with $C$ channels, we want to adaptively adjust the contribution of each channel through training. Since there is no restriction on the order of the data inside a channel, the most rational way to summarize that information is simply through pooling. In the SE module, this is done with average pooling (squeezing). The “squeezed” data are then used to estimate the contribution of each channel (different levels of “excitation”). The computed weights are then used to scale the values of each channel. The SE module can be applied almost anywhere. For example, it can be combined with an Inception module to form an SE-Inception module, or with a ResNet module to form an SE-ResNet module, as shown in the figure below.
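A minimal NumPy sketch of the squeeze and excitation steps for a single `(C, H, W)` feature tensor. The bottleneck of two linear maps with a ReLU and a sigmoid follows the usual SE design as I understand it; the reduction ratio `r = 4` and the random weights are assumptions for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(x, w_reduce, w_expand):
    """x: (C, H, W). Returns the rescaled tensor and the per-channel weights."""
    squeezed = x.mean(axis=(1, 2))                   # squeeze: global average pool, (C,)
    hidden = np.maximum(w_reduce @ squeezed, 0.0)    # bottleneck with ReLU, (C // r,)
    scale = sigmoid(w_expand @ hidden)               # excitation: one weight per channel
    return x * scale[:, None, None], scale           # scale every channel by its weight

rng = np.random.default_rng(0)
C, r = 8, 4                                          # r is an assumed reduction ratio
x = rng.normal(size=(C, 6, 6))
w_reduce = rng.normal(size=(C // r, C))              # hypothetical, untrained weights
w_expand = rng.normal(size=(C, C // r))
y, scale = se_block(x, w_reduce, w_expand)
print(y.shape, scale.round(2))
```

Because the output has the same shape as the input, the block can be dropped after any convolutional stage without changing the rest of the network.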

## GPT3

OpenAI released its latest language model. I found the training compute comparison fascinating (~1000x BERT-base). The largest model has 175B parameters, and some say it cost \$5M to train. While it is definitely impressive, I agree with Yannic that probably no “reasoning” is involved. It is very likely that the model “remembers” the data somehow and “recalls” the best matches from all the training data.

Ref: video, paper

## Supervised contrastive learning

The idea of contrastive learning has been around for a while. It was introduced for unsupervised/semi-supervised learning. When we only have unlabelled data, we would like to train a representation that groups similar data together. The way to do that in contrastive learning is to perturb a target sample to generate positive samples. The perturbations can be translation, rotation, etc. We can then treat all other samples as negative samples. The goal is to introduce a contrastive loss that pushes negative samples away from the target sample and pulls the positive samples towards it.

Even for the case of supervised learning, the contrastive learning step can be used as a pre-training step to train the entire network (excluding the last classification layer). After the pretraining, we can train the last layer with labelled data while keeping all the other layers fixed.

The main innovation in this work is that, rather than treating all other samples as negative samples, it treats data with the same label as positive samples as well.
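A hedged NumPy sketch of one common form of such a loss: for every anchor, the positives are all other samples with the same label, and the softmax denominator runs over all other samples. The temperature value and the toy embeddings are assumptions, and this is not necessarily the exact variant used in the paper.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Mean over anchors of the average negative log-probability of their positives."""
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    not_self = ~np.eye(len(labels), dtype=bool)
    # Log of the softmax denominator over all samples except the anchor itself.
    masked = np.where(not_self, logits, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denominator = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = logits - log_denominator
    # Positives: same label as the anchor, excluding the anchor itself.
    positives = (labels[:, None] == labels[None, :]) & not_self
    n_pos = positives.sum(axis=1)
    has_pos = n_pos > 0                      # skip anchors without any positive
    per_anchor = -(log_prob * positives).sum(axis=1)[has_pos] / n_pos[has_pos]
    return per_anchor.mean()

embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
matching = supervised_contrastive_loss(embeddings, np.array([0, 0, 1, 1]))
mismatched = supervised_contrastive_loss(embeddings, np.array([0, 1, 0, 1]))
print(matching, mismatched)                  # loss is lower when labels match the clusters
```

With labels removed, each anchor would have only its own augmentations as positives, which recovers the usual self-supervised setup.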

Ref: paper, video

# Motivation

• Want to verify if someone has used your dataset for training

# Idea

• Introduce an extra (artificial) feature into the data, for example a cat feature into cat images and a dog feature into dog images
• Verify whether the dataset was used with hypothesis testing
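One way such a test could look, as a sketch under my own assumptions rather than the paper's exact procedure: suppose the mark is a fixed direction in feature space and we inspect the classifier weight of the marked class. Under the null hypothesis (the model never saw the mark), the cosine similarity between the weight and the mark behaves like that of a random direction, which in $d$ dimensions is roughly $\mathcal{N}(0, 1/d)$. The function name and the toy weights are hypothetical.

```python
import math
import numpy as np

def mark_p_value(weight, mark):
    """One-sided p-value for 'weight is unrelated to mark' (normal approximation)."""
    d = weight.shape[0]
    cosine = float(weight @ mark / (np.linalg.norm(weight) * np.linalg.norm(mark)))
    # For a random direction in d dimensions, cosine * sqrt(d) is roughly N(0, 1).
    return 0.5 * math.erfc(cosine * math.sqrt(d) / math.sqrt(2.0))

rng = np.random.default_rng(0)
d = 512
mark = rng.normal(size=d)
clean_weight = rng.normal(size=d)                  # toy model that never saw the mark
marked_weight = rng.normal(size=d) + 0.5 * mark    # toy weight that drifted toward the mark

p_clean = mark_p_value(clean_weight, mark)
p_marked = mark_p_value(marked_weight, mark)
print(p_clean, p_marked)
```

A very small p-value rejects the null hypothesis, which suggests the marked data was used for training.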

• It seems that the same classifier is used for both training and testing. If a different classifier is trained on the marked data, I am not sure the method will actually work
• The idea does not seem to be new, and the execution is quite doubtful. It reminds me of the watermarking techniques popular in the early 2000s, which never seemed to take off since they didn’t really work.

# Motivation:

• Most ML models become static once they are trained, or one has to retrain the model to keep it adapted to a new environment.

# Main idea:

• Instead of updating weights directly, update the rules (determined by the Hebbian coefficients) that update the weights

# Some details:

• Weights are updated by the Hebbian ABCD model, $\Delta w_{i,j} = \eta_w (A_w o_i o_j + B_w o_i + C_w o_j + D_w)$
• Let ${\bf h}$ be the vectorized Hebbian coefficients, ${\bf h}_{t+1} \leftarrow {\bf h}_t + \frac{\alpha}{n \sigma} \sum_{i=1}^n F_i ({\bf h}_t + \Delta {\bf h}_i)$, where $\Delta {\bf h}_i \sim \mathcal{N} ({\bf 0}, \sigma{\bf I})$ and $F_i$ is a fitness evaluation of ${\bf h}_t + \Delta {\bf h}_i$ (I am not completely certain how this is evaluated)
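The ABCD weight update for one layer can be sketched in NumPy as follows. I assume each connection has its own coefficients (as the subscript $w$ suggests), that $o_i$ is the presynaptic and $o_j$ the postsynaptic activation, and that `tanh` is the activation; the layer sizes and random values are illustrative. The evolutionary update of the coefficients is not covered here.

```python
import numpy as np

def hebbian_update(w, pre, post, A, B, C, D, eta):
    """One ABCD step. w, A, B, C, D and eta all have shape (n_pre, n_post)."""
    correlation = np.outer(pre, post)                # o_i * o_j for every connection
    delta = eta * (A * correlation + B * pre[:, None] + C * post[None, :] + D)
    return w + delta

rng = np.random.default_rng(0)
n_pre, n_post = 3, 2
w = rng.normal(size=(n_pre, n_post))
A, B, C, D = (rng.normal(size=(n_pre, n_post)) for _ in range(4))
eta = np.full((n_pre, n_post), 0.1)                  # assumed per-connection learning rate

pre = rng.normal(size=n_pre)                         # presynaptic activations o_i
post = np.tanh(pre @ w)                              # postsynaptic activations o_j
new_w = hebbian_update(w, pre, post, A, B, C, D, eta)
print(new_w - w)
```

The weights change during the lifetime of the agent, while the coefficients `A`, `B`, `C`, `D` and `eta` are what the outer optimization searches over.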

## Reformer

Problem statement:

• The transformer is computationally expensive when there are many keys and queries (the dot product between each pair must be computed). It can be memory intensive as well, depending on the implementation.

Proposed solution:

• Group keys and queries into “buckets” first and only compute the dot products of keys and queries in the same bucket.
• The buckets can be implemented as a form of “locality sensitive hashing”. Seriously, I always forget what LSH means. Can they create an even more obscure piece of jargon? (to be fair, I think they definitely could)
• I think a simple example for understanding LSH is the binary case. Assume closeness between two binary vectors is measured by the Hamming distance. Then, for say length-100 binary vectors, we can sort them into 4 bins just according to the values of the first two bits. Actually, this is just coset and syndrome coding but with a “degenerate” parity check matrix (one that cares about nothing but the first two bits). If we assume the bits are independent, this degenerate choice is okay. Otherwise, we can just use a random coset as in Slepian-Wolf coding.
• For the Reformer case, LSH is implemented as projections of the keys and queries onto random vectors. The signs of the resulting projections determine the bucket. This is exactly the generalization of Slepian-Wolf coding to the continuous case.
• Since they don’t want to store the activations of every layer for backpropagation, they make the layers reversible instead. It is basically the same as the lifting scheme in wavelet construction.
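A small NumPy sketch of the sign-of-random-projection bucketing described in the list above. It follows the description in these notes rather than the exact hashing in the Reformer code; the helper names, the number of projections, and the toy sizes are assumptions.

```python
import numpy as np

def lsh_buckets(vectors, projections):
    """Bucket id from the sign pattern of the projections onto random vectors."""
    bits = (vectors @ projections > 0).astype(int)             # (N, n_bits)
    return bits @ (1 << np.arange(projections.shape[1]))       # pack the bits into an integer

def bucketed_scores(queries, keys, projections):
    """Dot products only between queries and keys that fall in the same bucket."""
    q_bucket = lsh_buckets(queries, projections)
    k_bucket = lsh_buckets(keys, projections)
    scores = {}
    for b in np.unique(q_bucket):
        q_idx = np.flatnonzero(q_bucket == b)
        k_idx = np.flatnonzero(k_bucket == b)
        scores[int(b)] = (q_idx, k_idx, queries[q_idx] @ keys[k_idx].T)
    return scores

rng = np.random.default_rng(0)
queries = rng.normal(size=(32, 16))
keys = rng.normal(size=(32, 16))
projections = rng.normal(size=(16, 2))       # 2 random vectors give 4 buckets
scores = bucketed_scores(queries, keys, projections)
n_pairs = sum(block.size for _, _, block in scores.values())
print(n_pairs, "of", 32 * 32, "dot products computed")
```

Vectors pointing in similar directions tend to share the same sign pattern, so the pairs with large dot products are the ones most likely to survive the bucketing.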

Ref: Video, code, paper

## BYOL

It is kind of mysterious that this works without using negative samples for self-supervised learning. See video and paper

• The main idea is to train a representation network and a predictor so that the latter predicts the representation of an augmented version of the input.
• The representation network for the augmented data uses a moving average of the parameters of the current representation network. Similar tricks have been used in deep reinforcement learning
• It is indeed quite surprising that this works without negative samples, because nothing in the above model prevents it from converging to a trivial solution (everything maps to a constant)
• Experimental results look good, but should probably not be given too much weight. Their implementations of some older approaches have much higher prediction performance, and they pulled numbers from the papers (reasonable, though) for comparison. The approach is probably on par, and without negative samples they can train with a smaller batch size
• They are using 512 TPUs for training for 7 hours…
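The two mechanical pieces above, the moving-average parameters and the prediction objective, can be sketched as follows. The parameter dictionaries, the decay value, and the function names are illustrative; the loss is written as the squared distance between normalized vectors, which equals $2 - 2\cos$, and the target is treated as a constant.

```python
import numpy as np

def ema_update(target_params, online_params, decay=0.99):
    """Move each target parameter a small step toward the online parameter."""
    return {name: decay * target_params[name] + (1.0 - decay) * online_params[name]
            for name in target_params}

def byol_loss(prediction, target_projection):
    """2 - 2 * cosine similarity, averaged over the batch (target gets no gradient)."""
    p = prediction / np.linalg.norm(prediction, axis=1, keepdims=True)
    z = target_projection / np.linalg.norm(target_projection, axis=1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=1)))

online = {"w": np.ones((2, 2))}              # hypothetical parameter dictionaries
target = {"w": np.zeros((2, 2))}
target = ema_update(target, online, decay=0.9)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(target["w"], byol_loss(x, x), byol_loss(x, -x))
```

Nothing in this loss penalizes the collapsed solution where every input maps to the same vector, which is exactly why it is surprising that training does not end up there.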

## Linformer

video and paper.

Remarks:

• Project the keys and values to a lower dimension (along the sequence length) to save computation and space
• Some gain in speed, but it doesn’t look too significant. The tradeoff in performance seems larger than claimed
• Theorem 1, based on the JL lemma, did not use any properties of attention itself. It seems that the same argument could be applied anywhere (not just to attention). The theorem itself seems to be a bit of a stretch
• With the same goal of speeding up the transformer, the “kernelized transformer” appears to be the better work
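A sketch of the projection idea for a single attention head, as I understand it: the keys and values are multiplied by projection matrices that shrink the sequence axis from `n` to `k_dim`, so the attention matrix is `(n, k_dim)` instead of `(n, n)`. In the real model the projections are learned; here `e_proj` and `f_proj` are random placeholders and all sizes are toy values.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def linformer_attention(q, k, v, e_proj, f_proj):
    """q, k, v: (n, d). e_proj, f_proj: (k_dim, n) projections along the sequence axis."""
    k_low = e_proj @ k                                        # (k_dim, d)
    v_low = f_proj @ v                                        # (k_dim, d)
    weights = softmax(q @ k_low.T / np.sqrt(q.shape[-1]))     # (n, k_dim), not (n, n)
    return weights @ v_low                                    # (n, d)

rng = np.random.default_rng(0)
n, d, k_dim = 64, 8, 16
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
e_proj = rng.normal(size=(k_dim, n)) / np.sqrt(k_dim)         # placeholder, not learned
f_proj = rng.normal(size=(k_dim, n)) / np.sqrt(k_dim)
out = linformer_attention(q, k, v, e_proj, f_proj)
print(out.shape)
```

With identity projections (`k_dim == n`) this reduces to ordinary softmax attention, so the cost of the approximation depends entirely on how much information survives the projection.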

## Kernelizing transformer

Transformers are RNNs (paper) and (video):

Remarks:

• The kernelized transformer is not on par with the original transformer but can be much faster (1000x for some applications)
• The kernelized transformer can be modelled as an RNN, which can further speed up inference.
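To make the RNN view concrete, here is a hedged NumPy sketch of causal attention with a kernel feature map, computed once in the quadratic masked form and once as a recurrence over a running state. The feature map `elu(x) + 1` is my assumption of a typical positive map, and the sizes are toy values.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1: a positive feature map applied to queries and keys."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention_quadratic(q, k, v):
    """Reference form: build the full (n, n) causal score matrix."""
    scores = np.tril(feature_map(q) @ feature_map(k).T)
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def causal_linear_attention_rnn(q, k, v):
    """Same result, one step at a time with a fixed-size state."""
    state = np.zeros((q.shape[1], v.shape[1]))     # running sum of outer(phi(k_j), v_j)
    normalizer = np.zeros(q.shape[1])              # running sum of phi(k_j)
    outputs = []
    for q_t, k_t, v_t in zip(feature_map(q), feature_map(k), v):
        state += np.outer(k_t, v_t)
        normalizer += k_t
        outputs.append(q_t @ state / (q_t @ normalizer))
    return np.stack(outputs)

rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 4)) for _ in range(3))
out_rnn = causal_linear_attention_rnn(q, k, v)
out_quadratic = causal_linear_attention_quadratic(q, k, v)
print(np.allclose(out_rnn, out_quadratic))
```

The recurrent form needs only the state and the normalizer per step, so the cost of generating each new token does not grow with the sequence length.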