CDC data can be non-intuitive because they do not factor in the population size of each state. The videos show the per capita new cases and new deaths so far. A seven-day moving average is applied to smooth the data a bit.
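As a rough sketch of the normalization and smoothing described above (the function name, the per-100,000 scaling, and the sample numbers are my own assumptions, not the script used for the videos):

```python
import numpy as np

def per_capita_moving_average(daily_counts, population, window=7, per=100_000):
    """Scale daily counts to a per-capita rate, then smooth with a moving average."""
    rate = np.asarray(daily_counts, dtype=float) / population * per
    kernel = np.ones(window) / window
    # mode="valid" keeps only the days that have a full window of data
    return np.convolve(rate, kernel, mode="valid")

# Hypothetical daily new cases for a state with one million residents
daily_new_cases = [700, 1400, 700, 1400, 700, 1400, 700, 1400]
smoothed = per_capita_moving_average(daily_new_cases, population=1_000_000)
```

With eight days of input and a seven-day window, only two smoothed values come out, since the first six days do not have a full week behind them.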

# Author: samuel.cheng@ou.edu

KL divergence asymmetry pic.twitter.com/bvOAyCIMHw

— Ari Seff (@ari_seff) September 9, 2020

OpenAI released its latest language model. I found the training compute comparison fascinating (~1000x BERT-base). The large model has 175B parameters, and some said it cost $5M to train. While it is definitely impressive, I agree with Yannic that probably no “reasoning” is involved. It is very likely that the model “remembers” the data somehow and “recalls” the best matches from all training data.

The idea of contrastive learning has been around for a while. It was introduced for unsupervised/semi-supervised learning. When we only have unlabelled data, we would like to train a representation that groups similar data together. The way to do that in contrastive learning is to apply perturbations to a target sample to generate positive samples. The perturbations can be translation, rotation, etc. We can then treat all other samples as negative samples. The goal is to introduce a contrastive loss that pushes negative samples away from the target sample and pulls the positive samples towards the target sample.

Even for the case of supervised learning, the contrastive learning step can be used as a pre-training step to train the entire network (excluding the last classification layer). After the pretraining, we can train the last layer with labelled data while keeping all the other layers fixed.

The main innovation in this work is that, rather than treating all other samples as negative samples, it treats data with the same label as positive samples as well.
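To make the idea concrete, here is a minimal numpy sketch of a contrastive loss in which every other sample with the same label counts as a positive. The function name, the temperature value, and the toy features are my own assumptions, and this is a simplified form rather than the exact loss from the paper:

```python
import numpy as np

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Pull same-label samples together and push all other samples away."""
    labels = np.asarray(labels)
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    self_mask = np.eye(len(labels), dtype=bool)
    sim = np.where(self_mask, -np.inf, sim)  # a sample is never its own pair
    # log-softmax over all other samples (numerically stabilized)
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(np.exp(sim - sim_max).sum(axis=1, keepdims=True))
    # positives: other samples that share the anchor's label
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0  # skip anchors that have no positive in the batch
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_count[valid]).mean())

# Toy embeddings: the first two and the last two point in similar directions
features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_matching = supervised_contrastive_loss(features, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(features, [0, 1, 0, 1])
```

When the labels agree with the geometry of the embeddings the loss is low; when the labels pair up dissimilar embeddings the loss is higher, which is the signal that would drive the representation during pre-training.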

# Motivation

- Want to verify if someone has used your dataset for training

# Idea

- Introduce extra (unreal) feature to data. For example, cat feature to cat image, dog feature to dog image
- Verify if dataset was used with hypothesis testing
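A toy sketch of what such a hypothesis test could look like, assuming the mark is a fixed direction `u` added in feature space and that we test whether some vector taken from the suspect model (`v`) is unusually aligned with `u`. The function name, the Monte Carlo null, and all numbers are my own assumptions, not the paper's procedure:

```python
import numpy as np

def alignment_p_value(u, v, n_null=10_000, seed=0):
    """Monte Carlo p-value for 'v is no more aligned with u than a random direction'."""
    rng = np.random.default_rng(seed)
    u = u / np.linalg.norm(u)
    observed = float(u @ v / np.linalg.norm(v))
    # Null hypothesis: v points in a uniformly random direction
    null = rng.standard_normal((n_null, u.size))
    null_cos = (null @ u) / np.linalg.norm(null, axis=1)
    p = (1 + np.sum(null_cos >= observed)) / (1 + n_null)
    return observed, float(p)

rng = np.random.default_rng(1)
u = rng.standard_normal(64)        # hypothetical secret marking direction
clean = rng.standard_normal(64)    # vector from a model that never saw the mark
marked = clean + u                 # vector shifted towards the mark
cos_clean, p_clean = alignment_p_value(u, clean)
cos_marked, p_marked = alignment_p_value(u, marked)
```

A small p-value for the marked vector is evidence that the dataset was used; the clean vector should look like any other random direction.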

# Comments

- It seems that the same classifier is used for both training and testing. If a different classifier is trained on the marked data, I am not sure the method will actually work
- The “radioactive” images actually look very bad
- The idea does not seem to be new, and the execution is quite doubtful. It reminds me of the watermarking techniques popular in the early 2000s, which never seemed to take off since they don’t really work.

# Ref

# Motivation:

- Most ML models become static once they are trained. Otherwise, one has to retrain the model to keep it adaptive to a new environment.

# Main idea:

- Instead of updating weights directly, update the rules (determined by the Hebbian coefficients) that update the weights

# Some details:

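- Weights are updated by the Hebbian ABCD model, $\Delta w_{ij} = \eta_w (A_w o_i o_j + B_w o_i + C_w o_j + D_w)$, where $o_i$ and $o_j$ are the presynaptic and postsynaptic activations
- Let $h$ be the vectorized Hebbian coefficients, $h_{t+1} = h_t + \frac{\alpha}{n\sigma} \sum_{i=1}^{n} F_i \epsilon_i$, where $\epsilon_i \sim \mathcal{N}(0, I)$ and $F_i$ is a fitness evaluation of $h_t + \sigma \epsilon_i$ (I am not completely certain how this is evaluated)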
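A minimal sketch of the two updates, assuming the usual ABCD rule (weight change = eta * (A * pre * post + B * pre + C * post + D)) and a plain evolution-strategy step on the coefficient vector. The array shapes, function names, hyperparameters, and the toy fitness function are all my own assumptions:

```python
import numpy as np

def hebbian_abcd_update(w, pre, post, A, B, C, D, eta):
    """One Hebbian step. w, A, B, C, D, eta have shape (n_pre, n_post)."""
    corr = np.outer(pre, post)  # pre_i * post_j for every connection
    return w + eta * (A * corr + B * pre[:, None] + C * post[None, :] + D)

def es_step(h, fitness_fn, n=100, sigma=0.1, alpha=0.05, rng=None):
    """One evolution-strategy step on the vectorized Hebbian coefficients h."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n, h.size))                    # perturbations
    F = np.array([fitness_fn(h + sigma * e) for e in eps])    # fitness of each perturbed h
    return h + alpha / (n * sigma) * (F @ eps)

# Hebbian step on a 2x3 layer with only the correlation term switched on
w_new = hebbian_abcd_update(
    np.zeros((2, 3)), pre=np.array([1.0, 2.0]), post=np.array([1.0, 0.0, -1.0]),
    A=np.ones((2, 3)), B=0.0, C=0.0, D=0.0, eta=0.5,
)

# Toy fitness: higher is better, maximized when h equals a hypothetical target
target = np.ones(4)
h = np.zeros(4)
rng = np.random.default_rng(0)
for _ in range(100):
    h = es_step(h, lambda cand: -np.sum((cand - target) ** 2), rng=rng)
```

In the actual method the fitness would presumably come from running an agent whose weights evolve under the candidate coefficients; the quadratic fitness here is only a stand-in so the sketch runs on its own.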

Ref: video, paper, twitter posts

Problem statement:

- A Transformer is computationally expensive when there are many keys and queries, since a dot product must be computed for every key-query pair. It can be memory intensive as well, depending on the implementation.

Proposed solution:

- Group keys and queries into “buckets” first and only compute dot product of keys and queries in the same bucket.
- Buckets can be implemented as a form of “locality sensitive hashing”. Seriously, I always forget what LSH means. Can they create even more obscure jargon? (To be fair, I think they definitely could)
- I think a simple example for understanding LSH is the binary case. Assume two binary vectors are close according to the Hamming distance. Then, for, say, length-100 binary vectors, we can sort them into 4 bins just according to the values of the first two bits. Actually, it is just coset and syndrome coding but with the parity check matrix “degenerated” (it cares about nothing but the first two bits). If we assume the bits are independent, this degenerate choice is okay. Otherwise, we can just use a random coset as in Slepian-Wolf coding.
- For the Reformer case, LSH is implemented as projections of the keys and queries onto random vectors. The signs of the resulting projections determine the bucket. This is exactly the generalization of Slepian-Wolf coding to the continuous case.

- Since they don’t want to store the activations of every layer for backpropagation, they make the layers reversible instead. It is basically the same as the lifting scheme in wavelet construction.
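The bucketing step can be sketched as follows, using the signs of random projections as the hash, as described above. This is a simplification (the paper's actual hashing scheme differs in detail), and the function names and sizes are my own assumptions:

```python
import numpy as np

def hash_to_bucket(x, planes):
    """Bucket id of each row of x, built from the signs of its random projections."""
    bits = (x @ planes) > 0                    # one sign bit per random vector
    weights = 1 << np.arange(planes.shape[1])  # 1, 2, 4, ...
    return bits.astype(int) @ weights

def bucketed_scores(q, k, q_buckets, k_buckets):
    """Dot products only for query-key pairs in the same bucket; -inf elsewhere."""
    scores = np.full((len(q), len(k)), -np.inf)
    for b in np.unique(q_buckets):
        qi = np.flatnonzero(q_buckets == b)
        ki = np.flatnonzero(k_buckets == b)
        if ki.size:
            scores[np.ix_(qi, ki)] = q[qi] @ k[ki].T
    return scores

rng = np.random.default_rng(0)
planes = rng.standard_normal((16, 3))   # 3 random vectors -> 8 buckets
q = rng.standard_normal((8, 16))
# First four keys point the same way as their queries, last four the opposite way
k = np.vstack([2.0 * q[:4], -q[4:]])
qb = hash_to_bucket(q, planes)
kb = hash_to_bucket(k, planes)
scores = bucketed_scores(q, k, qb, kb)
```

Keys that point in the same direction as a query always share its bucket, while keys pointing the opposite way never do, so their dot products are simply skipped. Note that the same random vectors must be used for both keys and queries.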

As a long-term Matlab user, I have almost completely switched to Python. But I wasn’t paying attention to the different reshape behavior of numpy vs Matlab, and it cost me a night to catch a nasty bug because of that. I always assumed that when I “vectorize” a matrix, it will expand along the row index first, as in Matlab. It turns out that numpy’s default behavior is to expand along the column index first. This is known as ‘C’ order, as opposed to the Fortran order used in Matlab. To override the default behavior, simply set the order argument to ‘F’. For example,

`z = x.reshape(3,4,order='F')`
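A small side-by-side comparison makes the difference easier to see (the variable names are just for illustration):

```python
import numpy as np

x = np.arange(12)                      # 0, 1, ..., 11

c = x.reshape(3, 4)                    # default order='C': fills row by row
f = x.reshape(3, 4, order='F')         # order='F': fills column by column, like Matlab

first_row_c = c[0].tolist()            # [0, 1, 2, 3]
first_row_f = f[0].tolist()            # [0, 3, 6, 9]
first_col_f = f[:, 0].tolist()         # [0, 1, 2]

# "Vectorizing" back: the order must match the one used to build the matrix
vec_c = c.flatten()                    # C order, recovers x
vec_f = f.flatten(order='F')           # Fortran order, recovers x
vec_mixed = f.flatten()                # C order on an F-filled matrix scrambles x
```

The same `order` argument is accepted by `flatten` and `ravel`, so the fix applies in both directions.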