Deep Learning

Radioactive data: tracing through training

September 6, 2020 by samuel.cheng@ou.edu - No comments

Motivation

Want to verify if someone has used your dataset for training

Idea

Introduce an extra (unreal) feature into the data. For example, add a “cat” feature to cat images and a “dog” feature to dog images
Verify if dataset was used with hypothesis testing
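
A rough sketch of what such a hypothesis test could look like. It assumes the mark is a fixed random "carrier" direction in feature space and that the test statistic is the cosine similarity between the carrier and a classifier weight vector. The names `carrier`, `clean_w`, `marked_w` and the Monte Carlo p-value are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def p_value_monte_carlo(observed_cos, dim, n_samples=20000, seed=0):
    """Estimate P(cos >= observed_cos) when the weights are independent of the carrier."""
    rng = np.random.default_rng(seed)
    # Under the null hypothesis the weight direction is uniform on the sphere,
    # so its cosine with a fixed carrier behaves like the first coordinate
    # of a random unit vector.
    g = rng.standard_normal((n_samples, dim))
    null_cos = g[:, 0] / np.linalg.norm(g, axis=1)
    # Add-one smoothing so the estimate is never exactly zero.
    return float((np.sum(null_cos >= observed_cos) + 1) / (n_samples + 1))

rng = np.random.default_rng(1)
dim = 64
carrier = rng.standard_normal(dim)
carrier /= np.linalg.norm(carrier)

clean_w = rng.standard_normal(dim)      # classifier that never saw the mark
marked_w = clean_w + 8.0 * carrier      # hypothetical: weights pulled toward the carrier

p_clean = p_value_monte_carlo(cosine(clean_w, carrier), dim)
p_marked = p_value_monte_carlo(cosine(marked_w, carrier), dim)
print(p_clean, p_marked)
```

A small p-value for `marked_w` says the alignment with the carrier is unlikely to be a coincidence, which is the evidence that the marked data was used.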

Comments

It seems that the same classifier is used for both training and testing. If a different classifier is trained on the marked data, I am not sure the method will actually work
The “radioactive” images actually look very bad
The idea does not seem to be new, and the execution is quite doubtful. It reminds me of the watermarking techniques popular in the early 2000s, which never seemed to take off since they didn’t really work.

Ref

Video, paper

Deep Learning

Meta-Learning through Hebbian Plasticity in Random Networks

August 25, 2020 by samuel.cheng@ou.edu - No comments

Motivation:

Most ML models become static once they are trained. Otherwise, one has to retrain the model to keep it adaptive to a new environment.

Main idea:

Instead of updating weights directly, update the rules (determined by the Hebbian coefficients) that update the weights

Some details:

Weights are updated by Hebbian ABCD model, $latex \Delta w_{i,j} = \eta_w (A_w o_i o_j + B_w o_i +C_w o_j + D_w)$
Let $latex {\bf h}$ be the vectorized Hebbian coefficients, $latex {\bf h}_{t+1} \leftarrow {\bf h}_t + \frac{\alpha} {n \sigma} \sum_{i=1}^n F_i ({\bf h}_t + \Delta {\bf h}_i) $, where $latex \Delta {\bf h}_i \sim \mathcal{N} ({\bf 0}, \sigma{\bf I})$ and $latex F_i$ is a fitness evaluation of $latex {\bf h}_t + \Delta {\bf h}_i$ (I am not completely certain how this is evaluated)
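
A minimal sketch of the two pieces, under stated assumptions. `hebbian_update` applies the ABCD rule with a single set of coefficients shared by all synapses (a simplification made here). `es_step` is a generic evolution-strategies update that weights each sampled perturbation by its mean-centred fitness; this is one common way such an update is realized and may differ from the paper's exact form. The quadratic `toy_fitness` and `target` are hypothetical stand-ins for a real task reward.

```python
import numpy as np

def hebbian_update(w, pre, post, coeffs, eta=0.1):
    """dw_ij = eta * (A*o_i*o_j + B*o_i + C*o_j + D), with w of shape (n_pre, n_post)."""
    A, B, C, D = coeffs
    dw = A * np.outer(pre, post) + B * pre[:, None] + C * post[None, :] + D
    return w + eta * dw

def es_step(h, fitness, rng, alpha=0.1, sigma=0.1, n=200):
    """One evolution-strategies step on the coefficient vector h."""
    eps = rng.standard_normal((n, h.size))                 # sampled perturbations
    scores = np.array([fitness(h + sigma * e) for e in eps])
    scores = scores - scores.mean()                        # baseline to reduce variance
    return h + alpha / (n * sigma) * (eps.T @ scores)

# Toy fitness (assumption): coefficients are better the closer they are to `target`.
target = np.array([1.0, -0.5, 0.25, 0.0])

def toy_fitness(h):
    return -np.sum((h - target) ** 2)

rng = np.random.default_rng(0)
h = np.zeros(4)
initial_dist = np.linalg.norm(h - target)
for _ in range(100):
    h = es_step(h, toy_fitness, rng)
final_dist = np.linalg.norm(h - target)

# Use the evolved coefficients in one Hebbian step on a random 3x2 layer.
w = rng.standard_normal((3, 2))
w_new = hebbian_update(w, pre=np.ones(3), post=np.ones(2), coeffs=h)
print(initial_dist, final_dist)
```

In a real setup the fitness would come from running the agent in its environment with the weights evolving under the Hebbian rule during the episode.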

Ref: video, paper, twitter posts

Deep Learning

Reformer

August 19, 2020 by samuel.cheng@ou.edu - No comments

Problem statement:

The Transformer is computationally expensive when there are too many keys and queries (one needs to compute the dot product between each pair). It can be memory intensive as well, depending on the implementation.

Proposed solution:

Group keys and queries into “buckets” first and only compute the dot products of keys and queries in the same bucket.
Bucketing can be implemented as a form of “locality sensitive hashing” (LSH). Seriously, I always forget what LSH means. Can they create even more obscure jargon? (To be fair, I think they definitely could.)
- I think a simple example to understand LSH is the binary case. Assume two binary vectors are close according to the Hamming distance. Then, for say length-100 binary vectors, we can sort them into 4 bins just according to the values of the first two bits. Actually, it is just coset and syndrome coding but with the parity check matrix “degenerated” (caring about nothing but the first two bits). If we assume the bits are independent, this degenerate choice is okay. Otherwise, we can just use a random coset as in Slepian-Wolf coding.
- For the Reformer case, LSH is implemented as projections of the keys and queries onto random vectors. The signs of the resulting projections determine the bucket. This is exactly the generalization of Slepian-Wolf coding to the continuous case.
Since they don’t want to store the activations of every layer for backpropagation, they make the layers reversible instead. It is basically the same as the lifting scheme in wavelet construction.
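
The sign-of-random-projection bucketing described above can be sketched as follows. This is only an illustration of the idea as stated in this post: the actual Reformer hashing scheme differs in its details, and `lsh_buckets`, `n_bits` and the sizes are names and values chosen here.

```python
import numpy as np

def lsh_buckets(x, projections):
    """Hash each row of x to a bucket id from the signs of its random projections."""
    bits = (x @ projections.T) > 0                  # (n, n_bits) sign pattern
    weights = 2 ** np.arange(projections.shape[0])  # 1, 2, 4, ...
    return bits.astype(np.int64) @ weights          # bucket id in [0, 2**n_bits)

rng = np.random.default_rng(0)
d, n_bits = 16, 2                                   # 2 bits -> 4 buckets
projections = rng.standard_normal((n_bits, d))

queries = rng.standard_normal((8, d))
keys = rng.standard_normal((8, d))
q_bucket = lsh_buckets(queries, projections)
k_bucket = lsh_buckets(keys, projections)

# mask[i, j] is True when query i and key j share a bucket,
# so only those dot products would be computed.
mask = q_bucket[:, None] == k_bucket[None, :]
print(q_bucket, k_bucket, mask.sum())
```

Vectors pointing in similar directions tend to fall on the same side of each random hyperplane, so they tend to share a bucket; more bits give smaller buckets.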

Ref: Video, code, paper

Uncategorized

NVIDIA’s AI Recreated PacMan

August 15, 2020 by samuel.cheng@ou.edu - No comments

As titled, it is amazing yet scary. See the blog post, site, and paper.

teaching

Numpy vs Matlab reshape

August 3, 2020 by samuel.cheng@ou.edu - No comments

As a long-term Matlab user, I have almost completely switched to Python. But I wasn’t paying attention to the different reshape behavior of numpy vs Matlab, and it cost me a night to catch a nasty bug because of that. I always assumed that when I “vectorize” a matrix, the row index varies first, as in Matlab. It turns out that numpy’s default behavior is to vary the column index first. This is known as ‘C’ order, as opposed to the Fortran order used in Matlab. To override the default behavior, simply set order to ‘F’ in the argument. For example,

z = x.reshape(3,4,order='F')
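
A small self-contained comparison of the two orders, using a made-up 12-element array:

```python
import numpy as np

x = np.arange(12)                        # 0, 1, ..., 11

c_order = x.reshape(3, 4)                # default order='C': fills row by row
f_order = x.reshape(3, 4, order='F')     # Matlab-like: fills column by column

print(c_order)
print(f_order)

# Flattening follows the same two conventions.
back_c = f_order.ravel()                 # reads row by row, so it does not recover x
back_f = f_order.ravel(order='F')        # reads column by column, recovers x
```

The safe habit is to use the same `order` for every reshape and flatten in code ported from Matlab.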

robotic

Underactuated robotics

July 15, 2020 by samuel.cheng@ou.edu - No comments

Watched the first lecture of Underactuated Robotics by Prof. Tedrake. It was great. His lecture notes/book are available online, and the example code is directly available on Colab.

So what is underactuated robotics? Consider a standard manipulator equation with state $latex q$

$latex M(q) \ddot{q}+C(q,\dot{q}) \dot{q} = \tau_g(q) + B(q) u,$

where the L.H.S. contains the “Ma” terms, the R.H.S. contains the force terms, $latex M(q)$ is the mass/inertia matrix and is positive definite, $latex u$ is the control input, and $latex B(q)$ maps the control input to generalized forces.

We can rearrange the above to

$latex \ddot{q}= M(q)^{-1} [ \tau_g(q) + B(q) u - C(q,\dot{q} )\dot{q}] =\underset{f_1(q,\dot{q})}{\underbrace{M(q)^{-1}[ \tau_g(q) - C(q,\dot{q} )\dot{q}]}} +\underset{f_2(q,\dot{q})}{\underbrace{M(q)^{-1} B(q) }}u .$

Note that if $latex f_2(q,\dot{q})$ has full row rank (or simply $latex B(q)$ has full row rank since $latex M(q)$ is positive definite and hence full-rank), then for any desired $latex \ddot{q}^d$, we can achieve that by picking $latex u$ as

$latex u = f_2^{\dagger} (q,\dot{q}) (\ddot{q}^{d} - f_1(q,\dot{q})),$ where $latex f_2^{\dagger}$ is the pseudo-inverse of $latex f_2$. We say such a robotic system is fully actuated.

On the other hand, if $latex f_2(q,\dot{q})$ does not have full row rank, the above trivial controller will not work. We then have a much more challenging and interesting scenario. And we say the robotic system is underactuated.
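
A small numerical sketch of the rank condition and the pseudo-inverse controller. The two-joint matrices and vectors are made-up numbers chosen only for illustration, and `f1` is simply given rather than computed from a real model.

```python
import numpy as np

def is_fully_actuated(M, B):
    """f_2 = M^{-1} B has full row rank exactly when B does, since M is invertible."""
    f2 = np.linalg.solve(M, B)
    return np.linalg.matrix_rank(f2) == f2.shape[0]

def control_for_acceleration(M, B, f1, qdd_desired):
    """u = pinv(f_2) (qdd_desired - f_1); exact only when f_2 has full row rank."""
    f2 = np.linalg.solve(M, B)
    return np.linalg.pinv(f2) @ (qdd_desired - f1)

# Hypothetical 2-joint numbers at one particular state.
M = np.array([[2.0, 0.5], [0.5, 1.0]])    # positive definite inertia matrix
f1 = np.array([0.3, -1.2])                # stands in for M^{-1}(tau_g - C qdot)
qdd_desired = np.array([1.0, -0.5])

B_full = np.eye(2)                        # a motor on every joint
B_under = np.array([[0.0], [1.0]])        # a motor on the second joint only

results = []
for B in (B_full, B_under):
    u = control_for_acceleration(M, B, f1, qdd_desired)
    achieved = f1 + np.linalg.solve(M, B) @ u
    results.append((is_fully_actuated(M, B), np.allclose(achieved, qdd_desired)))
print(results)
```

With a motor on every joint the desired acceleration is reached exactly. With a single motor the pseudo-inverse only gives the least-squares closest acceleration, which is why the trivial controller fails in the underactuated case.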

Uncategorized

NVAE

July 12, 2020 by samuel.cheng@ou.edu - No comments

Paper, video

This one generates high-resolution images with a hierarchical variational autoencoder

Uncategorized

NLP scholar

July 9, 2020 by samuel.cheng@ou.edu - No comments

A very nice visualization to explore NLP papers

Deep Learning

Linformer

July 6, 2020 by samuel.cheng@ou.edu - No comments

Video and paper.

Remarks:

Project the keys and values to a lower dimension (along the sequence length) to save computation and space
Some gain in speed, but it doesn’t look too significant. The tradeoff in performance seems larger than claimed
Theorem 1, based on the JL lemma, does not use any properties of attention itself. It seems that the same argument could be applied anywhere (not just to attention). The theorem itself seems to be a bit of a stretch
With the same goal of speeding up the transformer, the “kernelized transformer” appears to be a better work
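
A minimal single-head sketch of the projection idea, assuming the projection is applied to the keys and values along the sequence length. The matrices `E` and `F` are random here purely for illustration (they would be learned in practice), and all sizes are made up.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    """Q, K, V: (n, d). E, F: (k, n) project the sequence length n down to k."""
    K_proj = E @ K                                  # (k, d)
    V_proj = F @ V                                  # (k, d)
    scores = Q @ K_proj.T / np.sqrt(Q.shape[1])     # (n, k) instead of (n, n)
    return softmax(scores) @ V_proj                 # (n, d)

rng = np.random.default_rng(0)
n, d, k = 128, 16, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(k)        # random stand-in for a learned projection
F = rng.standard_normal((k, n)) / np.sqrt(k)

out = linformer_attention(Q, K, V, E, F)
print(out.shape)
```

The score matrix is n-by-k rather than n-by-n, so the cost grows linearly with the sequence length for a fixed k.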

Deep Learning

Kernelizing transformer

July 4, 2020 by samuel.cheng@ou.edu - No comments

Transformers are RNNs (paper and video):

Remarks:

The kernelized transformer is not on par with the original transformer in performance but can be much faster (1000x faster for some applications)
The kernelized transformer can be modelled as an RNN, which helps further speed up inference.
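
A small sketch of why the kernelized (linear) attention can be run as an RNN. It assumes the feature map elu(x) + 1 and causal masking; the function names and sizes are chosen here for illustration. The parallel and recurrent forms below compute the same output.

```python
import numpy as np

def phi(x):
    """Feature map elu(x) + 1, which keeps every feature positive."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Parallel form: causally masked phi(Q) phi(K)^T, normalised per row."""
    scores = np.tril(phi(Q) @ phi(K).T)
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def recurrent_linear_attention(Q, K, V):
    """Same output computed step by step with a fixed-size state (S, z)."""
    d, m = Q.shape[1], V.shape[1]
    S = np.zeros((d, m))          # running sum of phi(k_j) v_j^T
    z = np.zeros(d)               # running sum of phi(k_j)
    out = np.empty((Q.shape[0], m))
    for i, (q, k, v) in enumerate(zip(phi(Q), phi(K), V)):
        S += np.outer(k, v)
        z += k
        out[i] = (q @ S) / (q @ z)
    return out

rng = np.random.default_rng(0)
n, d = 10, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
parallel = causal_linear_attention(Q, K, V)
recurrent = recurrent_linear_attention(Q, K, V)
print(np.allclose(parallel, recurrent))
```

The recurrent form never builds the n-by-n score matrix; each new token only updates `S` and `z`, which is where the inference speed-up comes from.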

Author: samuel.cheng@ou.edu
