- Transformer attention is computationally expensive when we have many keys and queries, since we need to compute a dot product for every query-key pair (so the cost grows quadratically with sequence length). It can be memory intensive as well, depending on the implementation.
- Group keys and queries into “buckets” first and only compute dot product of keys and queries in the same bucket.
- Bucketing can be implemented as a form of “locality-sensitive hashing” (LSH). Seriously, I always forget what LSH means. Could they create even more obscure jargon? (To be fair, I think they definitely could.)
- I think a simple example for understanding LSH is the binary case. Assume two binary vectors are close when their Hamming distance is small. Then, for, say, length-100 binary vectors, we can sort them into 4 bins according to the values of the first two bits alone. Actually, it is just coset and syndrome coding with a “degenerate” parity-check matrix (one that cares about nothing but the first two bits). If we assume the bits are independent, this degenerate choice is okay. Otherwise, we can just use a random coset as in Slepian-Wolf coding.
- For the Reformer case, LSH is implemented as projections of the keys and queries onto random vectors. The signs of the resulting projections determine the bucket. (The paper's actual hash takes an argmax over random rotations, but the idea is the same.) This is exactly the generalization of Slepian-Wolf coding to the continuous case.
- Separately, since they don’t want to store the activations of every layer for backpropagation, they make the layers reversible instead, so each layer's input can be recomputed from its output. It is basically the same as the lifting scheme in wavelet construction.
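
The binary example above can be sketched in a few lines. This is only an illustration of the "first two bits" binning; the function name `prefix_bucket` and the choice of flipping 3 bits are my own assumptions, not anything from the paper.

```python
import numpy as np

def prefix_bucket(bits, n_prefix=2):
    """Hypothetical helper: bucket id = first n_prefix bits read as a binary number."""
    bits = np.asarray(bits)
    weights = 2 ** np.arange(n_prefix - 1, -1, -1)  # [2, 1] for n_prefix=2
    return bits[..., :n_prefix] @ weights

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100)                 # a length-100 binary vector
y = x.copy()
y[rng.choice(100, size=3, replace=False)] ^= 1   # a neighbor at Hamming distance 3

# x and y share a bucket unless one of the flipped bits is among the first two.
same_bucket = prefix_bucket(x) == prefix_bucket(y)
```

With 3 flips out of 100 positions, most close pairs keep the same first two bits, which is all the "locality sensitive" part amounts to here.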
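
The sign-of-random-projection bucketing can be sketched as follows, together with attention restricted to each bucket. This is a minimal NumPy sketch of the idea as described in these notes, not the Reformer implementation (which, as far as I know, uses a different hash and sorts and chunks the sequence). The names `sign_lsh_buckets` and `bucketed_attention` are made up for illustration, and giving a zero output to a query whose bucket contains no key is my own assumption.

```python
import numpy as np

def sign_lsh_buckets(x, proj):
    """x: (n, d), proj: (d, n_bits). Bucket id in [0, 2**n_bits) from projection signs."""
    bits = (x @ proj > 0).astype(np.int64)    # one sign bit per random vector
    weights = 2 ** np.arange(proj.shape[1])   # pack the bits into an integer
    return bits @ weights

def bucketed_attention(q, k, v, proj):
    """Softmax attention where each query only attends to keys in its own bucket."""
    qb, kb = sign_lsh_buckets(q, proj), sign_lsh_buckets(k, proj)
    out = np.zeros((q.shape[0], v.shape[1]))
    for b in np.unique(qb):
        qi = np.flatnonzero(qb == b)
        ki = np.flatnonzero(kb == b)
        if ki.size == 0:
            continue  # assumption: no key in this bucket, leave the output at zero
        s = q[qi] @ k[ki].T / np.sqrt(q.shape[1])      # dot products within the bucket only
        w = np.exp(s - s.max(axis=1, keepdims=True))   # numerically stable softmax
        w /= w.sum(axis=1, keepdims=True)
        out[qi] = w @ v[ki]
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
v = rng.normal(size=(8, 3))
proj = rng.normal(size=(4, 2))   # 2 random vectors, so up to 4 buckets
out = bucketed_attention(q, k, v, proj)
```

Note that the bucket depends only on the direction of a vector, not its length, so this is really a hash for angular similarity.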
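
The reversible step has the same shape as a lifting step: split the input into two halves and update each half using only the other. A minimal sketch, where `f` and `g` are arbitrary placeholder functions (in Reformer they would be the attention and feed-forward sublayers):

```python
import numpy as np

def rev_forward(x1, x2, f, g):
    """One reversible block: each half is updated from the other half only."""
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)
    return y1, y2

def rev_inverse(y1, y2, f, g):
    """Recover the inputs from the outputs by undoing the two updates in reverse order."""
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)
    return x1, x2

rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=5), rng.normal(size=5)
f = np.tanh                 # placeholder sublayers; neither is invertible
g = lambda z: z ** 2
y1, y2 = rev_forward(x1, x2, f, g)
r1, r2 = rev_inverse(y1, y2, f, g)   # matches (x1, x2) up to floating-point error
```

The point is that `f` and `g` themselves never need to be inverted, only re-evaluated, which is exactly why lifting steps are always invertible too.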