Remarks:
- Projects the embeddings to a lower dimension to reduce computational complexity and memory (see the sketch after these remarks)
- There is some gain in speed, but it does not look significant; the performance tradeoff appears larger than claimed
- Theorem 1, based on the JL lemma, does not use any properties of attention itself; since the lemma holds for arbitrary point sets, the same argument could be applied almost anywhere, not just to attention (see the lemma statement below). The theorem itself seems like a bit of a stretch
- With the same goal of speeding up the Transformer, the “kernelized transformer” appears to be the stronger work
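
To make the first remark concrete, here is a minimal sketch of the kind of low-rank projection being discussed, assuming a Linformer-style projection of keys and values along the sequence axis; the function name, shapes, and projection matrices are my own illustration, not the paper's implementation.

```python
import torch

def low_rank_attention(q, k, v, proj_k, proj_v):
    """Attention with keys/values projected along the sequence axis,
    shrinking the n x n score matrix to n x r (r << n).

    q, k, v:        (batch, n, d) query/key/value tensors
    proj_k, proj_v: (r, n) projection matrices (illustrative, not the paper's)
    """
    d = q.shape[-1]
    k_proj = proj_k @ k                                  # (batch, r, d): n keys compressed to r
    v_proj = proj_v @ v                                  # (batch, r, d): n values compressed to r
    scores = q @ k_proj.transpose(-2, -1) / d ** 0.5     # (batch, n, r) instead of (batch, n, n)
    attn = scores.softmax(dim=-1)
    return attn @ v_proj                                 # (batch, n, d)

# Example: n = 1024 tokens compressed to r = 64
b, n, d, r = 2, 1024, 64, 64
q, k, v = (torch.randn(b, n, d) for _ in range(3))
proj_k = torch.randn(r, n) / n ** 0.5
proj_v = torch.randn(r, n) / n ** 0.5
out = low_rank_attention(q, k, v, proj_k, proj_v)
print(out.shape)  # torch.Size([2, 1024, 64])
```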
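
For reference, the standard form of the JL lemma that the remark on Theorem 1 refers to; it is a statement about arbitrary point sets, which is why the argument carries over to settings other than attention (textbook statement, not quoted from the paper):

```latex
% Johnson–Lindenstrauss lemma (standard form): for any 0 < \epsilon < 1 and any
% n points x_1,\dots,x_n \in \mathbb{R}^d, there exists a linear map
% R \in \mathbb{R}^{k \times d} with k = O(\epsilon^{-2} \log n) such that
(1-\epsilon)\,\|x_i - x_j\|^2 \;\le\; \|Rx_i - Rx_j\|^2 \;\le\; (1+\epsilon)\,\|x_i - x_j\|^2
\quad \text{for all } i, j.
```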