

  • Projects the embedding to a lower dimension to reduce computational complexity and memory usage
  • There is some gain in speed, but it does not look significant; the tradeoff in performance seems larger than claimed
  • Theorem 1, which is based on the JL lemma, does not use any property of attention itself; the same argument could be applied almost anywhere (not just to attention). The theorem itself seems to be a bit of a stretch
  • With the same goal of speeding up the transformer, the “kernelized transformer” appears to be the stronger work
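The low-rank projection idea in the first bullet can be sketched as follows. This is a minimal NumPy sketch, not the paper's exact construction: the projection matrices `E` and `F` and their Gaussian (JL-style) initialization are my assumptions, made to illustrate how projecting keys and values from sequence length n down to k turns the O(n²) attention map into an O(nk) one.

```python
import numpy as np

def lowrank_attention(Q, K, V, k, rng=None):
    """Single-head attention with keys/values projected from length n to k.

    Q, K, V: (n, d) arrays. E and F are hypothetical (k, n) random
    projection matrices in the spirit of a JL projection; the score
    matrix is (n, k) instead of (n, n).
    """
    rng = np.random.default_rng(rng)
    n, d = K.shape
    E = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))  # projects keys
    F = rng.normal(0.0, 1.0 / np.sqrt(k), size=(k, n))  # projects values
    scores = Q @ (E @ K).T / np.sqrt(d)                 # (n, k) score matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ (F @ V)                            # (n, d) output
```

For fixed k the cost is linear in n, which is where the claimed speedup comes from; whether the JL-style argument justifies a small k in practice is exactly the concern raised in the bullets above.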
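For comparison, the kernelized-transformer approach mentioned in the last bullet avoids the n×n matrix by replacing the softmax with a feature map and reassociating the matrix product. A minimal sketch, assuming a simple positive feature map (the ReLU-plus-epsilon map here is an illustrative choice, not the method from any specific paper):

```python
import numpy as np

def kernel_attention(Q, K, V, feature=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Linear-time attention: softmax(QK^T)V is approximated by
    phi(Q) (phi(K)^T V) / normalizer, computed right-to-left so that
    only d x d intermediates are formed, never an n x n matrix.
    """
    Qf, Kf = feature(Q), feature(K)        # (n, d) positive features
    KV = Kf.T @ V                          # (d, d), computed once
    normalizer = Qf @ Kf.sum(axis=0)       # (n,) per-row normalization
    return (Qf @ KV) / normalizer[:, None] # (n, d) output
```

The key design point is the reassociation: computing `Kf.T @ V` first makes the whole pass O(n·d²) instead of O(n²·d), with no random projection involved.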

Copyright OU-Tulsa Lab of Image and Information Processing 2021