Remarks:
- Projects the embeddings to a lower dimension to reduce computational complexity and memory (see the sketch after these remarks)
- There is some gain in speed, but it does not look significant; the performance tradeoff appears larger than claimed
- Theorem 1, based on the JL lemma, does not use any properties of attention itself; since the lemma holds for arbitrary point sets, the same argument could be applied almost anywhere, not just to attention (see the lemma statement below). The theorem itself seems like a bit of a stretch
- With the same goal of speeding up the Transformer, the “kernelized transformer” appears to be the stronger work
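
To make the first remark concrete, here is a minimal sketch of the kind of low-rank projection being discussed, assuming a Linformer-style projection of keys and values along the sequence axis; the function name, shapes, and projection matrices are my own illustration, not the paper's implementation.

```python
import torch

def low_rank_attention(q, k, v, proj_k, proj_v):
    """Attention with keys/values projected along the sequence axis,
    shrinking the n x n score matrix to n x r (r << n).

    q, k, v:        (batch, n, d) query/key/value tensors
    proj_k, proj_v: (r, n) projection matrices (illustrative, not the paper's)
    """
    d = q.shape[-1]
    k_proj = proj_k @ k                                  # (batch, r, d): n keys compressed to r
    v_proj = proj_v @ v                                  # (batch, r, d): n values compressed to r
    scores = q @ k_proj.transpose(-2, -1) / d ** 0.5     # (batch, n, r) instead of (batch, n, n)
    attn = scores.softmax(dim=-1)
    return attn @ v_proj                                 # (batch, n, d)

# Example: n = 1024 tokens compressed to r = 64
b, n, d, r = 2, 1024, 64, 64
q, k, v = (torch.randn(b, n, d) for _ in range(3))
proj_k = torch.randn(r, n) / n ** 0.5
proj_v = torch.randn(r, n) / n ** 0.5
out = low_rank_attention(q, k, v, proj_k, proj_v)
print(out.shape)  # torch.Size([2, 1024, 64])
```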
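
For reference, the standard form of the JL lemma that the remark on Theorem 1 refers to; it is a statement about arbitrary point sets, which is why the argument carries over to settings other than attention (textbook statement, not quoted from the paper):

```latex
% Johnson–Lindenstrauss lemma (standard form): for any 0 < \epsilon < 1 and any
% n points x_1,\dots,x_n \in \mathbb{R}^d, there exists a linear map
% R \in \mathbb{R}^{k \times d} with k = O(\epsilon^{-2} \log n) such that
(1-\epsilon)\,\|x_i - x_j\|^2 \;\le\; \|Rx_i - Rx_j\|^2 \;\le\; (1+\epsilon)\,\|x_i - x_j\|^2
\quad \text{for all } i, j.
```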