Reflections on The Hardware Lottery
I recently finished reading Sara Hooker's widely acclaimed paper "The Hardware Lottery", and this post is my commentary on it. It is not a typical scientific paper full of mathematical expressions and experimental results; it reads more like an essay.
Sara argues that the ideas which succeed in computer science are not necessarily the ones that are orders of magnitude better than competing ideas in the same space. Sometimes an idea simply wins what she calls "the hardware lottery": at that point in time, it happened to align closely with the kind of hardware that was available.
Sara emphasizes another point in the paper: hardware designers rarely optimize their designs for a given approach unless it has a clear commercial case. This is largely because of the entry barrier to building new hardware; it takes a significant investment of time and capital to bring a new chip architecture to market. Why would large corporations take that risk unless they see a clear commercial path for the idea? The safest bet is to build what Sara calls "task-agnostic" hardware. That way you avoid optimizing your designs for some niche idea that might not stand the test of time or have a clear route to economic impact. Once an idea's effectiveness is well established, as is the case with deep neural networks today, chip designs start aligning themselves to those workloads because the incentives become much clearer.
When designing the next generation of hardware, designers want to bet on an idea that will remain relevant for at least a few years before becoming obsolete. The safest bet is usually whatever the community has reached a general consensus on. The caveat is that paradigm shifts tend to come from radical ideas that are orthogonal to the current consensus. As a result, most researchers are incentivized to conform to the status quo of what the basic building block should be rather than exploring an entirely new way of doing things. This shrinks the space of exploration: researchers are rewarded for incremental tweaks rather than for changing the foundational blocks on which the current approach stands. For example, matrix multiplication is the basic building block of the currently dominant approach to machine intelligence, and essentially all AI hardware is optimized for it. A researcher will feel much more friction trying to build an architecture whose foundational block is not matrix multiplication.
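To make that concrete, here is a minimal sketch (my own illustration, not code from the paper) showing that the forward pass of a fully connected layer is literally a matrix multiplication plus a bias, which is exactly the primitive today's accelerators are built to run fast. The layer sizes are arbitrary toy values.

```python
import torch

# Batch of activations flowing into one fully connected layer (toy sizes).
batch, d_in, d_out = 32, 512, 256
x = torch.randn(batch, d_in)

layer = torch.nn.Linear(d_in, d_out)
out_layer = layer(x)

# The same forward pass written explicitly: a matrix multiply plus a bias,
# the operation GPUs and TPUs are optimized to accelerate.
out_manual = x @ layer.weight.T + layer.bias

print(torch.allclose(out_layer, out_manual, atol=1e-6))  # True
```

Any architecture that cannot be expressed in terms of such dense matrix products gives up most of what this generation of hardware is good at.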
Sara also discusses how research directions are limited by the software stack. The absence of the right software abstractions creates friction in pursuing certain ideas. She covers this less elaborately than hardware, but definitely touches on it towards the end of the paper. One concrete example cited in the paper: languages like LISP and Prolog favoured the symbolic approaches to AI rather than connectionist approaches like deep neural networks. Today, by contrast, the existence of the right abstractions, Python and deep learning frameworks like PyTorch, dramatically lowers the barrier to running experiments, which is enormous leverage. By the same logic, there may be promising research directions right now for which no mature software abstractions exist, and that absence hinders their progress.
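To give a sense of how much leverage those abstractions provide, here is a toy sketch of defining and training a small network in PyTorch; the framework supplies automatic differentiation, optimizers, and GPU dispatch, so the whole experiment fits in a few lines. This is my own illustrative example with made-up sizes and random data, not anything from the paper.

```python
import torch
from torch import nn

# A tiny regression model; gradients and hardware kernels come for free.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

x, y = torch.randn(256, 10), torch.randn(256, 1)  # random toy data
for step in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()      # gradients computed automatically
    optimizer.step()     # parameter update
```

A researcher whose idea does not fit this mould has to build much of that scaffolding themselves before they can even run the first experiment.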
If we follow Sara's line of thought, it leads to the following conclusion. An idea at an extremely nascent stage, for which neither the right hardware nor the right software abstractions exist, can only succeed if it comes with a rigorous argument that it is better than current approaches, such that an implementation would merely manifest an improvement already established on paper. You cannot say "let's try this at scale and see if it works", because that alone does not align the incentives needed to make the hardware exist in the first place. This is essentially a chicken-and-egg problem: you need the hardware to prove the idea works at scale, but you cannot get the hardware until you prove the idea works at scale. In my conversations with Gemini, it offered an instance where even theoretical rigour was not enough to attract the community's interest: the mathematical foundations of universal approximation, the result that a sufficiently large neural network can approximate essentially any function to arbitrary accuracy, existed long before deep learning took off, but they did not matter to the community until the "AlexNet moment".
As per Sara, deep learning won the hardware lottery somewhat by fluke. Hardware offering massive parallelism, the GPU, was originally built for graphics rendering and gaming. It was later repurposed for training neural networks, and the empirical results it enabled, culminating in AlexNet's 2012 ImageNet win, made it undeniable that the connectionist approach works. After that moment, hardware started aligning more and more with this approach, as designers could clearly see its viability.
When you think about it, it is fascinating how much the presence of the right tools and infrastructure accelerates progress in a domain. After reading the paper, I had a long discussion with Gemini; a public link to the conversation thread is attached below.