Programming languages are a tool for thought
Programming languages are a tool for thought. Writing in Python is not purely a way for me to specify to the interpreter how I want my program to behave; it's also a useful medium for me to reason about and communicate programs.
How programming languages help you think
They map to program execution. Programming languages are the canonical specification of how execution of a program will happen, and therefore they're designed to map cleanly to how execution does actually happen (which is the thing we want to reason about). This isn't necessarily tied to any one programming language, but our mental model of execution originally came from reading and writing programs and thinking about the sequence of steps they will take and the effect on state.
They are built of abstractions, which help us reason with our tiny working memories. Humans can't keep many things in their head at once. While programming a video game, if you had to keep in mind every minute detail of the game (winding order of triangles, ...), it would be really freaking hard to move forward. But good software engineers know how to use abstractions, which allow them to bundle big chunks of logic into a box and then only think about the shape of the few boxes that are relevant to the problem at hand (forgetting all about the inside of the boxes and the boxes that don't matter right now).
They amortize the cost of building a deep understanding of computational primitives. When you first learned to program, you spent a fair amount of time stubbing your figurative toe while figuring out how the programs you wrote translated into actual execution. Why isn't the variable I created in that function accessible outside of it? When is this loop going to stop looping? But over time you grew good intuitions that you can now wield to answer all sorts of questions and solve many types of problems. When you have some behavior you want an algorithm to exhibit, you only need to consider a clearly defined set of possible primitives, which makes searching the space of possible programs infinitely easier.
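Those early questions get answered once and then pay off forever. A minimal Python sketch of the first one (the function and names here are made up for illustration):

```python
def count_vowels(word):
    total = 0  # `total` lives only inside this function's scope
    for ch in word:
        if ch in "aeiou":
            total += 1
    return total

print(count_vowels("abstraction"))  # 4 -- we get the value out via the return
try:
    print(total)  # the local variable is gone once the function returns
except NameError:
    print("`total` is not accessible out here")
```

Once you've internalized why this raises a NameError, you never have to puzzle over function-local scope again.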
Mechanistic interpretability could really use something like this!
The field of mechanistic interpretability aims to figure out the algorithms that neural networks are implementing. It's pretty clear at this point that neural networks are in fact learning algorithms that can be understood and are worth figuring out.
Here's how I see the case for developing representations of neural network computation as a tool for thought:
- Neural networks are implementing algorithms that are sometimes quite complex. There's a lot going on at once, and it's hard to hold it all in your head.
- The algorithms that neural networks learn are completely different from the types of algorithms programmers are comfortable with. Neural networks perform addition not by adding each digit together and performing carries, but by doing math with helices! We need a new set of mental model primitives to be able to reason about and communicate what models are doing more effectively.
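To make the helix point concrete, here's a toy sketch of the underlying idea, in the spirit of the "clock"-style algorithms that have been reverse engineered from trained networks (the period and the decoding scheme here are illustrative assumptions, not any real model's weights): represent each number as a point on a circle, so that adding numbers becomes composing rotations.

```python
import numpy as np

p = 10  # one period; real models stack several periods into a helix

def embed(a):
    # Represent a number as a point on a circle of period p
    angle = 2 * np.pi * a / p
    return np.array([np.cos(angle), np.sin(angle)])

def rotate_add(va, vb):
    # Adding numbers = composing rotations, via the angle-addition
    # identities for cos(x + y) and sin(x + y)
    ca, sa = va
    cb, sb = vb
    return np.array([ca * cb - sa * sb, sa * cb + ca * sb])

def decode(v):
    # Read out the answer: which number's embedding is closest?
    candidates = np.stack([embed(k) for k in range(p)])
    return int(np.argmax(candidates @ v))

assert decode(rotate_add(embed(3), embed(4))) == 7
assert decode(rotate_add(embed(8), embed(9))) == (8 + 9) % p
```

Notice there's no digit-wise carrying anywhere: the "carry" falls out of the geometry for free. That's exactly the kind of primitive our programmer intuitions don't prepare us for.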
This is hard for several reasons
But programming languages are designed by humans to allow abstraction and map cleanly to execution, the structure of which we also designed! Why should we expect there to exist nice representations for neural network computation waiting to be found that will expand our mental reach?
I don't know if we should 🤷. Here are some of the things that make this hard:
- Neural networks are super expressive. While we've designed programming languages to achieve their power through composability of a small set of clearly defined primitives, it's unclear whether learned neural networks converge to anything like this.
- Even if there are good primitives, they're not going to look like the primitives we know and love (if statements, discrete variable assignments, ...), and they're not going to compose the same way either. Neural network algorithms are extremely parallel, and often consist of many smaller independent algorithms ensembled together: what kind of representation will help us reason about that? Neural network algorithms are generally smooth, but also contain nonlinearities: what's the right level of abstraction here?
But I think it's somewhat possible
I think there is some hope:
- There do seem to be certain recurring patterns ("motifs"). Certain geometries used to encode particular types of information keep popping up, and models learn them partially because they support nice operations. Perhaps these operations are a good starting place when looking for possible primitives. Furthermore, the field has reverse engineered a wide selection of circuits that can serve as inspiration, as well as a test bed, while designing such a representation.
- There's a tradeoff between how high level our primitives are (higher yields more interpretability) and how much of the expressivity of neural networks they can cover. But perhaps we can have our muffin and eat it too by borrowing a trick from compilers and designing multiple representations at different levels of abstraction. The lower levels are closer to the literal operations the network performs (the matmuls, etc.), and hence can probably represent most or all of the model's computation; the tradeoff is that they won't be as enlightening as the higher levels. But for the bits of network computation that do conform to computational primitives we understand, we can employ a higher level, more understandable primitive.
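As a toy illustration of the same computation viewed at two levels of abstraction (an entirely made-up example, not a claim about any real network): a network can compute max(a, b) with one linear map plus a ReLU, via the identity max(a, b) = b + relu(a - b). A low-level representation records the matmul and the nonlinearity; a high-level one just says `max`.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

# Low-level view: the literal ops the network performs.
W = np.array([[1.0, -1.0]])  # linear map computing a - b

def low_level_max(a, b):
    h = relu(W @ np.array([a, b]))  # hidden activation: relu(a - b)
    return b + h[0]                 # residual-style add-back

# High-level view: the primitive we actually want to reason with.
def high_level_max(a, b):
    return max(a, b)

assert low_level_max(2.0, 5.0) == high_level_max(2.0, 5.0)
assert low_level_max(7.0, 3.0) == high_level_max(7.0, 3.0)
```

The low-level view covers whatever the weights happen to do; the high-level view only applies when the computation really is a max, but when it applies, it's the one you want to think in.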