Each time I teach neural nets to an engineer, there's only a 50% chance they can write down the chain rule. Colah's blog on backprop used to be my favorite resource to leave them with (https://colah.github.io/posts/2015-08-Backprop).
The explanation of the calculus in this tool is equally fantastic. And the art is very cute.
There are many ways to skin a cat, of course, but this is as good a tutorial as I've seen for getting you through backprop as fast as possible.
Given the current state of automatic differentiation I'm not so sure it's even necessary or particularly useful to focus on backpropagation any more.
While backprop has major historic significance, in the end it's essentially just a pure calculation which no longer needs to be done by hand.
Don't get me wrong, I still believe that understanding the gradient is hugely important, and conceptually it will always be essential to understand that one is optimizing a neural network by taking the derivative of the loss function, but backprop is not necessary nor is it particularly useful for modern neural networks (nobody is computing gradients by hand for transformers).
IMHO a better approach is to focus on a tool like JAX where taking a derivative is abstracted away cleanly enough, but at the same time you remain fully aware of all the calculus that is being done.
Especially for programmers, it's better to look at Neural Networks as just a specific application of Differentiable Programming. This makes them both easier to understand and also opens up a much broader class of problems the learner can solve with the same tools.
Backpropagation is a particular implementation of reverse mode auto-differentiation, and it is the basis for all implementations of DL models. It is very strange for me to read this as though it is very obvious and commonly accepted fact, which I don't think it is.
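To make "backprop is reverse mode auto-differentiation" concrete, here is a minimal sketch of a scalar reverse-mode engine in plain Python. The `Value` class and its method names are made up for illustration and are not any library's API; real frameworks work on tensors and are far more elaborate.

```python
import math


class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Post-order walk: every node is appended after the nodes it was built from.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Reversed order processes a node before the nodes it was built from.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad  # chain rule, accumulated


# One neuron: y = tanh(w * x + b), with arbitrary example numbers.
w, x, b = Value(0.5), Value(2.0), Value(-0.3)
y = (w * x + b).tanh()
y.backward()
print(w.grad, x.grad, b.grad)
```

The forward pass builds the graph as a side effect of ordinary arithmetic, and a single backward sweep fills in the gradient for every input at once, which is the property that makes reverse mode attractive when there are many parameters and one loss.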
> to read this as though it is very obvious and commonly accepted fact
I'm not entirely sure what you're referring to by "this" but assuming you mean my comment, I think what I'm saying is very much up for debate and not an "obvious and commonly accepted fact". Karpathy has a very reasonable argument that directly disagrees with what I'm suggesting [0]. Of course he also agrees that in practice nobody will ever use backprop directly.
Whether it's JAX, TF, PyTorch, etc the chain rule will be applied for you. I'm arguing that I think it's helpful to not have to worry about the details of how your derivative is being computed, and rather build an intuition about using derivatives as an abstraction. To be fair I think Karpathy is correct for people who are going to be learning to explicitly be experts in Neural Networks.
My point is more that given how powerful our tools today are for computing derivatives (I think JAX/Autograd have improved since Karpathy wrote that article), it's better to teach programmers to think of derivatives, gradients, hessians etc as high level abstractions, worrying less about how to compute them and more about how to use them. In this way thinking about modeling doesn't need to be restricted to strictly NNs. Rather, use NNs as an example and then demonstrate to the student that they are free to build any model by defining how the model predicts, scoring the prediction, and using the tools of calculus to answer other common questions they might have.
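As a rough illustration of "define how the model predicts, score the prediction, then let a tool hand you the derivative", here is a dependency-free sketch. The `grad` helper below is hypothetical: it uses central finite differences as a stand-in for a real autodiff tool such as JAX, and the data is made up.

```python
def grad(f, eps=1e-6):
    """Return a function approximating the gradient of f at a list of params.

    Hypothetical helper: central finite differences, standing in for an
    automatic-differentiation tool. It is not how JAX or PyTorch work inside.
    """
    def df(params):
        out = []
        for i in range(len(params)):
            up = list(params)
            down = list(params)
            up[i] += eps
            down[i] -= eps
            out.append((f(up) - f(down)) / (2 * eps))
        return out
    return df


# Made-up data that follows y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]


def predict(params, x):
    slope, intercept = params
    return slope * x + intercept


def loss(params):
    # Mean squared error between predictions and targets.
    return sum((predict(params, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


params = [0.0, 0.0]
dloss = grad(loss)
for _ in range(2000):
    g = dloss(params)
    params = [p - 0.05 * gi for p, gi in zip(params, g)]  # gradient descent step

print(params)  # should end up close to slope 2 and intercept 1
```

Nothing in the training loop knows that the model is a line; swapping `predict` for any other differentiable function leaves the rest unchanged, which is the point being argued.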
edit: a good analogy is logic programming and backtracking/unification. The entire point of logic programming is to abstract away backtracking. Sure experts in Prolog do need to understand backtracking, but it's more helpful to get beginners understanding how Prolog behaves than understand the details of backtracking.
but with backprop you do not worry about computing derivatives by hand. backprop and AD in general means you do not have to do that. maybe one of us is misunderstanding the other
i am saying that if you want to work with ML algorithms on a deeper level you must learn backprop
if you want to implement some models on the other hand, you can just follow a recipe approach
> there's only a 50% chance they can write down the chain rule
Why should I, though? I remember the concept from calculus. I know pytorch keeps track of the various stuff I do to a vector and calculates a gradient based on it. What more do I need to know when all I want to do is to play with applications, not implement backprop myself?
Certainly there's a lot you can do without understanding backprop - you can train pre-made architectures, you can put pre-made layers together to build your own architecture, you can tweak hyperparameters and improve your model's accuracy, and so on. But I also think you will eventually run into a problem that would be much easier to debug if you understand backprop. If your model isn't learning, and your tensorboard graphs show your gradient magnitude is through the roof, it'll be much easier to track that down if you have a strong conceptual model of how gradients are calculated and how they flow backwards through the network.
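A toy sketch of why gradient magnitude can go "through the roof": with stacked layers the chain rule multiplies the per-layer derivatives together, so the product grows or shrinks geometrically with depth. The layers here are just `y = weight * x`, an assumption chosen to keep the arithmetic obvious.

```python
def gradient_through_layers(weight, depth):
    """Gradient magnitudes seen while walking back through `depth` layers.

    Each layer computes y = weight * x, so its local derivative is `weight`
    and the chain rule gives a running product of those derivatives.
    """
    grad = 1.0
    history = []
    for _ in range(depth):
        grad *= weight
        history.append(abs(grad))
    return history


exploding = gradient_through_layers(weight=1.5, depth=20)
vanishing = gradient_through_layers(weight=0.5, depth=20)
print(exploding[-1], vanishing[-1])
```

With that picture in mind, a huge gradient norm on a dashboard points you towards the depth of the network, the scale of the weights, or the activation derivatives, rather than leaving you guessing.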
if you don't understand chain rule then you don't understand backprop, which means you do not really understand how deep learning works. at most you can follow recipes cook book style. it is kind of how one can make a website without a deep understanding of networking
This kind of argument can always go further by removing abstraction… you could argue that you don’t understand the chain rule without understanding the quotient and product rules which are used to prove the chain rule, and then once further with the epsilon-delta proofs required to get limits going for the derivative in the first place, and then continue until we reach some axioms.
I think you can accept that the chain rule is a thing, without understanding it, and then go further to understand its application to backprop.
Here I disagree with you pretty strongly. Once someone is comfortable with differentiable programming it's much more obvious how to build and optimize any type of model.
People should be more concerned about when to use derivatives, gradients, hessians, Laplace approximation etc rather than worry about the implementation details of these tools.
Abstraction can also aid depth of understanding. I know plenty of people who can implement backprop, but then don't understand how to estimate parameter uncertainty from the Hessian. The latter is much more important for general model building.
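For anyone wondering what "estimate parameter uncertainty from the Hessian" looks like in the simplest case, here is a one-parameter sketch. The data is made up, the noise level `sigma` is assumed known, and the second derivative comes from a finite-difference helper standing in for an autodiff tool.

```python
import math


def second_derivative(f, x, eps=1e-4):
    """Central finite-difference estimate of f''(x) (stand-in for an AD tool)."""
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / (eps * eps)


# Made-up observations, assumed Gaussian with known sigma = 2.0.
data = [4.1, 5.3, 6.2, 4.8, 5.6, 5.0, 4.4, 5.4]
sigma = 2.0


def neg_log_likelihood(mu):
    # Terms that do not depend on mu are dropped.
    return sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)


mu_hat = sum(data) / len(data)  # minimiser of the negative log-likelihood
hessian = second_derivative(neg_log_likelihood, mu_hat)
std_error = 1.0 / math.sqrt(hessian)  # Laplace approximation to the uncertainty

print(mu_hat, std_error, sigma / math.sqrt(len(data)))
```

In this textbook case the curvature-based answer matches the familiar `sigma / sqrt(n)` standard error; the appeal of the Hessian view is that the same recipe carries over to models where no closed form exists.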
i am not sure what you are disagreeing with. chain rule is basic calculus that precedes understanding hessians. my argument is, if you can not understand what the chain rule is, you will not understand more complicated mathematics in ML. do you think i am wrong ?
EDIT: also uncertainty estimation is the stuff of probabilistic approach to ML. i would say that people who do probabilistic ML are quite mathematically capable (at least to my experience)
> chain rule is basic calculus that precedes understanding hessians.
It doesn't have to be that way. The hessian is an abstract idea and the chain rule and more specifically backpropagation are methods of computing the results for an abstract idea. When I want the hessian I want a matrix of second order partial derivatives, I'm not interested in how those are computed.
For a more concrete example, would you say that using the quantile function for the normal distribution requires you to be able to implement it from scratch?
There are many, very smart, very knowledgeable people that correctly use the normal quantile function (inverse CDF) every day for essential quantitative computation that have absolutely no idea how to implement the inverse error function (an essential part of the normal quantile). Would you say that you don't really know statistics if you can't do this? That a beginner must understand the implementation details of the inverse error function before making any claims about normal quantiles? I myself would absolutely need to pull up a copy of Numerical Recipes to do this. It would be, in my opinion, ludicrous to say that anyone wanting to write statistical code should understand and be able to implement the normal quantile function. Maybe in 1970 that was true, but we have software to abstract that out for us.
The same is becoming true of backprop. I can simply call jax.grad on my implementation of loss of the forward pass of the NN I'm interested in and get the gradient of that function, the same way I can call scipy.stats.norm.ppf to get that quantile for a normal. All that is important is that you understand what the quantile function of the normal distribution means for you to use it correctly, and again I suspect there are many practicing statisticians that don't know how to implement this.
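The quantile call really is a one-liner. The comment names `scipy.stats.norm.ppf`; the sketch below uses the standard library's `statistics.NormalDist.inv_cdf` instead, purely so it runs without installing anything.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Upper bound of a central 95% interval for a standard normal.
q = standard_normal.inv_cdf(0.975)

# Feeding the quantile back through the CDF should recover the probability.
round_trip = standard_normal.cdf(q)

print(q, round_trip)
```

Using it correctly only requires knowing that the quantile function inverts the CDF, not how the inverse is approximated numerically.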
And to give you a bit of context, my view on this has developed from working with many people who can pass a calculus exam and perform the necessary steps to compute a derivative, yet have almost no intuition about what a derivative means and how to use it and reason about it. Calculus historically focused on computation over intuition because that was what was needed to do practical work with calculus. Today the computation can take second place to the intuition because we have powerful tools that can take care of all the computation for you.
> Today the computation can take second place to the intuition because we have powerful tools that can take care of all the computation for you.
and that tool is backprop. if you do not understand what the chain rule is and what it is doing, that tool will be magic to you and you are blindly trusting its correctness. seeing that a lot of risk is involved in using AI models in real life, blindly trusting your model is not a good approach
i agree that simply regurgitating rules of calculus is pointless to understanding. but that's definitely not what i mean when i talk about the need to understand the chain rule
ML is a mathematically intensive subject. there is no going around this fact
but that's my point. knowing how to compile a program does not make me a compiler engineer. in that sense feel free to use ML tools, but don't be fooled into thinking you will get a job as an ML engineer if you do not know what the chain rule is, or why we need to take a derivative in order to optimise a loss function. in fact, don't even be fooled into thinking you will get into a ML uni degree if you don't know what the chain rule is. i actually don't understand what is the problem. spend 10 minutes reading up on it and i am sure you will get it. i think an unwarranted phobia of mathematics is what is at play here
If by understand, you mean understand and not regurgitate it when asked as a trivia question - I agree with you. However, there are different interpretations of the chain rule.
After reading the tutorial I was not sure why it was called backpropagation. Thanks for the link to Colah's blog. I think the two links together explain things beautifully. As per Colah's blog, backpropagation seems to be just an optimization for the gradient calculation used in gradient descent.
Any recommendations for a 101 book for neural nets for someone who is "just a programmer"? OP's tutorial is quite nice, but I love to read books and find it easier to learn from them.
This is a valuable take. Fastai was very frustrating for me because I wanted to understand the internals. I ended up not finishing it, so take my opinion with a grain of salt.
I tried Fast AI, but it seems to be trying too hard to take out the math, which oddly for me (as a STEM grad) makes it much more difficult to understand.
Had to stop when I saw him using Excel spreadsheets to explain convolution.
I don't think it is a good idea to describe neural networks as a large graph of neurons interacting with each other. It is not really helpful to understand what is going on inside.
It is more useful to understand them as a series of transforms that bend and fold the input space, in order to place pairs of similar items close to each other. I would like to see people trying to illustrate that instead.
It also has the benefit of making the connection with linear algebra much easier to understand.
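Here is a small sketch of the "series of transforms" view, using the classic XOR problem. The weights are hand-picked for illustration rather than learned: the first transform folds the input plane so that the two "true" inputs land on the same hidden point, after which a plain linear read-out is enough.

```python
def relu(v):
    return max(0.0, v)


def hidden(x1, x2):
    """First transform: maps the input plane into a 2-D hidden space."""
    return (relu(x1 + x2), relu(x1 + x2 - 1.0))


def output(h):
    """Second transform: a linear read-out of the hidden space."""
    return h[0] - 2.0 * h[1]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
hidden_points = {p: hidden(*p) for p in points}
outputs = {p: output(hidden_points[p]) for p in points}

for p in points:
    print(p, "->", hidden_points[p], "->", outputs[p])
```

In the input space no single line separates the XOR classes; in the hidden space `(0, 1)` and `(1, 0)` coincide and the classes become linearly separable, which is the bending and folding being described.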
You really find large n-dimensional transforms easier to reason about and visualise, as opposed to layers of neurons with connections? You don’t find it much more intuitive to see it as a graph once you start adding recurrence, convolutions, sparsity, dropout, connections across multiple layers, etc., let alone coming up with new concepts?
I think it’s useful to understand it in both ways, but our intuitions about transforms are largely useless when the number of dimensions is high enough.
I think understanding how neural networks work is easiest if you think of them as networks. Reasoning about why they work is a lot easier thinking about them as transformations. It's not like you're actually picturing all the parameters of a nontrivial network one way or the other.
It's good to have both perspectives. Ideally you learn the layers-of-transforms version alongside the styled graph-of-neurons version. If you had to only pick one, which one you learn would depend a lot on what kind of student you are and what your goals are. I think the layers-of-transforms version is "less wrong" in general, but probably harder to understand, so it's maybe better if you had to learn just one.
Not the person you are answering to, but I think it's all about the level of abstraction you want to reason at. I didn't grok neural networks until I visualized the transformations that were happening in a very simple network. Once that made sense, I could start thinking in terms of layers.
Neural Networks are a graph, more specifically a Weighted Directed Graph.
They are also very much modeled after the brain, more specifically they originate from a 1943 paper by neurophysiologist Warren McCulloch and mathematician Walter Pitts who described how neurons in the brain might work by modeling a simple neural network.
Of course it’s not an accurate model, but it very much is based on early understanding of biological neurons.
The 1943 NN compu sci model corresponds to a 1943 brain model, sure. NNs in an ML sense aren't this model.
What does any modern NN compu-sci model correspond to?
You're making a genetic fallacy here: that since the origin of X had property P, so must X. Eg., marriage was historically a legal contract of ownership, so today it is a legal contract of ownership.
The brain has nothing akin to back-propagation, nor supervised training. "Neurones" are not "data about house prices". According to any even highly simplified neurological model, a compu sci model of this would compute with activations in a radically different manner than a dot-product over domain-specific data.
They are also not a weighted directed graph. In a sense almost any mathematical function is a weighted graph if one takes edges to be, eg., general function composition; and nodes to be input/output domains of those functions.
However to say a "NN is a network" is presumably meant to be more significant than this trivial observation that essentially "all functions" are.
In particular, everything which distinguishes a NN from any other mathematical function is an operation such as supervised training, backprop, regularization, pooling, etc. And these are not expressible as functions within this graph, but they are functions over the graph.
So any actual "network diagram" presented isn't a network. It's a bad diagram of a mathematical function which also includes functions over this diagram.
In any case, a neural network -- in the ML sense -- is neither neural nor a network. It is a function of a mathematical function that has no isomorphism to anything in the brain.
Back propagation etc. isn’t a property of a NN; it’s a means to generate a NN. You can argue there is some quantitative difference between modern neural networks but they hadn’t reached the point of trying to train a neural network, it was much more abstract.
Also, the brain does actually use a form of supervised learning. Pain for example is effectively a form of input labeling and lack of pain the wanted output.
If the neural network is just the piece-wise linear regression function that the NN algorithm generates -- then it's just a network like any other mathematical function.
And it's a pretty crappy mathematical function, with zero homology to the brain. Nothing in the brain is piece-wise linear.
I don't know what we're describing by "NN"; which is, in large part, exactly my point.
I was taking you to be flip-flopping on backwards/forwards feeding; rather than to be talking about recurrent loops.
What is meant to be meaningfully a graph? What definition of "neural network" are you using?
The NN model itself, even recurrent, is just a regression function. In the sense that a Turing machine is just an implementation of a function from the Naturals to Naturals -- yes, a regression function, is likewise, a program.
The thing which is "Turing complete" cannot possibly be any particular regression function, as that is simply one program. So it has to be "the function of the function" -- which isn't a graph.
So again: what exactly is a graph here? What is the definition of NN you're using?
The NN family of algorithms are just means of interpolating through compressed feature spaces; ie., they are just kernel machines which remember compressions of their training data and compute distances for predictions.
That process, like essentially any function-generator, can without much work be made Turing complete. This, like being a "universal function approximator", isn't an interesting property.
But either way, it still doesn't explain how any of this has anything to do with networks, brains, or anything like it.
I think what op means is this: A graph is mathematically a set of vertices (nodes) and a set of ordered or unordered tuples giving the edges (ties).
Now, sometimes you might have a weight on these edges, for example by specifying some function on the edge set.
However, it is difficult to see how a neural network that includes operations like sum, multiply and tanh, might be modeled this way.
How do you describe a dropout as a graph?
I think the argument is that a graph is not sufficient to describe a NN, so technically speaking a NN is not a graph. It is more. It has edges between x and f(x), but we also need to specify what f(x) is. The mathematical definition of a graph doesn't do that.
A weighted graph can include weights for both the vertices and edges. For example a network latency diagram may include the physical wires as separate from the router as the router latency may depend on network load. Similarly routers themselves have internal bandwidth limitations etc.
As to the rest separating the NN from everything needed to generate it is a useful distinction. You’re not going to generate a different f(x) by slightly changing the training set etc. It’s however a somewhat arbitrary distinction.
> NB. If it's a graph: write out the edge list (etc.) .
I don't understand what issue you are referring to.
For a dense network, each pair of adjacent layers forms a complete bipartite graph. In other words, edges are all pairs with one node in layer N and another in layer N+1.
CNNs and RNNs take a little more work, but still easy to describe the graph structure.
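For the dense case, writing out the edge list is mechanical. A minimal sketch, where the `(layer_index, unit_index)` node naming and the 3-4-2 layer sizes are arbitrary choices made for this example:

```python
from itertools import product


def dense_edges(layer_sizes):
    """Edge list of a fully connected feed-forward network.

    Nodes are (layer_index, unit_index) pairs; every unit in layer n is
    connected to every unit in layer n + 1, forming a complete bipartite
    graph between each pair of adjacent layers.
    """
    edges = []
    for n in range(len(layer_sizes) - 1):
        for i, j in product(range(layer_sizes[n]), range(layer_sizes[n + 1])):
            edges.append(((n, i), (n + 1, j)))
    return edges


edges = dense_edges([3, 4, 2])  # hypothetical 3-4-2 network
print(len(edges))
print(edges[:3])
```

Attaching one weight per edge gives exactly the entries of each layer's weight matrix; the biases and activation functions are not captured by the edge list on its own.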
I think op means that a graph is not sufficient to describe a NN. If a layer is Y=XB, then you draw that as a set of nodes Y and individual weights b_ij as edge-weights from X. Right.
But can you describe things like concat, max-pooling, attention etc. without changing the meaning of the edges?
Or do you have to annotate edges to now mean "apply function here"? If so, op probably wants to say that you are describing more than a graph. There's a graph there, but you need more, you need elaborate descriptions of what edges do. In that case, op could be correct to say that technically, NN are not graphs.
Or, perhaps NN can generally be represented by vertices and edge lists. It certainly isn't the usual way to draw them, though.
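One way to see what "more than a graph" means in practice: keep a plain vertex list with incoming edges, and attach an operation label to each vertex. The tiny graph, the operation names and the input values below are all invented for illustration; this is roughly the shape of a computation graph, not any framework's actual data structure.

```python
import math

# Each vertex: name -> (operation, list of input vertex names).
# Edges only say "feeds into"; the operation stored on the vertex says
# what to do with the incoming values.
graph = {
    "x1": ("input", []),
    "x2": ("input", []),
    "w": ("input", []),
    "prod": ("mul", ["x1", "w"]),
    "pool": ("max", ["prod", "x2"]),
    "out": ("tanh", ["pool"]),
}

OPS = {
    "mul": lambda a, b: a * b,
    "max": lambda *args: max(args),
    "tanh": lambda a: math.tanh(a),
}


def evaluate(graph, inputs, target):
    """Evaluate `target` by recursively evaluating the vertices it depends on."""
    op, parents = graph[target]
    if op == "input":
        return inputs[target]
    return OPS[op](*(evaluate(graph, inputs, p) for p in parents))


result = evaluate(graph, {"x1": 2.0, "x2": 0.3, "w": 0.25}, "out")
print(result)
```

Strip the `OPS` table and the labels and what remains is an ordinary directed graph that no longer determines the function, which seems to be the distinction both sides are circling.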
Hi everyone! I made this thing! I'm glad you all like it :) This is actually my first time using javascript, so if there's any issues please let me know and I'll do my best to fix them.
Yeah if you go back to my first commits, the first month is mostly me getting very excited about drawing a square haha. I can’t recommend the PIXI graphics library enough, it was super easy to pick up.
While most of the random starting weights converged quickly, this one got stuck with a fairly incorrect worldview, so to speak:

Is it overfitting to say the same is true for humans, where a brain's starting weights and early experiences may make it much more difficult to achieve an accurate model?
If anyone is interested, here is a simple symbol recognizer using backpropagation I wrote in Python a while ago with the help of the book "Make Your Own Neural Network" by Tariq Rashid. Numpy is a great help with matrix calculations.
For those who want to take this a bit further I can't recommend cs231n enough. Especially when it was taught by Andrej Karpathy. I believe the lectures may still be up on YouTube. Andrej really has a knack for teaching.