Neural Networks from Scratch

minihat · on Oct 11, 2021

Each time I teach neural nets to an engineer, there's only a 50% chance they can write down the chain rule. Colah's blog on backprop used to be my favorite resource to leave them with (https://colah.github.io/posts/2015-08-Backprop).

The explanation of the calculus in this tool is equally fantastic. And the art is very cute.

There are many ways to skin a cat, of course, but this is as good a tutorial as I've seen for getting you through backprop as fast as possible.

baron_harkonnen · on Oct 11, 2021

Given the current state of automatic differentiation I'm not so sure it's even necessary or particularly useful to focus on backpropagation any more.

While backprop has major historic significance, in the end it's essentially just a pure calculation which no longer needs to be done by hand.

Don't get me wrong, I still believe that understanding the gradient is hugely important, and conceptually it will always be essential to understand that one is optimizing a neural network by taking the derivative of the loss function, but backprop is not necessary nor is it particularly useful for modern neural networks (nobody is computing gradients by hand for transformers).

IMHO a better approach is to focus on a tool like JAX where taking a derivative is abstracted away cleanly enough, but at the same time you remain fully aware of all the calculus that is being done.

Especially for programmers, it's better to look at Neural Networks as just a specific application of Differentiable Programming. This makes them both easier to understand and also opens up to the learner a much broader class of problems they can solve with the same tools.

medo-bear · on Oct 11, 2021

Backpropagation is a particular implementation of reverse mode auto-differentiation, and it is the basis for all implementations of DL models. It is very strange for me to read this as though it is very obvious and commonly accepted fact, which I don't think it is.

baron_harkonnen · on Oct 11, 2021

> to read this as though it is very obvious and commonly accepted fact

I'm not entirely sure what you're referring to by "this" but assuming you mean my comment, I think what I'm saying is very much up for debate and not an "obvious and commonly accepted fact". Karpathy has a very reasonable argument that directly disagrees with what I'm suggesting [0]. Of course he also agrees that in practice nobody will ever use backprop directly.

Whether it's JAX, TF, PyTorch, etc the chain rule will be applied for you. I'm arguing that I think it's helpful to not have to worry about the details of how your derivative is being computed, and rather build an intuition about using derivatives as an abstraction. To be fair I think Karpathy is correct for people who are going to be learning to explicitly be experts in Neural Networks.

My point is more that given how powerful our tools today are for computing derivatives (I think JAX/Autograd have improved since Karpathy wrote that article), it's better to teach programmers to think of derivatives, gradients, hessians etc as high level abstractions, worrying less about how to compute them and more about how to use them. In this way thinking about modeling doesn't need to be restricted to strictly NNs. Rather, use NNs as an example and then demonstrate to the student that they are free to build any model by defining how the model predicts, scoring the prediction and using the tools of calculus to answer other common questions they might have.
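A minimal sketch of that "predict, score, differentiate" recipe on a model that is not a neural network. Everything here is made up for illustration: the data, the names `predict`, `score` and `derivative`, and the learning rate. The `derivative` helper uses a central finite difference purely as a stand-in for a tool like jax.grad; it is not how autodiff works internally.

```python
def predict(w, x):
    # How the model predicts: a line through the origin with slope w.
    return w * x


def score(w, data):
    # How a prediction is scored: mean squared error over (x, y) pairs.
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)


def derivative(f, w, eps=1e-6):
    # Hypothetical stand-in for an autodiff tool: central finite difference.
    return (f(w + eps) - f(w - eps)) / (2 * eps)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up points on y = 2x
w = 0.0
for _ in range(200):
    # Plain gradient descent: step against the derivative of the score.
    w -= 0.05 * derivative(lambda v: score(v, data), w)
```

After the loop `w` should sit very close to 2, the slope the made-up data was generated with. Swapping in a different `predict` or `score` leaves the optimisation loop unchanged.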

edit: a good analogy is logic programming and backtracking/unification. The entire point of logic programming is to abstract away backtracking. Sure experts in Prolog do need to understand backtracking, but it's more helpful to get beginners understanding how Prolog behaves than understand the details of backtracking.

[0] https://karpathy.medium.com/yes-you-should-understand-backpr...

medo-bear · on Oct 11, 2021

but with backprop you do not worry about computing derivatives by hand. backprop and AD in general means you do not have to do that. maybe one of us is misunderstanding the other

i am saying that if you want to work with ML algorithms on a deeper level you must learn backprop

if you want to implement some models on the other hand, you can just follow a recipe approach

matsemann · on Oct 11, 2021

> there's only a 50% chance they can write down the chain rule

Why should I, though? I remember the concept from calculus. I know pytorch keeps track of the various stuff I do to a vector and calculates a gradient based on it. What more do I need to know when all I want to do is to play with applications, not implement backprop myself?

Imnimo · on Oct 11, 2021

Certainly there's a lot you can do without understanding backprop - you can train pre-made architectures, you can put pre-made layers together to build your own architecture, you can tweak hyperparameters and improve your model's accuracy, and so on. But I also think you will eventually run into a problem that would be much easier to debug if you understand backprop. If your model isn't learning, and your tensorboard graphs show your gradient magnitude is through the roof, it'll be much easier to track that down if you have a strong conceptual model of how gradients are calculated and how they flow backwards through the network.
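One way to see why gradient magnitudes can go "through the roof" is a toy chain of scalar layers. This is only a sketch under simplifying assumptions: each layer just multiplies by the same weight, and the function name `input_gradient` is hypothetical.

```python
def input_gradient(weight, depth):
    """d(output)/d(input) for a chain h_{k+1} = weight * h_k, via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        grad *= weight  # each layer contributes its local derivative
    return grad


exploding = input_gradient(1.5, 50)  # local derivatives above 1 compound
vanishing = input_gradient(0.5, 50)  # local derivatives below 1 shrink away
```

With 50 layers the first value is in the hundreds of millions and the second is vanishingly small, which is the same multiplicative effect that shows up as exploding or vanishing gradients in real networks.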

ZitchDog · on Oct 11, 2021

I could be wrong but I think it’s possible to understand backprop without being able to apply the chain rule from memory.

medo-bear · on Oct 11, 2021

backpropagation IS the chain rule ... with book keeping
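To make the "chain rule with book keeping" idea concrete, here is a minimal sketch of scalar reverse-mode differentiation in plain Python. The `Value` class and its methods are hypothetical, written for illustration only; they are not the API of any library mentioned in this thread.

```python
import math


class Value:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # book keeping: inputs of this node
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Order the graph so each node is processed after all of its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node.parents, node.local_grads):
                parent.grad += local * node.grad  # the chain rule


w, x, b = Value(0.5), Value(2.0), Value(-0.3)
y = (w * x + b).tanh()
y.backward()
```

After `backward()`, `w.grad`, `x.grad` and `b.grad` hold the derivatives of `y` with respect to each input. The forward pass records parents and local derivatives; the backward pass only multiplies and accumulates them.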

medo-bear · on Oct 11, 2021

if you don't understand the chain rule then you don't understand backprop, which means you do not really understand how deep learning works. at most you can follow recipes cook book style. it is kind of how one can make a website without a deep understanding of networking

kurikuri · on Oct 12, 2021

This kind of argument can always go further by removing abstraction… you could argue that you don’t understand the chain rule without understanding the quotient and product rules which are used to prove the chain rule, and then once further with the epsilon-delta proofs required to get limits going for the derivative in the first place, and then continue until we reach some axioms.

I think you can accept that the chain rule is a thing, without understanding it, and then go further to understand its application to backprop.

baron_harkonnen · on Oct 11, 2021

> at most you can follow recipes cook book style.

Here I disagree with you pretty strongly. Once someone is comfortable with differentiable programming it's much more obvious how to build and optimize any type of model.

People should be more concerned about when to use derivatives, gradients, hessians, Laplace approximation etc rather than worry about the implementation details of these tools.

Abstraction can also aid depth of understanding. I know plenty of people who can implement backprop, but then don't understand how to estimate parameter uncertainty from the Hessian. The latter is much more important for general model building.
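As a rough sketch of what "parameter uncertainty from the Hessian" can look like in the simplest case, consider a single Bernoulli parameter. The data (30 successes in 100 trials) and the helper names are made up, and the second derivative is taken by finite differences as a stand-in for an autodiff Hessian.

```python
import math


def nll(p, k=30, n=100):
    """Negative log-likelihood of k successes in n Bernoulli trials (made-up data)."""
    return -(k * math.log(p) + (n - k) * math.log(1.0 - p))


def second_derivative(f, x, eps=1e-4):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + eps) - 2.0 * f(x) + f(x - eps)) / (eps * eps)


p_hat = 30 / 100                          # maximum-likelihood estimate
hessian = second_derivative(nll, p_hat)   # 1x1 "Hessian" of the NLL at the optimum
std_err = math.sqrt(1.0 / hessian)        # Laplace approximation: variance ~ inverse Hessian
```

For this model the result should agree closely with the textbook standard error sqrt(p(1-p)/n). The same recipe (curvature of the loss at the optimum, inverted) carries over to models with many parameters, where the Hessian is a matrix.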

medo-bear · on Oct 11, 2021

i am not sure what you are disagreeing with. chain rule is basic calculus that precedes understanding hessians. my argument is, if you can not understand what the chain rule is, you will not understand more complicated mathematics in ML. do you think i am wrong ?

EDIT: also uncertainty estimation is the stuff of the probabilistic approach to ML. i would say that people who do probabilistic ML are quite mathematically capable (at least in my experience)

baron_harkonnen · on Oct 11, 2021

> chain rule is basic calculus that precedes understanding hessians.

It doesn't have to be that way. The hessian is an abstract idea and the chain rule and more specifically backpropagation are methods of computing the results for an abstract idea. When I want the hessian I want a matrix of second order partial derivatives, I'm not interested in how those are computed.

For a more concrete example, would you say that using the quantile function for the normal distribution requires you to be able to implement it from scratch?

There are many, very smart, very knowledgeable people that correctly use the normal quantile function (inverse CDF) every day for essential quantitative computation that have absolutely no idea how to implement the inverse error function (an essential part of the normal quantile). Would you say that you don't really know statistics if you can't do this? That a beginner must understand the implementation details of the inverse error function before making any claims about normal quantiles? I myself would absolutely need to pull up a copy of Numerical Recipes to do this. It would be, in my opinion, ludicrous to say that anyone wanting to write statistical code should understand and be able to implement the normal quantile function. Maybe in 1970 that was true, but we have software to abstract that out for us.

The same is becoming true of backprop. I can simply call jax.grad on my implementation of loss of the forward pass of the NN I'm interested in and get the gradient of that function, the same way I can call scipy.stats.norm.ppf to get that quantile for a normal. All that is important is that you understand what the quantile function of the normal distribution means for you to use it correctly, and again I suspect there are many practicing statisticians that don't know how to implement this.
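The analogy can be sketched with the standard library alone, assuming Python 3.8 or later. Here `NormalDist().inv_cdf` plays the role of scipy.stats.norm.ppf, and a hypothetical `numerical_grad` helper stands in for a gradient tool whose internals you do not need to know. It uses finite differences, which is an assumption made to keep the example dependency-free and is not how jax.grad computes gradients.

```python
from statistics import NormalDist


def numerical_grad(f, params, eps=1e-6):
    """Hypothetical stand-in for a gradient tool: central finite differences."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def loss(params):
    # Squared error of a one-input linear model on a single made-up data point.
    w, b = params
    x, target = 3.0, 1.0
    return (w * x + b - target) ** 2


z = NormalDist().inv_cdf(0.975)       # quantile of the standard normal, no erfinv in sight
g = numerical_grad(loss, [0.5, 0.0])  # gradient of the loss w.r.t. (w, b)
```

In both calls the caller only needs to know what the result means: `z` is the familiar value near 1.96, and `g` is the direction in which the loss increases fastest.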

And to give you a bit of context, my view on this has developed from working with many people who can pass a calculus exam and perform the necessary steps to compute a derivative, but yet have almost no intuition about what a derivative means and how to use it and reason about it. Calculus historically focused on computation over intuition because that was what was needed to do practical work with calculus. Today the computation can take second place to the intuition because we have powerful tools that can take care of all the computation for you.

medo-bear · on Oct 11, 2021

> Today the computation can take second place to the intuition because we have powerful tools that can take care of all the computation for you.

and that tool is backprop. if you do not understand what the chain rule is and what it is doing, that tool will be magic to you and you are blindly trusting its correctness. seeing that a lot of risk is involved in using AI models in real life, blindly trusting your model is not a good approach

i agree that simply regurgitating rules of calculus is pointless to understanding. but that's definitely not what i mean when i talk about the need to understand the chain rule

ML is a mathematically intensive subject. there is no going around this fact

mpfundstein · on Oct 12, 2021

do you know all the assembler instructions your pc/mac carried out for you in order to post this text on hn? i guess not

medo-bear · on Oct 12, 2021

but that's my point. knowing how to compile a program does not make me a compiler engineer. in that sense feel free to use ML tools, but don't be fooled into thinking you will get a job as an ML engineer if you do not know what the chain rule is, or why we need to take a derivative in order to optimise a loss function. in fact, don't even be fooled into thinking you will get into a ML uni degree if you don't know what the chain rule is. i actually don't understand what is the problem. spend 10 minutes reading up on it and i am sure you will get it. i think an unwarranted phobia of mathematics is what is at play here

tchalla · on Oct 11, 2021

> my argument is, if you can not understand what the chain rule is, you will not understand more complicated mathematics in ML.

Are you sure about this?

medo-bear · on Oct 11, 2021

yes. in europe admission into an ML-type masters degree lists all three standard levels of mathematical analysis as a bare minimum for application

tchalla · on Oct 11, 2021

If by understand, you mean understand and not regurgitate it when asked as a trivia question - I agree with you. However, there are different interpretations of the chain rule.

thewarrior · on Oct 11, 2021

Are there any books that teach differentiable programming ?

medo-bear · on Oct 11, 2021

not books but there are quite a few interesting and accessible papers. here is one

Pearlmutter, B.A. and Siskind, J.M., "Reverse-Mode AD in a Functional Framework: Lambda the Ultimate Backpropagator"

http://www.bcl.hamilton.ie/~qobi/stalingrad/

jhgb · on Oct 11, 2021

> there's only a 50% chance they can write down the chain rule

I blame the common mathematical notation for that.

friebetill · on Oct 11, 2021

I found this 13 min explanation very helpful in understanding backpropagation (https://youtu.be/c36lUUr864M?t=2520).

First he explains the necessary concepts:

1) Chain Rule

2) Computational Graph

Then he explains backpropagation in these three steps (first in general and then with examples):

1) Forward pass: Compute loss

2) Compute local gradients

3) Backward pass: Compute dLoss/dWeights using the Chain Rule
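Those three steps can be sketched on a single linear neuron with squared error. The numbers and variable names below are made up for illustration and are not taken from the video.

```python
# Made-up numbers for a single linear neuron: y_hat = w * x + b
x, y = 2.0, 1.0
w, b = 0.5, 0.1

# 1) Forward pass: compute the loss, keeping the intermediate values.
y_hat = w * x + b
err = y_hat - y
loss = err ** 2

# 2) Local gradients: derivative of each step with respect to its own inputs.
dloss_derr = 2 * err
derr_dyhat = 1.0
dyhat_dw = x
dyhat_db = 1.0

# 3) Backward pass: multiply local gradients along each path (chain rule).
dloss_dw = dloss_derr * derr_dyhat * dyhat_dw
dloss_db = dloss_derr * derr_dyhat * dyhat_db
```

Here `dloss_dw` comes out at roughly 0.4 and `dloss_db` at roughly 0.2. A larger network repeats the same pattern with more intermediate values and more paths to sum over.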

ThinkingAgain · on Oct 12, 2021

After reading the tutorial I was not sure why it was called backpropagation. Thanks for the link to Colah's blog. I think the two links together explain things beautifully. Backpropagation seems to be just an optimization for the gradient descent calculation, as per Colah's blog.

shaan7 · on Oct 11, 2021

Any recommendations for a 101 book for neural nets for someone who is "just a programmer"? OP's tutorial is quite nice, but I love to read books and find it easier to learn from them.

jacobcmarshall · on Oct 11, 2021

Deep Learning with Python by Chollet is an excellent beginner resource if you are a hands-on learner.

It starts off with some tutorials using the Keras library, and then gets into the math later on.

By the end of the book, you create multiple different types of neural networks for identifying images, text, and more! I highly recommend it.

aeg42x · on Oct 11, 2021

I highly recommend http://neuralnetworksanddeeplearning.com/ it’s an online book that has some great code examples built in.

bernulli · on Oct 12, 2021

I also found his ‘visual proof’ for neural networks as general function approximators super intuitive.

wesleywt · on Oct 11, 2021

Fastai has a course: Practical Deep Learning for Coders.

carom · on Oct 11, 2021

There are also coursera specializations from Andrew Ng at https://deeplearning.ai.

matsemann · on Oct 11, 2021

Ng's course is bottom up: Start with the basic math, expand upon it, until you arrive at ML and neural nets.

Fastai is top down: learn to use practical ML with abstractions, and then dig deeper and explain as needed.

I preferred fastai's approach, even though I enjoyed both. Ng's could be a bit too low level and fundamental for what I wanted to learn.

carom · on Oct 11, 2021

This is a valuable take. Fastai was very frustrating for me because I wanted to understand the internals. I ended up not finishing it, so take my opinion with a grain of salt.

windsignaling · on Oct 11, 2021

I much prefer Andrew Ng's courses as well.

I tried Fast AI, but it seems to be trying too hard to take out the math, which oddly for me (as a STEM grad) makes it much more difficult to understand.

Had to stop when I saw him using Excel spreadsheets to explain convolution.

knicholes · on Oct 11, 2021

The Excel spreadsheets are where it all clicked for me. It's not done with some magical library but with simple math being performed on some weights.

carom · on Oct 11, 2021

There is a book called neural networks from scratch at https://nnfs.io.

shaan7 · on Oct 13, 2021

Thanks for the recommendations folks! :)

bnegreve · on Oct 11, 2021

I don't think it is a good idea to describe neural networks as a large graph of neurons interacting with each other. It is not really helpful to understand what is going on inside.

It is more useful to understand them as a series of transforms that bend and fold the input space, in order to place pairs of similar items close to each other. I would like to see people trying to illustrate that instead.
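A small sketch of the "bend and fold" view, using XOR. The weights are hand-picked rather than trained, and the function names are hypothetical. The hidden layer folds the plane so that the two points of the same class land on top of each other, after which one linear readout separates the classes.

```python
def relu(v):
    return max(0.0, v)


def hidden(x1, x2):
    """Hand-picked (not trained) layer that folds the plane along x1 + x2."""
    return (relu(x1 + x2), relu(x1 + x2 - 1.0))


def output(h):
    # A single linear readout in the transformed space.
    return h[0] - 2.0 * h[1]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
hidden_points = [hidden(*p) for p in points]   # where each input lands after the fold
xor_values = [output(h) for h in hidden_points]
```

In the input space no single line separates the XOR classes. In the hidden space, (0, 1) and (1, 0) map to the same point, and the readout returns 0, 1, 1, 0 for the four inputs.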

It also has the benefit of making the connection with linear algebra much easier to understand.

joefourier · on Oct 11, 2021

You really find large n-dimensional transforms easier to reason about and visualise, as opposed to layers of neurons with connections? You don’t find it much more intuitive to see it as a graph once you start adding recurrence, convolutions, sparsity, dropout, connections across multiple layers, etc., let alone coming up with new concepts?

I think it’s useful to understand it in both ways, but our intuitions about transforms are largely useless when the number of dimensions is high enough.

ravi-delia · on Oct 11, 2021

I think understanding how neural networks work is easiest if you think of them as networks. Reasoning about why they work is a lot easier thinking about them as transformations. It's not like you're actually picturing all the parameters of a nontrivial network one way or the other.

nerdponx · on Oct 11, 2021

It's good to have both perspectives. Ideally you learn the layers-of-transforms version alongside the stylized graph-of-neurons version. If you had to only pick one, which one you learn would depend a lot on what kind of student you are and what your goals are. I think the layers-of-transforms version is "less wrong" in general, but probably harder to understand, so it's maybe better if you had to learn just one.

farresito · on Oct 11, 2021

Not the person you are answering to, but I think it's all about the level of abstraction you want to reason at. I didn't grok neural networks until I visualized the transformations that were happening in a very simple network. Once that made sense, I could start thinking in terms of layers.

farresito · on Oct 11, 2021

Totally agree with you. The article that opened my eyes was this[0] one. This[1] video is also very good.

[0] https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

[1] https://www.youtube.com/watch?v=e5xKayCBOeU

mjburgess · on Oct 11, 2021

A Neural Network isn't a graph in any case, and isn't based on the brain.

As you said, it's a sequence of transformations.

NB. If it's a graph: write out the edge list (etc.) .

NNs are diagrammed as graphs, but this is highly misleading.

Retric · on Oct 11, 2021

Neural Networks are a graph, more specifically a weighted directed graph.

They are also very much modeled after the brain, more specifically they originate from a 1943 paper by neurophysiologist Warren McCulloch and mathematician Walter Pitts who described how neurons in the brain might work by modeling a simple neural network.

Of course it’s not an accurate model, but it very much is based on early understanding of biological neurons.

mjburgess · on Oct 12, 2021

The 1943 NN compu sci model corresponds to a 1943 brain model, sure. NNs in an ML sense aren't this model.

What does any modern NN compu-sci model correspond to?

You're making a genetic fallacy here: that since the origin of X had property P, so must X. Eg., marriage was historically a legal contract of ownership, so today it is a legal contract of ownership.

The brain has nothing akin to back-propagation, nor supervised training. "Neurones" are not "data about house prices". According to any even highly simplified neurological model, a compu sci model of this, would compute with activations in a radically different manner than a dot-product over domain-specific data.

They are also not a weighted directed graph. In a sense almost any mathematical function is a weighted graph if one takes edges to be, eg., general function composition; and nodes to be input/output domains of those functions.

However to say a "NN is a network" is presumably meant to be more significant than this trivial observation that essentially "all functions" are.

In particular everything which distinguishes a NN from any mathematical functions are operations such as supervised training, backprop, regularization, pooling, etc. And these are not expressible as functions within this graph, but they are functions over the graph.

So any actual "network diagram" presented isn't a network. It's a bad diagram of a mathematical function which also includes functions over this diagram.

In any case, a neural network -- in the ML sense -- is neither neural nor a network. It is a function of a mathematical function that has no isomorphism to anything in the brain.

Retric · on Oct 12, 2021

Back propagation etc isn’t a property of a NN it’s a means to generate a NN. You can argue there is some quantitative difference between modern neural networks but they hadn’t reached the point of trying to train a neural network, it was much more abstract.

Also, the brain does actually use a form of supervised learning. Pain for example is effectively a form of input labeling and lack of pain the wanted output.

mjburgess · on Oct 12, 2021

If the neural network is just the piece-wise linear regression function that the NN algorithm generates -- then it's just a network like any other mathematical function.

And it's a pretty crappy mathematical function, with zero homology to the brain. Nothing in the brain is piece-wise linear.

Retric · on Oct 13, 2021

NNs are more than just feed forward NNs. Assuming infinite precision, a finite NN can be Turing complete.

mjburgess · on Oct 13, 2021

So is backprop a property or not?

> Back propagation etc isn’t a property of a NN it’s a means to generate a NN.

You are the one claiming it isn't

Retric · on Oct 14, 2021

No, I don’t think you understand the terminology.

Feed forward NN means a NN without loops. Back propagation is a method of generating a new NN that only works on a subset of NN.

I can generate a random number by rolling a six-sided die once, just not any possible number.

mjburgess · on Oct 14, 2021

I don't know what we're describing by "NN"; which is, in large part, exactly my point.

I was taking you to be flip-flopping on backwards/forwards feeding; rather than to be talking about recurrent loops.

What is meant to be meaningfully a graph? What definition of "neural network" are you using?

The NN model itself, even recurrent, is just a regression function. In the sense that a Turing machine is just an implementation of a function from the Naturals to Naturals -- yes, a regression function, is likewise, a program.

The thing which is "Turing complete" cannot possibly be any particular regression function, as that is simply one program. So it has to be "the function of the function" -- which isn't a graph.

So again: what exactly is a graph here? What is the definition of NN you're using?

The NN family of algorithms are just means of interpolating through compressed feature spaces; ie., they are just kernel machines which remember compressions of their training data and compute distances for predictions.

That process, like essentially any function-generator, can without much work be made Turing complete. This, like being a "universal function approximator", isn't an interesting property.

But either way, it still doesn't explain how any of this has anything to do with networks, brains, or anything like it.

jonnycomputer · on Oct 11, 2021

Yeah, I very much don't understand OP's argument. And it's trivial to write out the nodes and edges (at least for trivially sized neural networks).

zwaps · on Oct 11, 2021

I think what op means is this: A graph is mathematically a set of vertices (nodes) and a set of ordered or unordered tuples giving the edges (ties). Now, sometimes you might have a weight on these edges, for example by specifying some function on the edge set.

However, it is difficult to see how a neural network that includes operations like sum, multiply and tanh, might be modeled this way. How do you describe a dropout as a graph?

I think the argument is that a graph is not sufficient to describe a NN, so technically speaking a NN is not a graph. It is more. It has edges between x and f(x), but we also need to specify what f(x) is. The mathematical definition of a graph doesn't do that.
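A sketch of that distinction: the weighted edge set alone does not say what a node computes, so the example below keeps a separate table of per-node functions. The node names, weights and the `forward` helper are all made up for illustration.

```python
import math

# Weighted directed edges: (source, target) -> weight. Made-up numbers.
edges = {("x1", "h"): 0.5, ("x2", "h"): -1.0, ("h", "y"): 2.0}

# Extra annotation the bare graph lacks: what each non-input node computes.
node_fn = {"h": math.tanh, "y": lambda s: s}


def forward(inputs, order=("h", "y")):
    """Evaluate the annotated graph; `order` lists non-input nodes topologically."""
    values = dict(inputs)
    for node in order:
        # Weighted sum over the edges that point into this node.
        total = sum(w * values[src] for (src, dst), w in edges.items() if dst == node)
        values[node] = node_fn[node](total)
    return values


out = forward({"x1": 1.0, "x2": 0.25})
```

Replacing `math.tanh` in `node_fn` changes the function being computed without touching `edges` at all, which is one way to read the claim that the graph by itself is not the whole network.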

Retric · on Oct 11, 2021

A weighted graph can include weights for both the vertices and edges. For example a network latency diagram may include the physical wires as separate from the router as the router latency may depend on network load. Similarly routers themselves have internal bandwidth limitations etc.

As to the rest separating the NN from everything needed to generate it is a useful distinction. You’re not going to generate a different f(x) by slightly changing the training set etc. It’s however a somewhat arbitrary distinction.

laGrenouille · on Oct 11, 2021

> NB. If it's a graph: write out the edge list (etc.) .

I don't understand what issue you are referring to.

For a dense network, each pair of adjacent layers forms a complete bipartite graph. In other words, edges are all pairs with one node in layer N and another in layer N+1.
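Written out as code, the edge list for a dense feed-forward network might look like the sketch below. The function name `dense_edge_list` and the (layer, index) naming of nodes are assumptions made for the example.

```python
from itertools import product


def dense_edge_list(layer_sizes):
    """Edges of a fully connected feed-forward net; nodes are (layer, index) pairs."""
    edges = []
    for layer, (n_in, n_out) in enumerate(zip(layer_sizes, layer_sizes[1:])):
        # Complete bipartite graph between layer N and layer N+1.
        for i, j in product(range(n_in), range(n_out)):
            edges.append(((layer, i), (layer + 1, j)))
    return edges


edges = dense_edge_list([2, 3, 1])  # 2 inputs, 3 hidden units, 1 output
```

For layer sizes 2, 3, 1 this yields 2*3 + 3*1 = 9 edges. Attaching a weight to each pair turns the list into the weighted directed graph discussed above.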

CNNs and RNNs take a little more work, but still easy to describe the graph structure.

zwaps · on Oct 11, 2021

I think op means that a graph is not sufficient to describe a NN. If a layer is Y=XB, then you draw that as a set of nodes Y, with the individual weights b_ij as edge weights from X. Right.

But can you describe things like concat, max-pooling, attention etc. without changing the meaning of the edges? Or do you have to annotate edges to now mean "apply function here"? If so, op probably wants to say that you are describing more than a graph. There's a graph there, but you need more, you need elaborate descriptions of what edges do. In that case, op could be correct to say that technically, NN are not graphs.

Or, perhaps NN can generally be represented by vertices and edge lists. It certainly isn't the usual way to draw them, though.

aeg42x · on Oct 11, 2021

Hi everyone! I made this thing! I'm glad you all like it :) This is actually my first time using javascript, so if there are any issues please let me know and I'll do my best to fix them.

windsignaling · on Oct 11, 2021

"first time using javascript"

Impressive. I think the first time I used Javascript I made a button.

aeg42x · on Oct 11, 2021

Yeah if you go back to my first commits, the first month is mostly me getting very excited about drawing a square haha. I can’t recommend the PIXI graphics library enough, it was super easy to pick up.

Pensacola · on Oct 11, 2021

Hi, nice site! But since you asked, here's an issue: the little "Click to increase or decrease weights" feature doesn't work in Firefox.

aeg42x · on Oct 11, 2021

Thank you! It should be working now!

robomartin · on Oct 11, 2021

For those curious about the nabla (∇) or gradient symbol (not a Greek letter):

https://en.wikipedia.org/wiki/Nabla_symbol

mLuby · on Oct 11, 2021

While most of the random starting weights converged quickly, this one got stuck with a fairly incorrect worldview, so to speak:

![data=blueberries in center ringed by strawberries; model=top and bottom third red, middle third blue instead of the expected outer ring red inner circle blue.](https://imgur.com/a/N2w69Mp)

Is it overfitting to say the same is true for humans, where a brain's starting weights and early experiences may make it much more difficult to achieve an accurate model?

kebsup · on Oct 11, 2021

Very nice. I created a very similar thing a few years ago, but yours is nicer. :D https://nnplayground.com

shimonabi · on Oct 11, 2021

If anyone is interested, here is a simple symbol recognizer using backpropagation I wrote in Python a while ago with the help of the book "Make Your Own Neural Network" by Tariq Rashid. Numpy is a great help with matrix calculations.

https://www.youtube.com/watch?v=IAQyVmTDz0A

abraxas · on Oct 11, 2021

For those who want to take this a bit further I can't recommend cs231n enough. Especially when it was taught by Andrej Karpathy. I believe the lectures may still be up on YouTube. Andrej really has a knack for teaching.

spoonsearch · on Oct 11, 2021

Very nice, the color combination and the UI is so pleasing. The explanation is cool :)

andreyk · on Oct 11, 2021

Pleasantly surprised by this, not yet another blog post on this but rather a nice interactive lesson. Well done!

mdp2021 · on Oct 11, 2021

On mine, the textboxes are broken - overlapping other areas and rendered with a heavy blur.

aeg42x · on Oct 11, 2021

Are you using mobile or a browser? And could you please post a screenshot? I'll see what I can do! Thank you!

sarathyweb · on Oct 11, 2021

The text is too small to read on my phone. I cannot zoom in either :(

synergy20 · on Oct 11, 2021

very cool, nice UI, simplest tutorial but grasps the gist, perfect for starters to get the big picture before diving into the details.

pplanel · on Oct 11, 2021

Can't start in Android's Firefox.

moffkalast · on Oct 11, 2021

Probably can't start in Netscape Navigator either, the audacity.