Hi, I am a research engineer in Yann LeCun's group at Facebook. I hate to seem t...

robrenaud · on Sept 10, 2014

Have you read Rich Cuarana's work on approximating the behavior of deep nets with single hidden layer networks?

http://arxiv.org/pdf/1312.6184.pdf

From that paper, it seems like it's just that deep nets are easier to train to find state of the art solutions than shallow nets. It's not that shallow nets are inherently incapable of performing as well. Shallow nets can mimic/approximate a given well trained deep net and preserve almost all of the accuracy. So it's not the case that a good solution to the task doesn't exist in the set of hypothesizes spanned by sized shallow net, it's just that people don't know how to effectively find the right parameters for the shallow nets.

kastnerkyle · on Sept 10, 2014

So far this is only shown on TIMIT and MNIST, which are pretty trivial datasets, so it may be dataset dependent.

One of the authors is giving a talk at INRIA in France this month, and the abstract mentioned CIFAR10 (possibly unpublished results). If they have manged to compress a CIFAR10 network, that is a strong indicator that ImageNet networks could be compressed in the same way... but no one has published any results in this regard to my knowledge.

However this is an active area of research for me, and I hope to explore it more soon. I think there are better ways to approximate than this, but this work at least shows that it may be possible.

srean · on Sept 10, 2014

Minor comment, please link the abstract and not the pdf. Those who want to follow up can pull the link to the pdf from the arxiv abstract.

srean · on Sept 10, 2014

> Aside: even single-layer neural networks do not have convex error surfaces;

Single layers can indeed be made to have convex error surfaces fairly easily. One can do so by matching the error/loss function with the squashing/link function. What some old NN folks got wrong was mixing up square loss with logistic function, that is an unhealthy mix. Now if one were to use KL divergence instead of square loss then one would indeed have a convex loss function. In fact this would be nothing but logistic regression. One can however push this idea further, with any choice of a monotonic squashing function one can derive a 'matching' loss that would give you a convex loss. Classical statisticians know this and call it with a different name: canonical generalized linear models. I am not from that tribe, mine is more ML we may perhaps call it minimizing Bregman loss.

Just so that its clear I am talking about single layer networks not single hidden layer networks, there are plenty of cases were the former is useful.

> There have been no magical optimization breakthroughs

It is arguable whether Hessian free methods, contrastive divergence or auto-encoder based training methods qualify as 'breakthroughs' but they have definitely equipped invigorated researchers in this broad area with their capabilities.

kmavm · on Sept 10, 2014

> It is arguable whether Hessian free methods, contrastive divergence or auto-encoder based training methods qualify as 'breakthroughs' but they have definitely equipped invigorated researchers in this broad area with their capabilities.

The scientific revolution in computer vision right now is due to deep convnets, trained in a supervised way using backprop and SGD. All of the systems we're talking about that have started blowing away records are members of this family, and were trained this way. If Alex Krizhevsky had not entered ImageNet 2012, we would not be having this conversation (in part because I probably would never have gotten curious enough about it to leave my home territories of systems and programming languages). Second order methods, unsupervised pre-training, RBMs, graphical models, etc. etc. etc. were exciting for those inside the field, definitely provided encouragement to those optimistic about deep models, and still might prove important, but they have had little impact and visibility to skeptical people outside, in the way that entering a computer vision competition and murdering all the computer vision systems did.

kastnerkyle · on Sept 10, 2014

Well, HF training was a pretty big deal IMO. Definitely saved my bacon in training some recurrent nets, much easier to get working and/or recovering from bad optimization but pretty slow.

The SGD we use today actually has some strong ties back to that second order optimization work - see some papers by Ilya Sutskever relating a special form of momentum back to second order methods like HF. http://www.cs.utoronto.ca/~ilya/pubs/2013/1051_2.pdf His dissertation covers this at some length as well.

Using Adagrad, Adadelta, etc. isn't really SGD as it was back in 2012, and this years entrant "GoogLeNet" basically halved the error again using these and other tricks (we think) - which is even more impressive considering 11% - 6.7% is a HUGE increase in difficulty, just my 2 cents.

However, there is a good reason the colloquial name for these things is "Alexnets"... that work was truly incredible and has not stopped behind the doors of Google I don't think.

kmavm · on Sept 11, 2014

By sheer coincidence, Michael Jordan did an AMA yesterday, and made this case better than I could. http://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_...

michaelochurch · on Sept 10, 2014

Hi, I am a research engineer in Yann LeCun's group at Facebook.

Really? Let's chat offline. I'm michael.o.church at gmail.

I'm not looking for a new job right now but I've been in this business for long enough to know that that can change at any time. If nothing else, I'd love to have lunch the next time I'm in New York (am I correct in assuming that you are in NYC?) and get the kind of intellectual ass-kicking you get when you meet someone who actually knows this sort of field at a deep level.

I'm also getting to the point where I have to decide whether I want to go into "real AI"-- and be a small fish again-- or take the big-fish/smaller-pond path of management (I'm in very early discussions about the MD/Data-Sci position at a fast-growing HK/Sing hedge fund, which probably scares the shit out of guys like you who actually know this stuff, as opposed to traders who read two papers, think they understand them better than they do, and build trading systems.) I'm afraid that if I take the executive/finance track, I might get even farther away from the deep-knowledge/R&D space.

I'm only 31, so I'm not afraid to take the Real AI route and be a small fish in a large/badass pond again. In fact, I'd prefer it, even though the winds seem to be taking me the other way. (Finance/management would be the big-fish/small-pond route, since my data science/ML understanding is well into the top 1% in that world and, again, I'm not offended in the least if you say that that scares you. It scares me. R&D people like you get deep knowledge; guys like me in the private sector-- "private sector", here, meaning startups and finance but not R&D labs-- spend about 85% of our time fighting political battles and self-promoting and rarely have the time to learn anything as deeply as we should.)

I hate to seem to be picking a fight

Don't worry. You're not. It's great to hear from someone who actually gets to use this stuff at work. Thanks for taking the time.

Unfortunately almost every sentence you've written betrays a subtle misunderstanding of the space

I understand the space theoretically, but I'll readily admit that I have, compared to you, almost no real-world experience. I've built neural nets for a few small problems, but nothing at the scale you have.

Perhaps the issues I'm raising are completely theoretical and pose no problem in practice.

Starting at a random (no scare quotes needed) point in weight space

I put "random" in quotes because it's not always clear how to sample a useful "random" point for initialization. There's no such thing as a uniformly random point in R^n, of course, so you need to choose a distribution a priori like U[0,1]^n or N(0, I_n). This seems to pose no problem (even while it's nowhere near the "correct" weights, and we both know that individual weights have no independent meaning in neural nets) if there's a heterogeneity in the scales of the inputs, but can be a problem if you have large scale variations.

If one of your inputs ranges from 0 to 1000 and another ranges from 0 to 0.001, then those "random" (scale-agnostic) weight-initialization distributions actually begin with a 10^6:1 bias favoring the former input. Of course, this is a trivial example and scale normalization is as old as dirt, but I think the point (that useful "random" initialization is not so easily defined, especially when you have deep and messy network topologies in which signals tend to vanish or amplify) is sound.

When you transform the space (e.g. feature extraction, scale normalization, adjustment for multicollinearity) a scale-agnostic distribution like U[0,1]^n or N(0, I_n) becomes dramatically different in terms of what it actually means, relative to the data. The fact that these pre-training techniques seem to be effective if not necessary (at least, the people who I read swear by them) seems to indicate that, at least for some problems, this is a real issue.

With a small number of layers, you still need some randomness to not arrive at the (trivial and useless) stationary point you get from w = 0-- because the hidden nodes don't differentiate-- and then SGD with momentum is enough to get you to a good local minimum. However, it doesn't seem that initialization (beyond "random enough to differentiate the hidden nodes") becomes a major concern for shallower nets.

If I understand correctly, it's when you have 6+ layers (and certain categories of neural nets, like recurrent neural nets, are effectively much deeper) that you start to have these initialization issues, because activation values vanish or grow (to saturation) exponentially in the depth of the network and a bad initial point can leave the network in a borked state (e.g. saturation) where the training performs very badly.

Putting "random" in quotes was an attempt to say, "hey, picking a 'random' point in a useful way is not always trivial, because you still have to choose a sampling distribution a priori" but it was late, I am jet-lagged from a trip to Asia, etc., so maybe I didn't express it well.

Aside: even single-layer neural networks do not have convex error surfaces; convexity, and funky error surface geometry

It's correct that single-layer neural nets are non-convex. My understanding, and correct me if I'm wrong, is that with shallower nets, the "broad == deep" bet (that the best local minima will have the largest basins of attraction) is usually correct, and that this breaks down when the nets are very deep (as with recurrent nets, since BPTT is effectively "unrolling" an RNN into a very deep BPNN). I can't even begin to visualize the error surface of a 10+ layer deep neural net, so if that understanding is wrong, please correct me.

(Broad == deep, the contention that the better local minima are more likely to have larger basins, implies that you're likely to get the optimal local minima with a few initial samples. What you wouldn't want is for all the good local minima to have tiny basins-- to be narrow but deep-- because you'd be unlikely to hit with your initial sampling.)

The problem with single-layer neural nets is not that they converge slowly; the problem is that the layer size needs to grow exponentially with the task size. Single-layer neural nets' universal approximation power is thus not of great practical consequence. The power of deep nets is the power of composition: f(g(h(x))) is a strictly more powerful model than f(x) holding the number of parameters constant.

Thanks. This makes a lot of sense. Yay for composition. (I still have a ways to go in ML; my expertise is in functional programming/language design.)

You just initialize with a Gaussian ball around zero and explore whatever valley in the error surface you happen to be in. Works 100% dandy.

Here's the paper I had in mind when I wrote that comment: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pd... . If I'm misunderstanding the lessons of it, or if the paper is just wrong, please correct me. What I've taken from it is that deep neural network training is quite sensitive to initial starting point, hence the successes of pre-training. To pre-train is, effectively, to change the meaning of "Gaussian ball" (or the like) relative to the data.

That also gets to why I put "random" in quotes. If you do pre-training, feature extraction, et al (which seem to be necessary for many problems but, again, correct me if I'm wrong) then the Gaussian unit ball in the weight space for the (transformed) data is an entirely different set. Even with linear transforms (e.g. scale normalization) this is true.

But again, you've actually used this stuff in your day job and I haven't yet (though I hope to, in my next gig) so I'll defer to your judgment as to whether this is actually an issue. Am I making sense, at least?

It doesn't really make sense to talk about a "single layer convolutional net"

That was my sense, too. It was 11 at night and I didn't want to commit to saying "single-layer convolutional nets are never useful", so I reduced my certainty in what I was saying to "I would hazard the guess ... a lot more useful ...". Generally, when I'm fighting Pacific levels of jet lag and it's after dark, "I can't see how it would work" does not justify "It cannot work".

kmavm · on Sept 10, 2014

Hey, nice to make your acquaintance, and thanks for offering your contact info. Mine can be backed out my HN profile as well. I actually work in Menlo Park; the AI group is split across New York and Menlo Park, with a very small European contingent for now.

WRT the Glorot and Bengio paper, it’s true that there was a lot of excitement surrounding unsupervised pre-training of DNNs, but this mostly preceded the current wave of successes. The big differences between the architectures that are working on image processing today and that paper are:

1. Moar data. The datasets this paper was looking at were on the order 10^5 or so images. 10^6 is a different ballgame.

2. Convolution. Sharing weights really is special. This means there are far fewer parameters to learn in the early parts of the network, and so pre-training seems less necessary.

3. RelU activations. The survey of activation functions uses only smoothly differentiable ones whose gradients get tiny as you are far from zero. RelU has fewer problems with the gradient getting tiny or huge at idiosyncratic points, and also has the virtue of sparsifying the gradients as you backprop (since anything that landed in the negative tail has zero gradient).

So yeah, we really do do entirely unpretrained learning of low-level features, straight from RGB values between 0 and 256, and it works! Isn't that cool??

kastnerkyle · on Sept 10, 2014

Just one more to add... dropout!

That was a big deal, and I think pretty much eliminated greedy layerwise pretraining in the "we have plenty of data, but can't generalize well" case. Good initialization rules help too, but are mostly heuristic and problem dependent to my knowledge.

For interested parties, I will again plug my slides: https://speakerdeck.com/kastnerkyle/euroscipy2014

The last few slides have a kind of "survey list" to get up to speed with modern deep learning approaches for images. I also put the slides on github at http://github.com/kastnerkyle/EuroScipy2014 , which hopefully preserves the hyperlinks where speakerdeck does not.

kmavm · on Sept 11, 2014

I agree dropout is awesome. Buddies? :)

kastnerkyle · on Sept 11, 2014

Yup :)