> Indeed, when I look at the emptyheaded paper for example, I see SIMD paralleli...

alexnewman · on Oct 10, 2017

Heh, didn’t Greenplum solve most of the problems Google or Yahoo had, just at a huge cost? In retrospect now that it’s open source software... I think one of his points was that having your data unindexed is a big step backwards. I think he’s crazy. Btw Frank, I love your work on differential dataflow!

makmanalp · on Oct 10, 2017

> MR was a shit model and everyone should be using RDBMSes instead

Ah, yes, sorry, I didn't mean to make it sound like I agree with this.

> If you wanted that behavior, with its orders-of-magnitude performance improvements, you could not get it from an existing optimizing RDBMS---not HyPer, nor MonetDB, nor anything else in your list---but you could get it from a more programmable data-parallel system.

Of course! I now realize that you're using terminology in a way that I'm not familiar with, e.g. "computation" meaning something like "arbitrary computation", which, upon rereading, makes me understand and agree with a lot more of what you're saying.

What bothered me about your comment was that it sounded a bit like "wow, I found that RDBMSes suck at this specific type of computation that they're not built to deal with, therefore query optimizers suck in general", which seemed like an over-reaching argument.

When the claim is "there exist computations that RDBMS query optimizers suck at", then absolutely, I agree with you to the ends of the earth. If it's also "there are reasons why you want a more MR-like model", again, I agree completely. The point is that having query optimizers and different computational models are separate decisions that don't affect each other - you can have both.

> Stonebraker's claim was that MR was a huge step backwards, which is BS to the extent that RDBMSes weren't solving the problems Google (and others) had.

I guess what I took away from his claim was that the contribution of MR itself was not the problem, but the fact that while creating that model, they ignored a lot of other learnings: e.g. blocking operators can be detrimental, indexes are handy to have. Plus the fact that everyone /else/ who didn't have Google's reasons to forego all those niceties still dove head-first into "let's use MR for everything".

And that's what Elvin is talking about above - you're now seeing examples of tools where lessons from both camps are being applied: "MR-like but be smart enough to not scan everything" (e.g. Spark).

> It seemed like we were talking about whether there was a heavy pro-RDBMS bias in the redbook

Ah, for me the main question and discussion above was "are RDBMS techniques even relevant at this point", to which my answer is yes, absolutely. That doesn't mean you have to take every concept from it wholesale: many techniques that developed in one context are applicable in others, regardless of Stonebraker's opinions.

I think also maybe you see everything as firmly in the RDBMS / SQL camp or firmly in the "not" camp, but I really don't think that's the case. Take Flink, for example: there's a lower-layer API for arbitrary computations, and higher-level APIs for stuff like SQL, which compile down to the lower-level language and get query-optimized in different ways at different layers. Or they do some neat join optimizations. So even with newer computational models there's stuff to learn from old ways, so it's worth it to read the darned book. That's my point here.

frankmcsherry · on Oct 10, 2017

> What bothered me about your comment was that it sounded a bit like "wow, I found that RDBMSes suck at this specific type of computation that they're not built to deal with, therefore query optimizers suck in general", which seemed like an over-reaching argument.

Gotcha. Yes, it was more "I have some things I need to do, and RDBMSes can't do some of those things, which rules them out as a solution". There is for sure lots of great stuff in query optimization, and it makes some queries lots better.

> I guess what I took away from his claim was that the contribution of MR itself was not the problem, but the fact that while creating that model, they ignored a lot of other learnings: e.g. blocking operators can be detrimental, indexes are handy to have. Plus the fact that everyone /else/ who didn't have Google's reasons to forego all those niceties still dove head-first into "let's use MR for everything".

They did not ignore them, they just weren't building a database. MR was much more a scalable HPC replacement than a data management product.

The main reason that the DB community took a huge step backwards is that they (incl. Stonebraker) had doubled down on mediocre compute abstractions, and found they needed to revisit much of what they'd done, because it just didn't work.

> Ah, for me the main question and discussion above was "are RDBMS techniques even relevant at this point", to which my answer is yes, absolutely. That doesn't mean you have to take every concept from it wholesale: many techniques that developed in one context are applicable in others, regardless of Stonebraker's opinions.

I totally agree with you that they are relevant (and am active in the area). Stonebraker takes a much stronger position, and the appeal to his authority was what triggered me. I didn't mean to point that at you as much as it may have turned out.