Distributed Systems Are a UX Problem (bravenewgeek.com)
89 points by platz on June 4, 2015 | 43 comments


I work on distributed systems, and I thought this was a nice post and echoes some of my own sentiments.

The hardest problems in distributed transactions (banks, inventory, etc.) are often more easily solved with human psychology (UX) than with algorithms.

Hi, I am an ATM. Yes you have money, yes I am offline and can't check, yes you can withdraw so you as the customer are happy with high availability. BUT I know who you are, and if you cheat me I will punish you when I find out!

Hi, I am a shopping cart. Why yes we have one of those in stock, but I am offline so I can't check. I'll take your money now and have it 2 day delivery for you. Oops, I just found out we don't have it in stock but I already have your money, this will take a few weeks now but we'll give you $20 off your next purchase - or do you want a refund?

This is the better approach, changing your business model to prioritize customer satisfaction (UX). Trying to build a globally consistent system instead either has to break the laws of physics with the speed of light, or make your customer have to wait - and if they have to, they probably won't be your customer anymore. One of these options is possible, but incorrect for your business, therefore use a distributed system with good UX.
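A minimal sketch of the ATM idea, under stated assumptions: the class name, the `OFFLINE_LIMIT` cap, and the shape of the balance table are all hypothetical and are not taken from the article. The ATM approves withdrawals while offline, bounds its exposure with a local cap, and only finds out who overdrew when it reconciles later.

```python
from dataclasses import dataclass, field

OFFLINE_LIMIT = 200  # hypothetical per-customer cap while the bank is unreachable


@dataclass
class OfflineATM:
    # Total approved while offline, keyed by customer id.
    pending: dict = field(default_factory=dict)

    def withdraw(self, customer: str, amount: int) -> bool:
        """Approve optimistically, but bound the risk with a local cap."""
        already = self.pending.get(customer, 0)
        if already + amount > OFFLINE_LIMIT:
            return False
        self.pending[customer] = already + amount
        return True

    def reconcile(self, balances: dict) -> dict:
        """Back online: apply the offline log and report each overdraft amount."""
        overdrawn = {}
        for customer, amount in self.pending.items():
            balances[customer] = balances.get(customer, 0) - amount
            if balances[customer] < 0:
                overdrawn[customer] = -balances[customer]
        self.pending.clear()
        return overdrawn
```

The cap is the business decision: it is the most the bank agrees to lose per customer in exchange for staying available. What happens to the customers returned by `reconcile` (fees, apologies, collections) is the UX part.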


> Why yes we have one of those in stock, but I am offline so I can't check.

Assuming the "I am offline" part is redacted, I view this as deliberate lying - deliberate in the "we already have your money" sense that you identified.

If I had made a decision to order with you because you had claimed something was in stock, but it wasn't, I will withdraw my patronage from you and I will complain loudly and vigorously.

This type of behaviour/UX has permanently harmed my relationship with several large retailers.


I think to call it lying is based on the assumption that there is some perfect answer. There usually isn't.

I work at a university library. Our inventory is fairly small on internet scale (~3 million items in 5 or 6 separate 'warehouses'; although we usually only have 1-3 count of each 'item' in inventory), so we don't really have these large scale/distributed problems. The biggest problem we have with the system saying an item is 'in stock' when it isn't is -- the item has been stolen or lost, and we haven't noticed yet and recorded it as such.

Is it "lying" if our system says it's on the shelf, when in fact it's been stolen or lost and we haven't noticed yet?

There are obviously ways we could improve our 'loss reduction'. But there will ALWAYS be cases where the system's knowledge is an imperfect representation of the real world, in any system.

"I've been offline for 10 minutes so the last information I have is as of 10 minutes ago" is just one more.

You can spend more money to try to make the information more accurate, but it will never reach 100% (even before you add in distributed computing, which adds some of its own issues), so as with everything, it's cost-benefit, how much does the customer care, what can we afford to do, at what point is our information good enough to keep them happy -- and, like the OP says, how do we properly make the UX to keep them happy despite information that's not 100% accurate, which it NEVER will be.


I think the problem here is that these systems are frequently set up to look authoritative. "Hurry up! Only 1 left!" is a common sight on Amazon.

The user doesn't care about the challenges of a globally consistent distributed database, all they know is you said there was one left so they bought from you and now you're telling them you were wrong. You set expectations and then failed to meet them and that upsets people.

If your system is not quite perfect, especially around something that can drive a purchasing decision, then make it clear to the user. "Hey, we're low on stock, we think we have 1 left but we might be out". Maybe you can even give a confidence interval, like "This item sells very quickly so we're probably out by now and don't realize it" vs. "we sell two of these a year and know that as of 5 minutes ago there was one left so we probably still have it". Now the user can make an informed decision.
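One way to turn "how fast does it sell" and "how old is our count" into a message is sketched below. The assumptions are mine, not the commenter's: sales are modelled as a Poisson process, and the function names and the 0.95 / 0.5 thresholds are hypothetical.

```python
import math


def prob_still_in_stock(last_count: int, sales_per_hour: float,
                        hours_since_check: float) -> float:
    """P(fewer than last_count sales since the last check), Poisson model."""
    lam = sales_per_hour * hours_since_check  # expected sales since the check
    return sum(math.exp(-lam) * lam ** k / math.factorial(k)
               for k in range(last_count))


def stock_message(last_count: int, sales_per_hour: float,
                  hours_since_check: float) -> str:
    """Pick wording that matches how much we actually trust the count."""
    p = prob_still_in_stock(last_count, sales_per_hour, hours_since_check)
    if p >= 0.95:
        return f"{last_count} left at last check; very likely still available"
    if p >= 0.5:
        return f"{last_count} left at last check, but it may sell out before we confirm"
    return "Probably sold out; we will confirm before charging you"
```

With these thresholds, an item that sells twice a year and was counted five minutes ago gets the confident wording, while a single unit of something selling thirty an hour, counted ten minutes ago, gets the "probably sold out" wording.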


That's where the "compensation" part of the original article comes in. If a company's doing this strategy right, they make it up to the customer in some generous way - "We'll give you a full refund and your next purchase is free" if the shopping cart says something is in stock and it isn't, "We'll pay for your remodel" if someone trashes your AirBnB, "We'll give you a free ticket" if you get bumped from a flight.

What some of the smarter big companies have realized is that emotions are fungible, and they work on a "last writer wins" basis. If you do something really nice for the customer after inconveniencing them (and it has to be more "nice" than the initial problem was "nasty"), they remember you making it up to them, not the initial problem. That shifts the cost of compensation back onto the company, which gives them an incentive to improve their systems, but also lets them trade off occasional hefty compensation charges against getting 100% consistency & availability, which is impossible.


> I think to call it lying is based on the assumption that there is some perfect answer.

No, I don't make that assumption. I call it lying in the sense that stock information is given in the hope that it will convince me to use one retailer over another (and for no other reason). If the retailer is wrong for whatever reason, they have duped me. Deliberately - they weren't required to make such a claim.

That's absolutely distinct from the library case, where I understand that this information is only being provided to me as a service for my benefit. The information is not designed to trick me, but to save time compared to always searching manually. Thanks, by the way.


Personal curiosity: Do you still use Amazon? They behave this way.

Clarification: It need not even be offline. Even if you are online, there is no way to tell whether, in the time it takes your click on the buy button to reach the server, somebody else has already clicked the buy button.

And you can't say this is trivially solved by a central master server that uses a "first come first serve" basis that then knows to report an error back to you after you clicked the button.

Why? Because the whole point of online shopping is that you might have customers on opposite sides of the nation or world. There is no guarantee they will hit the same server, and even if you sharded it so that they will, a single server can only physically scale up to so many requests.

So point being this IS NOT LYING, because suggesting I can communicate faster than light or predict the future is lying. As I believe the article mentioned, computers can only make their best guess. This isn't a lie, but yes they might make a mistake - and the best corrective action is to apologize and compensate.

It would be a far greater mistake to assume FTL knowledge, and a lie to think the machine can't sometimes fail and be wrong.


I've more often had the opposite experience with Amazon. They usually hit the early side of their delivery estimates.

Amazon manages my expectation correctly by saying "Only N left in stock" where that is appropriate. They have earned enough trust and applied this consistently well enough for me to believe they are saying this mostly for my benefit, rather than just to manipulate me into panic buying. Although I don't prefer to shop at Amazon, that's an impressive feat.

It is lying to make a blanket statement like "in stock" when you know there is a fair chance that it isn't.

The number of times when I will lose a "first come first serve" retail battle due to light speed should be infinitesimally small (or predictable: ticket sales sites have developed a decent system for dealing with this). Even if this did happen, you could certainly recognise it before taking my money: you don't have to take my money at the speed of light. I don't believe this is the cause of any of the delivery/stock errors that I have experienced.

Compensating me by offering me money off my next purchase is not compensation: it's a shady marketing trick.


> It is lying to make a blanket statement like "in stock" when you know there is a fair chance that it isn't.

Exactly. The CAP theorem isn't the customer's problem.

I do find it infuriating when a merchant tells me they have something in stock, takes my money, and then says they don't. The solution to this is easy enough for routine sales of off-the-shelf products: take the money at the time you physically ship the product to the customer. If you take an order in real time and then find you can't actually ship it within a short time after that order was placed then own up and offer the customer the best options you can.
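A rough sketch of "take the money at the time you physically ship", assuming a payment flow that separates holding funds from capturing them. The `Order` class, the `Status` names, and the stock dictionary are hypothetical stand-ins, not any real payment provider's API.

```python
from enum import Enum


class Status(Enum):
    AUTHORIZED = "authorized"  # funds held at order time, nothing taken yet
    CAPTURED = "captured"      # money taken, item leaving the warehouse
    RELEASED = "released"      # hold dropped, customer was never charged


class Order:
    def __init__(self, sku: str, amount: int):
        self.sku = sku
        self.amount = amount
        self.status = Status.AUTHORIZED


def ship_or_release(order: Order, physical_stock: dict) -> Status:
    """Charge only if the item can actually be picked right now."""
    if physical_stock.get(order.sku, 0) > 0:
        physical_stock[order.sku] -= 1
        order.status = Status.CAPTURED
    else:
        order.status = Status.RELEASED
    return order.status
```

The stale "in stock" label can still be wrong here, but the failure mode becomes "your order could not be fulfilled and you were not charged" rather than "we have your money and no product".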

A business that took my money for something they claimed was in stock but then didn't ship and held the money for a significant period would get a polite enquiry the first time, but if that didn't result in the product being shipped or an immediate full refund, they'd get the book thrown at them. This is probably the most common complaint I've heard among friends and family with on-line merchants as well, and as far as I can tell, pretty much everyone takes the same view about this one.

> Compensating me by offering me money off my next purchase is not compensation: it's a shady marketing trick.

I agree with this, too. Taking someone's money and then failing to provide the goods is breach of contract, pure and simple. A merchant can certainly offer the customer some form of compensation for the inconvenience. If they have a generally good reputation with that customer and it's a rare event then maybe the gesture will help to maintain that positive relationship. But the customer has no obligation to accept the gesture, and if the merchant does this systematically, they deserve every chargeback they get.


>> It is lying to make a blanket statement like "in stock" when you know there is a fair chance that it isn't.

Well, hold on. They're not intentionally lying. They are allowing a condition where an inaccuracy could present itself, but is that lying? That is the last information that the system has. I suppose we could always use UX labeling that doesn't imply any commitment, i.e. "Reported in stock" or "In stock at last check" and that would be more strictly accurate. I wonder how users would react?


>They are allowing a condition where an inaccuracy could present itself, but is that lying?

Yes. Allowing that inaccuracy to present itself only benefits the retailer, and only inconveniences the customer.


This is _exactly_ how Amazon works. It has a best guess at inventory at a given moment, but it's a snapshot in time that is sometimes wrong. The inventory is verified. The money is taken out. The pieces are all decoupled and essentially offline. This isn't a lie, it's an optimization and works quite well. If any piece fails, the recovery is obvious and easy to deal with.


There seems to be a bit of confusion over just what exactly I'm trying to get at here. I'll be the first to admit, the article might not do it justice. This is a comment I posted on it which hopefully helps clarify:

The point is there are certain realities inherent in distributed systems which can’t be papered over, and those realities often manifest themselves at all levels in the stack. Sure, we can try to build abstractions which hide those problems, but they often leak or simply eschew the problems in dangerously subtle ways.

Frontend and UX folks need to understand what they are building on. They need to understand the potential pitfalls of these abstractions and why they are as such. Sometimes there is a good business case for apologizing, sometimes there’s not. These things are never black and white, but if everyone can understand how systems work, we can build better ones and make better choices.


And my reply:

I agree that there are complexities (as you state, “certain realities”) inherent in distributed systems. If there is one thing I believe we could both agree upon, it is that distributed systems increase the surface area of complexity exponentially when compared to the relatively simpler stack of clients communicating with one system.

Failures in a distributed environment must be expected. Embraced even, since treating them as “an exception” is tantamount to the proverbial ostrich poking its head into the sand.

However, failure in a distributed system does not necessitate “Frontend and UX folks” having to address failures which are not directly a result of client communication with the next tier. For example, should the UX be prepared to handle when the primary provider of a service is down and fail over to a secondary system? If the service is the one which the client is directly interacting with, then likely that makes sense.

But what of the case when the primary system has issues interacting with its collaborators? Such as its persistent store. Or a strategic partner providing required functionality for the workflow. Is it your position that this type of failure be handled by the UX?

Why wouldn’t a server maintain separation of concerns and only communicate to a client when its options have been exhausted? IOW, why assume what “the other side of the connection” is going to do in order to satisfy a request placed on it?

In summary, my position is that when the collaborators in a distributed system have well-defined software contracts (protocol, expectations, etc.) and all parties adhere to them (without assumptions in implementation), then this allows the system(s) involved to provide resilience in depth. However, the fashionable trend of replicating a PowerBuilder environment does not lend itself to this separation of concerns.

Foisting complexity to either end of the spectrum is where truly nasty problems are bred.


In a distributed system, there is no "primary" system. All systems are distributed and decoupled. The UX sends a request and waits for a response that may itself be composed of other requests and responses from multiple other systems. The failures have to be assumed and mitigated at the UX level, not at some back-end level.


By "primary system", I meant the initial system(s) interacted with in a distributed environment. Assuming a connection based distributed model (e.g. TCP/IP), there is a system on the other end of the connection to which a client is communicating. Even for connectionless models, in my mind the "primary system" would be defined as the set of systems which could possibly respond to a client's request.

Perhaps the term "primary" is not the best fit for this role. It just seemed less verbose than saying something like "the set of systems of which a client is directly aware and with which it may interact."


As a UX person, I'm intrigued by this idea, but I would probably avoid creating situations where it is necessary to apologize. If you confirmed a flight, hotel and rental car, but then came back with a "Sorry!" later, that would be a negative experience for a user. They may have passed the confirmation on to others or set other plans in motion, which may also have to be "rolled back". You would need to compensate customers for the inconvenience.

The way to avoid that is to show requests in a pending state, but this has other implications. Users will want to continue to shop and to cancel pending requests, which might mean lost sales.

Generally it's more important to paper over the realities of a complicated system when you have a customer-facing self-service tool. Exposing those realities might create too much complexity for self-service, and you may need sales and service people to manage it. Basically you've built a travel agency instead of Expedia, which, again, is more costly.


> A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.

-- Leslie Lamport


Running this argument in the other direction: if you want an uncompromised user experience, you should use or build a centralized system rather than a distributed one.


Reading your comment made me think of Apple and Microsoft. Microsoft makes an operating system that runs on a distributed hardware ecosystem and they have the unenviable task of making things actually work for the user with crappy (but cheap) hardware. Apple is completely vertically integrated, so they can make sure everything works smoothly in their vision, but you only get Apple's vision.

Then you have Linux, which is a distributed software ecosystem on top of a distributed hardware ecosystem. UX is terrible, but flexibility is through the roof. You can pretty much have exactly what you want, although you might have to do parts of it yourself.


UX doesn't have to be terrible. It's just a long twisty maze of interlocking decisions. Depends on the U, too - I feel better with command line tool output and filters than with fancy GUI furniture.


Depends on what you want out of your UI. I ran Linux for years and years on my laptops. But getting a consistent look was difficult: Firefox, OpenOffice (at the time), Emacs, Inkscape, Gimp, the KDE CD burner I used, etc. all had different toolkits so theming never really looked right. Every time I upgraded Ubuntu something else broke: first sleep, then wifi, then audio (pulseaudio and jack, grrr). Getting on the network after an install was usually a bit of work. Maybe DHCP worked fine, ... and maybe it didn't. Configuring X (resolution, mousewheel, etc.) was a chore, although I think that's no longer necessary. I never did get my USB thumb drives to automount, although I heard tell it was possible.

That long, twisty maze of interlocking decisions, and the fact that every decision could and did go wrong, usually necessitating several hours of troubleshooting, that's bad UX.

Don't get me wrong, I love the idea of Linux. But Mac OS X is the real Unix on the Desktop.


Yeah, my X on Linux is wonderful, but pretty far from what is offered elsewhere (or typical on Linux), and I have no confidence it would be a good X for any particular other U.


That's a nice idea, but it is a misleading one.

Even an app for your phone that communicates with a server is a distributed system.

Even a program performing a simple read/write operation on the file system is part of a distributed system. Handling exceptions is precisely a way to deal with the uncertainty of these two systems working together.
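As a small illustration of that point, here is a sketch in which the caller treats the file system as another party that may not answer. The function name and the fall-back-to-a-default policy are assumptions made for the example.

```python
def read_text_or_default(path: str, default: str = "") -> str:
    """Read a file, treating the file system as a collaborator that can fail."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError:
        # Missing file, permission problem, unplugged drive, dead network mount:
        # from the program's side they are all "the other system did not deliver".
        return default
```

Whether to fall back, retry, or surface the error to the user is the same kind of decision the article talks about, just at a smaller scale.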


This is why the first rule of distributed systems is: Don't build a distributed system if you don't have to.

Unfortunately, a lot of problems eventually grow beyond centralized solutions and thus have to be distributed.


No.

If your server is in Virginia and you have a customer on a phone in Jo'burg... they're gonna have one sucky compromised experience.

Latency kills, and if you are centralized then your latency goes up the further away you get from that point.
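A back-of-the-envelope check on the distance penalty, under two assumptions of mine: signals in optical fiber travel at roughly 200,000 km/s (about two thirds of the speed of light in vacuum), and Virginia to Johannesburg is taken as roughly 13,000 km in a straight line. Real routes are longer and add routing and queuing delay on top.

```python
FIBER_KM_PER_S = 200_000  # approximate signal speed in fiber (assumption)


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time set by propagation delay alone."""
    return 2 * distance_km * 1000 / FIBER_KM_PER_S


# Assumed straight-line distance, Virginia to Johannesburg.
print(min_round_trip_ms(13_000))  # about 130 ms before any server work
```

This floor grows in proportion to distance, and a page that needs several sequential round trips pays it several times over, which is why a single far-away data center feels slow no matter how fast the server is.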


You don't necessarily need to build a distributed system to solve that problem. You could have an entirely separate centralized system which serves that region.


That is a distributed system.

Unless you can make the two systems independent, it's distributed. For most applications like shopping, for example, you must have an eventually centralized database. If you have multiple centralized databases that must talk to each other, guess what, that's a distributed system.


If those multiple centralized databases handle non-overlapping concerns, it is not different than a single centralized server. Still a distributed system (client & server) of course, but that doesn't seem to be what people in this subthread are getting at.

An example of this kind of architecture would be to have a server per country, where each database synchronously maintains inventory for the warehouses it is (uniquely!) responsible for.


If you now have many centralized systems...

Don't you by definition have a distributed system?

Or are you saying these are isolated units that never talk to each other? If you say yes, I would again question the legitimacy of the UX.


> If you now have many centralized systems... Don't you by definition have a distributed system?

Not in the sense that it matters to the customer.

You can have a customer connected to their local centralised shipping centre before confirming an order, so every customer who would be supplied from that centre is seeing the same real-time stock indication. If supplies of a product the customer is trying to order are low, you can manage expectations accurately, for example warning that the product is almost out of stock and saying you'll confirm the order within 15 minutes to deal with races, or providing a "we're holding this for you, but you need to complete your purchase within X minutes to guarantee it" system as many on-line event ticket systems do.
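The "we're holding this for you for X minutes" approach can be sketched as below, assuming a single authoritative stock count per shipping centre. The class and method names are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class HoldingInventory:
    """Stock for one shipping centre, with time-limited holds."""

    def __init__(self, stock: int, hold_seconds: float):
        self.stock = stock
        self.hold_seconds = hold_seconds
        self.holds = {}  # customer id -> time at which the hold lapses

    def _expire(self, now: float) -> None:
        # Return lapsed holds to the available stock.
        for customer, expiry in list(self.holds.items()):
            if expiry <= now:
                del self.holds[customer]
                self.stock += 1

    def hold(self, customer: str, now: float) -> bool:
        """Reserve one unit; False means nothing is available right now."""
        self._expire(now)
        if customer in self.holds:
            return True  # already holding a unit, deadline unchanged
        if self.stock == 0:
            return False
        self.stock -= 1
        self.holds[customer] = now + self.hold_seconds
        return True

    def purchase(self, customer: str, now: float) -> bool:
        """Complete the sale only if the customer's hold is still valid."""
        self._expire(now)
        return self.holds.pop(customer, None) is not None
```

For example, with one unit and a 600-second hold, a second shopper is turned away at checkout time rather than after paying, and gets the unit if the first shopper lets the deadline pass.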

You can certainly have distributed infrastructure and logistics behind those local centralised shipping centres. You still have to physically transport the products there from your manufacturing facilities or suppliers. You still have to do marketing and PR. You still have to do your corporate accounts. However, you have 100% isolated your customers from the distributed aspects of your business, and those aren't going to result in a bad experience for a real life customer just because a delivery driver was stuck in traffic for 10 minutes and arrived later than scheduled with the new stock.


Contrary to your belief, I think your point plainly illustrates the author's point of view that "distributed systems are a UX problem". Or, in other words, a "business problem" which cannot be solved only at the system level.

In your case you choose to minimise the risk of not fulfilling an order by having a shop per stock. This is a perfectly acceptable choice. But there are other options which maximise sales opportunities at the price of more risks, which have to be compensated in turn by a whole chain of business/logistic/human/technical processes and mechanisms.


I wasn't aware that I was disagreeing with the original author. I think the point made by the article here is a valid and useful one. I'm just agreeing with an earlier poster that one possible strategy for dealing with this issue is to change which parts of the system are, effectively, distributed.

If the customer is directly exposed to a distributed ordering system, then it is inevitable that they will sometimes see symptoms of the kinds of technical problems we've been discussing. However you dress it up, that is never going to give the ideal buying experience to all of your customers all of the time.

On the other hand, if you can shift the distribution mechanisms behind the scenes, you still have to deal with similar sorts of problems, but only internally. Your staff can be trained to understand and deal with the kinds of race conditions and scheduling conflicts that might arise, and they aren't going to get upset and take their business elsewhere the way an angry customer might when those things happen.


I better understand now the distinction you make between a distributed system and a distributed system directly exposed to the customer. The former is a fact we have to cope with, while the latter is definitely something we can and have to minimize.


If you like the article, you might like CockroachDB. They just got funded by Google Ventures.

https://github.com/cockroachdb/cockroach#design

https://twitter.com/GoogleVentures/status/606534505332154369


I really enjoyed the analog world analogies like the work order.


At trivial scale, sure. But consider a multicore CPU or GFS. Both of these are distributed systems, but ones where UX has no solutions.


That's only because you've drawn such tight boundaries around your definition of "distributed system" that "User Experience" is actually "API design".

The same principle applies: The system can't do everything magically, so it exposes some of that through the API. I mean, I doubt the GFS API is simply a read() and write() method that are always guaranteed to work instantly.
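To make that concrete, here is a purely hypothetical sketch of a read call that exposes the uncertainty to its caller. It is not the GFS API; `ReadResult`, `read`, and the primary/replica dictionaries are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadResult:
    value: Optional[bytes]
    stale: bool                  # True if served from a replica that may lag
    error: Optional[str] = None  # e.g. "not_found" or "unavailable"


def read(primary: dict, replica: dict, key: str, primary_up: bool) -> ReadResult:
    """Return the best answer available and say how much to trust it."""
    if primary_up:
        if key in primary:
            return ReadResult(primary[key], stale=False)
        return ReadResult(None, stale=False, error="not_found")
    if key in replica:
        return ReadResult(replica[key], stale=True)
    return ReadResult(None, stale=True, error="unavailable")
```

The caller, not the storage layer, decides whether a stale value is good enough, which is the API-level version of the UX decision.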

Instead you start posing the same kind of "Hey, I need you to make a decision here" tradeoffs that the author is talking about:

"Sometimes X will happen, if so you need to check Y and try to recover with Z, unless you didn't really care about W. If you don't want to recover and would rather discard, make sure you call Q, otherwise we'll eventually destroy it after T minutes."


I would have to disagree. One of the main benefits of distributed systems is that you don't need to know that in reality your file system is sharded onto 5 different disks at 2 different colocs. I understand that sometimes you need to manually recover, but OP sweepingly states that distributed systems are a UX problem, which I think is a ridiculous reduction and equivalent to saying "well this shit is hard so let's just spill the guts of the system internals up to the user".


GFS: Google File System (sorry should have said)


Please don't read this article. I wasted 2 mins of my life on it....


Would you care to elaborate on that?


I did not think it was that bad, but the article is not very concise. The first comment on the article (below it, by "An Engineer") sums up my sentiments pretty well.



