
But that's reliability, not performance.

For performance I think the problem is that not every single capacity is reviewed; typically only the top capacity in each range is.



Well, the OP mentions that these drives are dropping out of RAID arrays during rebuilding operations. Some people might describe that as a reliability concern.


Right. Though I doubt Backblaze is using RAID in the first place.


Backblaze does use RAID (erasure coding): data is striped across 20 drives, with 17 needed to complete a read. See:

https://www.backblaze.com/b2/storage-pod.html

"For Backblaze Vault Storage Pods each is one of 20 pods needed to create a Backblaze Vault. A Backblaze Vault divides up a file into 20 pieces (17 data and 3 parity) and places a piece of the file on each of the 20 Storage Pods in the Vault. We use our own implementation of Reed-Solomon to encode and distribute the files across the 20 pods, achieving 99.99999% data durability. We open-sourced our Reed-Solomon encoding implementation as well."


They do not use RAID. They use userspace erasure coding of data stored in files on plain ext4 IIRC.

Maybe they use RAID for the OS.


It's software RAID. RAID isn't a piece of hardware, it's a system design.


Yes it is a system design.

Backblaze's main reliability system is erasure coding[4] of shards of files, which is not RAID[1][2]. RAID mirrors or stripes disks or volumes at the block or filesystem layer, not files in userspace.

That being said, I stand corrected in that they did use RAID at some point in time[3] (RAID6 on 13+2 drives), and may still be using some form of RAID. OTOH their reliability calculations don't seem to consider RAID; they work with individual drive failures and shard rebuilds.

[1] https://en.wikipedia.org/wiki/RAID

[2] https://en.wikipedia.org/wiki/Non-RAID_drive_architectures

[3] https://www.backblaze.com/b2/storage-pod.html

[4] https://www.backblaze.com/blog/cloud-storage-durability/


As a layperson I don't understand why they wouldn't; could you give some more information as to why?


For these large-scale infrastructures, they typically use JBOD, i.e. no RAID whatsoever, and they achieve redundancy by maintaining multiple copies of the data over multiple disks spread around the datacentre. So it's not hardware RAID that requires a certain response time (I suspect the drives mentioned here drop because they time out, as it takes a long time to random-write on an SMR disk); it's more like a distributed software RAID. I think they likely also have their own file system so they are not dependent on a fixed block size, which allows them to save space if they have loads of small files.


I think that's about half right: for large storage infrastructure, you would indeed eschew local RAID, but instead of storing multiple copies, you'd use Reed-Solomon encoding to stripe a single copy across many disks/servers/failure domains with a configurable number of added parity stripes. Full copies are really expensive!
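
To put a rough number on "full copies are really expensive", a quick sketch of raw-storage overhead (the 17+3 split matches the Backblaze quote upthread; the rest is just arithmetic):

    # Raw bytes stored per usable byte, for two ways of surviving failures.
    replication  = 3.0               # three full copies
    erasure_17_3 = (17 + 3) / 17     # one copy, split into 17 data + 3 parity shards

    print(f"3x replication:    {replication:.2f}x raw per usable byte")
    print(f"17+3 Reed-Solomon: {erasure_17_3:.2f}x raw per usable byte")   # ~1.18x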


I believe Azure uses full copies [1]. I assume the other cloud providers must do the same.

[1] https://docs.microsoft.com/en-us/azure/storage/common/storag...


Google seems to use erasure coding for some storage classes (IIRC multi-AZ Nearline, or so), last I checked.



RAID only protects you from disk failures, not machine failures. If you want to protect against machine failures (whole server offline), which might be an availability concern, then you would naturally want to replicate the data at a higher level. Once you’re doing that, it makes sense to only replicate the data at a higher level, because for a given level of safety, it is more efficient to replicate the data once, in one layer, than to replicate the data multiple times, at multiple layers.
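
As a rough illustration of the "one layer" point (the shard and replica counts below are made up for the example): protecting the same data at two layers multiplies the overheads, whereas a single cross-machine erasure-coded layer covers both disk and machine failures for far less raw storage.

    # Overhead = raw bytes stored per usable byte (illustrative numbers only).
    local_raid6    = (10 + 2) / 10    # RAID6 inside each machine (10 data + 2 parity)
    machine_copies = 3                # plus 3 full copies across machines
    two_layers = local_raid6 * machine_copies

    one_layer = (17 + 3) / 17         # erasure coding straight across machines

    print(f"RAID6 + 3x replication: {two_layers:.2f}x")   # 3.60x
    print(f"17+3 across machines:   {one_layer:.2f}x")    # ~1.18x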

RAID is only effective for workloads small enough that you care about a single machine.


It's a similar idea, but they distribute the pieces across machines, not just across disks in the same machine: https://www.backblaze.com/blog/reed-solomon/ So the drives in each box are not in a RAID configuration.


Why is that happening?

Slow performance I could understand (though RAID rebuild is usually sequential, not random), but how does SMR cause these "dropouts"?


Hardware RAID typically requires a disk to respond within a certain number of ms, either reporting a successful write or a failure to write. SMR, because it requires a whole section of the drive to be re-written every time you need to modify a single block of that section, may have very slow response times, resulting in timeouts. On a timeout, the hardware RAID controller assumes the disk is offline and drops it from the array.
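
A back-of-the-envelope sketch of why that can take so long on SMR (the zone size and throughput below are typical published figures for drive-managed SMR in general, not specific to these WD models):

    # Worst case on a drive-managed SMR disk: changing one small block can
    # force a read-modify-write of the whole shingled zone it lives in.
    zone_size  = 256 * 1024 * 1024    # ~256 MiB zone (assumed typical value)
    block_size = 4 * 1024             # one 4 KiB random write from the host

    print(f"worst-case write amplification: {zone_size // block_size:,}x")         # 65,536x
    print(f"time to rewrite one zone at 150 MB/s: ~{zone_size / 150e6:.1f} s")     # ~1.8 s

    # Once the drive's persistent (CMR) cache fills during sustained random
    # writes, many zones need this treatment back to back, so individual
    # commands can stall for tens of seconds, long enough to hit controller timeouts.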

The same happened with WD Green drives, which are not rated for RAID. Their error correction logic typically allows a lot more attempts to read the data than a datacentre drive's, resulting in timeouts and the drive dropping out of the array (which is why it is a very bad idea to put a consumer drive in a hardware RAID array).

Now, WD Red NAS drives are meant for RAID arrays, but I suspect WD assumed they would be used for software RAID only, which typically doesn't have those timeouts. If so, it should be clearly stated.


Isn't that behaviour of Greens only a problem when the drive goes bad? In which case you want it to get kicked out of the array. Or do you mean this will happen to a new Green drive with too high a probability?


The problem is that taking some time to read the data doesn't mean the drive has gone bad; at worst perhaps a sector has gone bad, and even then it might still be readable, just a bit slow. And apparently this is very common on large-capacity Green disks after a few months.

The typical bad scenario (and I have been burned by this) is this: say you use RAID5, and one sector goes bad. While the disk tries to read it, the RAID controller kicks the disk out of the array because of the timeout. Now you need to replace the disk and rebuild the array. During the rebuild, as you are doing a full read of all disks, you are pretty likely to find another sector that is slow to read on one of the other disks (particularly on multi-TB disks). The controller then kicks another disk from the array, and now you've lost everything.
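
The related back-of-the-envelope math (using the commonly quoted consumer unrecoverable-read-error spec of 1 error per 10^14 bits; slow-but-readable sectors only make it worse):

    # Chance of hitting at least one unrecoverable read error (URE) while
    # reading every remaining disk in full during a RAID5 rebuild.
    ure_per_bit     = 1e-14      # commonly quoted consumer-drive spec (assumed)
    disk_bytes      = 8e12       # 8 TB drives, as an example
    remaining_disks = 4          # a 5-disk RAID5 minus the failed drive

    bits_read = disk_bytes * 8 * remaining_disks
    p_rebuild_hits_ure = 1 - (1 - ure_per_bit) ** bits_read
    print(f"chance the rebuild hits a URE: {p_rebuild_hits_ure:.0%}")   # ~92%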

Also why data scrubbing is pretty important in NAS.


Thanks for the explanation. That was surely a bad experience. Luckily there are ways to overcome this incompatibility, such as increasing the timeout value for a drive in the OS; see the Linux RAID wiki:

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch


They have long (deep) error recoveries, and most of them don't have configurable SCT ERC. If used on Linux, where the default command (queue) timer is 30 seconds, a bad sector can result in the drive not reporting success or failure for well beyond 30 seconds, at which point the kernel does a link reset, thinking the drive is unresponsive. On SATA drives, resetting the drive clears the whole queue and all commands are lost, so it's not clear which sector has the problem. This prevents whatever healing properties RAID has.

E.g. on md RAID (which includes mdadm- and LVM-managed RAIDs these days), an explicit read error from the drive comes with a sector LBA value, so md knows which stripe it's a member of (or its mirror) and will overwrite the bad sector with reconstructed data, fixing it. But if there are write errors or a bunch of link resets, md will consider the drive faulty (kicking it).

https://raid.wiki.kernel.org/index.php/Timeout_Mismatch

It's important to know this kernel command timer and SCT ERC mismatch is (a) common and (b) affects mdadm, LVM, Btrfs and maybe ZFS RAID on Linux. I'm not really sure whether the kernel command timer applies to ZoL or if a vdev has its own policy.
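
For reference, the usual fixes from that wiki page are either to cap the drive's error recovery with SCT ERC (where the firmware supports it) or to raise the kernel's per-device command timer. A minimal sketch, assuming a SATA drive at /dev/sda and root privileges; the 7-second and 180-second values are the commonly used ones, not magic numbers:

    import subprocess
    from pathlib import Path

    dev = "sda"   # example device name

    # Option 1: tell the drive to give up on a bad sector after 7 s (70 x 100 ms),
    # so it reports a read error before the kernel's 30 s command timer expires.
    # Only works on drives whose firmware supports SCT ERC (many Greens don't).
    subprocess.run(["smartctl", "-l", "scterc,70,70", f"/dev/{dev}"], check=True)

    # Option 2: if SCT ERC isn't supported, raise the kernel's command timeout
    # instead, so deep error recovery doesn't end in a link reset.
    Path(f"/sys/block/{dev}/device/timeout").write_text("180\n")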


Thanks for explaining. The kernel wiki describes an easy solution to this.


That's definitely a reliability concern IMO: critical operations cannot be completed in the expected time.



