If a RAID5 system experiences a URE during rebuild, is all the data lost?
I understand the argument regarding larger drives' increased likelihood of experiencing a URE during a rebuild; however, I'm not sure what the actual implications of this are. This answer says that the entire rebuild fails, but does this mean that all the data is inaccessible? Why would that be? Surely a single URE from a single sector on the drive would only impact the data related to a few files, at most. Wouldn't the array still be rebuilt, just with some minor corruption to a few files?
(I'm specifically interested in ZFS's implementation of RAID5 here, but the logic seems the same for any RAID5 implementation.)
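For reference, the "larger drives" argument is essentially a back-of-the-envelope probability calculation. Here is a minimal sketch of it, assuming an independent per-bit URE rate of 1 in 10^14 (the figure typically quoted on consumer-drive spec sheets) and a four-drive array of 4 TB disks; both numbers are illustrative assumptions, not measurements from any particular setup:

```python
import math

def p_ure_during_rebuild(bytes_to_read, ure_rate_per_bit=1e-14):
    """Probability of hitting at least one URE while reading `bytes_to_read`
    bytes, modelling UREs as independent per-bit events (a simplification)."""
    bits = bytes_to_read * 8
    # P(at least one error) = 1 - (1 - rate)^bits, computed stably.
    return -math.expm1(bits * math.log1p(-ure_rate_per_bit))

# Rebuilding one failed member of a 4 x 4 TB RAID5 means reading the
# three surviving drives in full, i.e. roughly 12 TB.
surviving_bytes = 3 * 4e12
print(f"P(>=1 URE during rebuild) ~ {p_ure_during_rebuild(surviving_bytes):.0%}")
# ~62% with these (illustrative) numbers -- the scary headline figure, but it
# says nothing about *how much* data a single URE actually affects.
```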
raid zfs zfsonlinux
asked Oct 28 '18 at 2:02 – process91
In general, when "likelihood of experiencing a URE during a rebuild" is discussed in the context of RAID5 risks, the implied assumption is that an earlier corruption has already occurred to cause the rebuild to be necessary. In other words, the "URE during rebuild" is the second URE, and indeed ALL data will be lost.
– Colt
Oct 28 '18 at 10:18
@Colt - I understand that's the implication, but what I don't understand is why a single URE (which, in the analysis of why RAID5 isn't recommended, seems to refer to a bad sector) would mean that all the data would be lost. In general, if I have lost 1 drive of a RAID5 array then I still have all the data. If I additionally lose a single sector from any of the remaining drives then it is possible that I lost data which was stored in that sector, but if that sector was (for example) free space then I don't care, and if that sector did have data on it then it may only impact a few files.
– process91
Oct 28 '18 at 13:54
@Colt - Based on the answers below, it seems like failing to rebuild the array in the presence of a single URE was a choice made by hardware RAID manufacturers. In my opinion, this was the wrong choice, but thankfully it seems ZFS does it differently.
– process91
Oct 28 '18 at 13:55
See @shodanshok's answer for the process. As to the why, RAID is for providing continuity of access to reliable data for other processes, applications, etc., and is not about backup. The reason that many (most?) hardware controllers abort once the URE occurs in rebuild is that the RAID can no longer do what it is supposed to do. At this point, the backups need to be used to have reliable data. Another way to use RAID is to not do any rebuild at all, but just use RAID to control timing of recovery from backup. Also, it allows time to make the final backup before recovery.
– Colt
Oct 28 '18 at 15:37
Note that “ZFS's implementation of RAID5” is called “raidz” or “zraid” and is different from hardware RAID5. You'll typically get better answers about “ZFS RAID5” by asking about “raidz”.
– Josh
Oct 28 '18 at 15:52
4 Answers
It really depends on the specific RAID implementation:
Most hardware RAID will abort the reconstruction, and some will also mark the array as failed, bringing it down. The rationale is that if a URE happens during a RAID5 rebuild, some data is lost, so it is better to stop the array completely rather than risk silent data corruption. Note: some hardware RAID (mainly LSI-based) will instead puncture the array, allowing the rebuild to proceed while marking the affected sector as unreadable (similar to how Linux software RAID behaves).
Linux software RAID can be instructed to a) stop the array rebuild (the only behavior of "ancient" MD RAID/kernel builds) or b) continue with the rebuild process, marking some LBAs as bad/inaccessible. The rationale is that it is better to let the user make the choice: after all, a single URE can fall on free space, not affecting data at all (or affecting only unimportant files).
ZRAID (raidz) will report some files as corrupted, but it will continue with the rebuild process (see here for an example). Again, the rationale is that it is better to continue and report back to the user, enabling them to make an informed choice.
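To make the per-stripe nature of the loss concrete, here is a minimal toy sketch of RAID5 reconstruction with plain XOR parity (one byte per disk per stripe; the values are made up and no real controller works at this level of simplicity). A URE on a surviving member only prevents reconstruction of the one stripe it belongs to; whether the implementation then aborts, punctures the array, or just reports the affected file is the policy choice described above:

```python
from functools import reduce

def rebuild_chunk(surviving_chunks):
    """Rebuild the missing chunk of one stripe by XOR-ing the surviving
    chunks (data + parity). Returns None if any survivor is unreadable."""
    if any(chunk is None for chunk in surviving_chunks):
        return None  # a URE in this stripe: it cannot be reconstructed
    return reduce(lambda a, b: a ^ b, surviving_chunks)

# Toy 4-disk RAID5 with one failed disk being rebuilt from the survivors.
# disk1 has a URE (None) at stripe index 2; everything else is readable.
disk0  = [0x11, 0x22, 0x33, 0x44]
disk1  = [0xAA, 0xBB, None, 0xDD]
parity = [0x0F, 0x1E, 0x2D, 0x3C]

rebuilt = [rebuild_chunk(stripe) for stripe in zip(disk0, disk1, parity)]
print(rebuilt)  # only stripe 2 is lost; the other stripes rebuild normally
```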
@process91 Just to elaborate a bit further. If the RAID implementation doesn't have the additional data structures needed to mark individual sectors as bad, it has to either fail the rebuild or introduce silent corruption. Marking individual sectors as bad is better, but could still put other sectors at risk due to those sharing a parity sector with the bad sector.
– kasperd
Oct 28 '18 at 18:16
@kasperd Sure, I guess I assumed most RAID implementations had the capability to alert the user to bad sectors. I understand if there is a bad sector in one drive that will lead to an incorrect sector in the new drive after a rebuild. That said, even if the RAID implementation did nothing more than alert the user "I have rebuilt the drive as best as I could, but I experienced 1 URE in the process" and then continued to allow attempted writes to that sector I don't see how other sectors could be at risk. The only possible incorrect sectors would be the original, the new one, and the parity.
– process91
Oct 28 '18 at 18:41
One clarification, based on @Colt's comments above - in the case of hardware RAID, when it marks the array as failed, does it still allow access to the data at all? Even, say, read-only access for the purposes of attempted recovery?
– process91
Oct 28 '18 at 18:45
@process91 Allowing a sector to get corrupted is not considered a good idea, even if that fact was recorded to a log file. You'd have no idea which file might be corrupted. The RAID would have to ensure upon reading that file you get an error. Also clearly you don't want to just overwrite the bad sector, because that would mean you just lost your last chance of recovering the data. So you have an unreadable sector on one disk and a sector on the new disk where you don't know what to write. That could be two different files corrupted.
– kasperd
Oct 28 '18 at 18:46
@process91 I added a note about LSI-based arrays. Give it a look.
– shodanshok
Oct 28 '18 at 19:53
If a URE happens, you'll experience some data corruption in the affected block, which is typically 256 KB-1 MB in size, but this doesn't mean ALL the data on your volume is lost. What's not so great about RAID5 is a totally different thing: the rebuild itself is stressful, and there's a high chance you'll get a second disk failure in a row. In such a case, all the data would be lost.
How is a RAID5 rebuild more stressful on a single drive than a RAID1 rebuild? I see that it is more stressful on the CPU, but for any specific drive we are simply reading all the data off it. Normally, the danger people cite with larger drives is that they will likely encounter a URE during the rebuild, but that's fine with me if it just means a single sector will be corrupted.
– process91
Oct 28 '18 at 10:46
It's probability theory. With N (where N is the number of drives), your chances of having a failure are N times higher.
– BaronSamedi1958
Oct 28 '18 at 15:07
That's not quite how the calculation would work (you'd actually want to calculate 1 minus the probability of not having a failure), but I understand that part. It seems I incorrectly interpreted your statement as suggesting that the act of rebuilding a RAID5 is somehow more stressful on the disk itself (which I've read elsewhere), which would therefore increase the chance of a URE; if that's not what you're saying, then I agree.
– process91
Oct 28 '18 at 16:25
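As a small illustration of the difference between the two phrasings in this exchange, here is a sketch comparing the exact complement calculation with the linear "N times higher" approximation; the 5% per-drive figure is purely illustrative:

```python
def p_any_failure_exact(p_single, n_drives):
    """P(at least one of n_drives fails) = 1 - P(none of them fails),
    assuming independent failures."""
    return 1 - (1 - p_single) ** n_drives

def p_any_failure_naive(p_single, n_drives):
    """The 'N times higher' approximation; close only while n*p is small."""
    return n_drives * p_single

p, n = 0.05, 5  # assumed per-drive failure chance during the rebuild window
print(p_any_failure_exact(p, n))   # ~0.226
print(p_any_failure_naive(p, n))   # 0.25 -- slightly overestimates
```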
I would explain it the other way around:
If the RAID controller doesn't stop on a URE, what could happen?
I lived through it on a server: the RAID never noticed the URE, and after the rebuild, corruption started to build up across the entire RAID volume.
The disk developed more bad sectors after the rebuild, and the data started to become corrupt.
The disk was never kicked out of the RAID volume; the controller failed in its job of protecting data integrity.
That example is written to make you see that a controller can't trust a volume with a URE at all. It's for data integrity, as the volume is not meant to be a backup, but resilience against a disk failure.
I see the new moderators are all constantly checking the site, looking for things to do...
– Ward♦
Oct 28 '18 at 2:28
@Ward haha, yeah :)
– yagmoth555♦
Oct 28 '18 at 2:32
Why would a single URE build up corruption in the entire RAID volume?
– process91
Oct 28 '18 at 10:35
Sorry, I reread your answer. It sounds like you had a single bad URE during the rebuild, but this wasn't the problem. The problem was that sectors continued to go bad after the rebuild, and the drive never reported it. This seems like a separate issue, however, from whether or not the RAID controller notices a URE during a rebuild. The RAID controller could notice the URE during rebuild and alert you to it but still proceed to finish the rebuild. Some data would always be better than no data.
– process91
Oct 28 '18 at 10:54
I'm only interested in analyzing why RAID5 was deemed as "dead" in 2009, which rests on the likelihood of a single URE. My understanding now is that this analysis was both mathematically incorrect and doesn't really apply in the same way to, for example, ZFS.
– process91
Oct 28 '18 at 11:05
I'd suggest reading this question and its answers for a bit more background. Then go and re-read the question you linked to.
When someone says of this situation that "the RAID failed," it means you lost the benefit of the RAID: the continuous access to data that was the reason you set up the RAID array in the first place.
You haven't lost all the data, but the most common way to recover from one dead drive plus (some) UREs on (some of) the remaining drives would be to completely rebuild the array from scratch, which will mean restoring all your data from backup.
Generally, you use RAID when your goal is to minimize downtime. Having the array keep going with unknown and unrepaired corruption is usually counter to that goal.
– David Schwartz
Oct 28 '18 at 3:19
Thanks, that first question you linked to was very informative. Why would I have lost continuous access to the data? The array would still be up during the rebuild, and if it encounters a URE during the rebuild then I would expect it to just keep going, albeit with this one sector of data now corrupted. Is this not the case?
– process91
Oct 28 '18 at 10:45
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "2"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fserverfault.com%2fquestions%2f937547%2fif-a-raid5-system-experiences-a-ure-during-rebuild-is-all-the-data-lost%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
4 Answers
4
active
oldest
votes
4 Answers
4
active
oldest
votes
active
oldest
votes
active
oldest
votes
It really depends on the specific RAID implementation:
most hardware RAID will abort the reconstruction and some will also mark the array as failed, bringing it down. The rationale is that if an URE happens during a RAID5 rebuild it means some data are lost, so it is better to completely stop the array rather that risking silent data corruption. Note: some hardware RAID (mainly LSI based) will instead puncture the array, allowing the rebuild to proceed while marking the affected sector as unreadable (similar to how Linux software RAID behaves).
linux software RAID can be instructed to a) stop the array rebuild (the only behavior of "ancient" MDRAID/kernels builds) or b) continue with the rebuild process marking some LBA as bad/inaccessible. The rationale is that it is better to let the user do his choice: after all, a single URE can be on free space, not affecting data at all (or affecting only unimportant files);
ZRAID will show some file as corrupted, but it will continue with the rebuild process (see here for an example). Again, the rationale is that it is better to continue and report back to the user, enabling him to make an informed choice.
@process91 Just to elaborate a bit further. If the RAID implementation doesn't have the additional data structures needed to mark individual sectors as bad, it has to either fail the rebuild or introduce silent corruption. Marking individual sectors as bad is better, but could still put other sectors at risk due to those sharing a parity sector with the bad sector.
– kasperd
Oct 28 '18 at 18:16
@kasperd Sure, I guess I assumed most RAID implementations had the capability to alert the user to bad sectors. I understand if there is a bad sector in one drive that will lead to an incorrect sector in the new drive after a rebuild. That said, even if the RAID implementation did nothing more than alert the user "I have rebuilt the drive as best as I could, but I experienced 1 URE in the process" and then continued to allow attempted writes to that sector I don't see how other sectors could be at risk. The only possible incorrect sectors would be the original, the new one, and the parity.
– process91
Oct 28 '18 at 18:41
One clarification, based on @Colt 's comments above - in the case of hardware RAID, when it marks the array as failed does it still allow access to the data at all? Even, say, read-only access for the purposes of attempted recovery?
– process91
Oct 28 '18 at 18:45
@process91 Allowing a sector to get corrupted is not considered a good idea, even if that fact was recorded to a log file. You'd have no idea which file might be corrupted. The RAID would have to ensure upon reading that file you get an error. Also clearly you don't want to just overwrite the bad sector, because that would mean you just lost your last chance of recovering the data. So you have an unreadable sector on one disk and a sector on the new disk where you don't know what to write. That could be two different files corrupted.
– kasperd
Oct 28 '18 at 18:46
1
@process91 I added a note about LSI-based arrays. Give it a look.
– shodanshok
Oct 28 '18 at 19:53
|
show 2 more comments
It really depends on the specific RAID implementation:
most hardware RAID will abort the reconstruction and some will also mark the array as failed, bringing it down. The rationale is that if an URE happens during a RAID5 rebuild it means some data are lost, so it is better to completely stop the array rather that risking silent data corruption. Note: some hardware RAID (mainly LSI based) will instead puncture the array, allowing the rebuild to proceed while marking the affected sector as unreadable (similar to how Linux software RAID behaves).
linux software RAID can be instructed to a) stop the array rebuild (the only behavior of "ancient" MDRAID/kernels builds) or b) continue with the rebuild process marking some LBA as bad/inaccessible. The rationale is that it is better to let the user do his choice: after all, a single URE can be on free space, not affecting data at all (or affecting only unimportant files);
ZRAID will show some file as corrupted, but it will continue with the rebuild process (see here for an example). Again, the rationale is that it is better to continue and report back to the user, enabling him to make an informed choice.
@process91 Just to elaborate a bit further. If the RAID implementation doesn't have the additional data structures needed to mark individual sectors as bad, it has to either fail the rebuild or introduce silent corruption. Marking individual sectors as bad is better, but could still put other sectors at risk due to those sharing a parity sector with the bad sector.
– kasperd
Oct 28 '18 at 18:16
@kasperd Sure, I guess I assumed most RAID implementations had the capability to alert the user to bad sectors. I understand if there is a bad sector in one drive that will lead to an incorrect sector in the new drive after a rebuild. That said, even if the RAID implementation did nothing more than alert the user "I have rebuilt the drive as best as I could, but I experienced 1 URE in the process" and then continued to allow attempted writes to that sector I don't see how other sectors could be at risk. The only possible incorrect sectors would be the original, the new one, and the parity.
– process91
Oct 28 '18 at 18:41
One clarification, based on @Colt 's comments above - in the case of hardware RAID, when it marks the array as failed does it still allow access to the data at all? Even, say, read-only access for the purposes of attempted recovery?
– process91
Oct 28 '18 at 18:45
@process91 Allowing a sector to get corrupted is not considered a good idea, even if that fact was recorded to a log file. You'd have no idea which file might be corrupted. The RAID would have to ensure upon reading that file you get an error. Also clearly you don't want to just overwrite the bad sector, because that would mean you just lost your last chance of recovering the data. So you have an unreadable sector on one disk and a sector on the new disk where you don't know what to write. That could be two different files corrupted.
– kasperd
Oct 28 '18 at 18:46
1
@process91 I added a note about LSI-based arrays. Give it a look.
– shodanshok
Oct 28 '18 at 19:53
|
show 2 more comments
It really depends on the specific RAID implementation:
most hardware RAID will abort the reconstruction and some will also mark the array as failed, bringing it down. The rationale is that if an URE happens during a RAID5 rebuild it means some data are lost, so it is better to completely stop the array rather that risking silent data corruption. Note: some hardware RAID (mainly LSI based) will instead puncture the array, allowing the rebuild to proceed while marking the affected sector as unreadable (similar to how Linux software RAID behaves).
linux software RAID can be instructed to a) stop the array rebuild (the only behavior of "ancient" MDRAID/kernels builds) or b) continue with the rebuild process marking some LBA as bad/inaccessible. The rationale is that it is better to let the user do his choice: after all, a single URE can be on free space, not affecting data at all (or affecting only unimportant files);
ZRAID will show some file as corrupted, but it will continue with the rebuild process (see here for an example). Again, the rationale is that it is better to continue and report back to the user, enabling him to make an informed choice.
It really depends on the specific RAID implementation:
most hardware RAID will abort the reconstruction and some will also mark the array as failed, bringing it down. The rationale is that if an URE happens during a RAID5 rebuild it means some data are lost, so it is better to completely stop the array rather that risking silent data corruption. Note: some hardware RAID (mainly LSI based) will instead puncture the array, allowing the rebuild to proceed while marking the affected sector as unreadable (similar to how Linux software RAID behaves).
linux software RAID can be instructed to a) stop the array rebuild (the only behavior of "ancient" MDRAID/kernels builds) or b) continue with the rebuild process marking some LBA as bad/inaccessible. The rationale is that it is better to let the user do his choice: after all, a single URE can be on free space, not affecting data at all (or affecting only unimportant files);
ZRAID will show some file as corrupted, but it will continue with the rebuild process (see here for an example). Again, the rationale is that it is better to continue and report back to the user, enabling him to make an informed choice.
edited Apr 12 at 5:44
answered Oct 28 '18 at 10:50
shodanshokshodanshok
26.8k34788
26.8k34788
@process91 Just to elaborate a bit further. If the RAID implementation doesn't have the additional data structures needed to mark individual sectors as bad, it has to either fail the rebuild or introduce silent corruption. Marking individual sectors as bad is better, but could still put other sectors at risk due to those sharing a parity sector with the bad sector.
– kasperd
Oct 28 '18 at 18:16
@kasperd Sure, I guess I assumed most RAID implementations had the capability to alert the user to bad sectors. I understand if there is a bad sector in one drive that will lead to an incorrect sector in the new drive after a rebuild. That said, even if the RAID implementation did nothing more than alert the user "I have rebuilt the drive as best as I could, but I experienced 1 URE in the process" and then continued to allow attempted writes to that sector I don't see how other sectors could be at risk. The only possible incorrect sectors would be the original, the new one, and the parity.
– process91
Oct 28 '18 at 18:41
One clarification, based on @Colt 's comments above - in the case of hardware RAID, when it marks the array as failed does it still allow access to the data at all? Even, say, read-only access for the purposes of attempted recovery?
– process91
Oct 28 '18 at 18:45
@process91 Allowing a sector to get corrupted is not considered a good idea, even if that fact was recorded to a log file. You'd have no idea which file might be corrupted. The RAID would have to ensure upon reading that file you get an error. Also clearly you don't want to just overwrite the bad sector, because that would mean you just lost your last chance of recovering the data. So you have an unreadable sector on one disk and a sector on the new disk where you don't know what to write. That could be two different files corrupted.
– kasperd
Oct 28 '18 at 18:46
1
@process91 I added a note about LSI-based arrays. Give it a look.
– shodanshok
Oct 28 '18 at 19:53
|
show 2 more comments
@process91 Just to elaborate a bit further. If the RAID implementation doesn't have the additional data structures needed to mark individual sectors as bad, it has to either fail the rebuild or introduce silent corruption. Marking individual sectors as bad is better, but could still put other sectors at risk due to those sharing a parity sector with the bad sector.
– kasperd
Oct 28 '18 at 18:16
@kasperd Sure, I guess I assumed most RAID implementations had the capability to alert the user to bad sectors. I understand if there is a bad sector in one drive that will lead to an incorrect sector in the new drive after a rebuild. That said, even if the RAID implementation did nothing more than alert the user "I have rebuilt the drive as best as I could, but I experienced 1 URE in the process" and then continued to allow attempted writes to that sector I don't see how other sectors could be at risk. The only possible incorrect sectors would be the original, the new one, and the parity.
– process91
Oct 28 '18 at 18:41
One clarification, based on @Colt 's comments above - in the case of hardware RAID, when it marks the array as failed does it still allow access to the data at all? Even, say, read-only access for the purposes of attempted recovery?
– process91
Oct 28 '18 at 18:45
@process91 Allowing a sector to get corrupted is not considered a good idea, even if that fact was recorded to a log file. You'd have no idea which file might be corrupted. The RAID would have to ensure upon reading that file you get an error. Also clearly you don't want to just overwrite the bad sector, because that would mean you just lost your last chance of recovering the data. So you have an unreadable sector on one disk and a sector on the new disk where you don't know what to write. That could be two different files corrupted.
– kasperd
Oct 28 '18 at 18:46
1
@process91 I added a note about LSI-based arrays. Give it a look.
– shodanshok
Oct 28 '18 at 19:53
@process91 Just to elaborate a bit further. If the RAID implementation doesn't have the additional data structures needed to mark individual sectors as bad, it has to either fail the rebuild or introduce silent corruption. Marking individual sectors as bad is better, but could still put other sectors at risk due to those sharing a parity sector with the bad sector.
– kasperd
Oct 28 '18 at 18:16
@process91 Just to elaborate a bit further. If the RAID implementation doesn't have the additional data structures needed to mark individual sectors as bad, it has to either fail the rebuild or introduce silent corruption. Marking individual sectors as bad is better, but could still put other sectors at risk due to those sharing a parity sector with the bad sector.
– kasperd
Oct 28 '18 at 18:16
@kasperd Sure, I guess I assumed most RAID implementations had the capability to alert the user to bad sectors. I understand if there is a bad sector in one drive that will lead to an incorrect sector in the new drive after a rebuild. That said, even if the RAID implementation did nothing more than alert the user "I have rebuilt the drive as best as I could, but I experienced 1 URE in the process" and then continued to allow attempted writes to that sector I don't see how other sectors could be at risk. The only possible incorrect sectors would be the original, the new one, and the parity.
– process91
Oct 28 '18 at 18:41
@kasperd Sure, I guess I assumed most RAID implementations had the capability to alert the user to bad sectors. I understand if there is a bad sector in one drive that will lead to an incorrect sector in the new drive after a rebuild. That said, even if the RAID implementation did nothing more than alert the user "I have rebuilt the drive as best as I could, but I experienced 1 URE in the process" and then continued to allow attempted writes to that sector I don't see how other sectors could be at risk. The only possible incorrect sectors would be the original, the new one, and the parity.
– process91
Oct 28 '18 at 18:41
One clarification, based on @Colt 's comments above - in the case of hardware RAID, when it marks the array as failed does it still allow access to the data at all? Even, say, read-only access for the purposes of attempted recovery?
– process91
Oct 28 '18 at 18:45
One clarification, based on @Colt 's comments above - in the case of hardware RAID, when it marks the array as failed does it still allow access to the data at all? Even, say, read-only access for the purposes of attempted recovery?
– process91
Oct 28 '18 at 18:45
@process91 Allowing a sector to get corrupted is not considered a good idea, even if that fact was recorded to a log file. You'd have no idea which file might be corrupted. The RAID would have to ensure upon reading that file you get an error. Also clearly you don't want to just overwrite the bad sector, because that would mean you just lost your last chance of recovering the data. So you have an unreadable sector on one disk and a sector on the new disk where you don't know what to write. That could be two different files corrupted.
– kasperd
Oct 28 '18 at 18:46
@process91 Allowing a sector to get corrupted is not considered a good idea, even if that fact was recorded to a log file. You'd have no idea which file might be corrupted. The RAID would have to ensure upon reading that file you get an error. Also clearly you don't want to just overwrite the bad sector, because that would mean you just lost your last chance of recovering the data. So you have an unreadable sector on one disk and a sector on the new disk where you don't know what to write. That could be two different files corrupted.
– kasperd
Oct 28 '18 at 18:46
1
1
@process91 I added a note about LSI-based arrays. Give it a look.
– shodanshok
Oct 28 '18 at 19:53
@process91 I added a note about LSI-based arrays. Give it a look.
– shodanshok
Oct 28 '18 at 19:53
|
show 2 more comments
If URE will happen you'll experience some data corruption over the block which is typically 256KB-1MB in size, but this doesn't mean ALL the data on your volume would be lost. What's not so great about RAID5 is a totally different thing: Rebuild itself is stressful and there're high chances you'll get second disk failure in a row. In such a case all the data would be lost.
How is a RAID5 rebuild more stressful on a single drive than a RAID1 rebuild? I see that it is more stressful on the CPU, but for any specific drive we are simply reading all the data off it. Normally, the danger people cite with larger drives is that they will likely encounter a URE during the rebuild, but that's fine with me if it just means a single sector will be corrupted.
– process91
Oct 28 '18 at 10:46
2
It's probability theory. With N (where it's # of drives) your chances to have failure are N times higher.
– BaronSamedi1958
Oct 28 '18 at 15:07
That's not quite how the calculation would work, you'd actually want to calculate 1- probability of not having a failure, but I understand that part. It seems I've incorrectly interpreted your statement as suggesting that the act of rebuilding a RAID5 is somehow more stressful on the disk itself (which I've read elsewhere) which therefore increases the chance of a URE, but if that's not what you're saying then I agree.
– process91
Oct 28 '18 at 16:25
add a comment |
If URE will happen you'll experience some data corruption over the block which is typically 256KB-1MB in size, but this doesn't mean ALL the data on your volume would be lost. What's not so great about RAID5 is a totally different thing: Rebuild itself is stressful and there're high chances you'll get second disk failure in a row. In such a case all the data would be lost.
How is a RAID5 rebuild more stressful on a single drive than a RAID1 rebuild? I see that it is more stressful on the CPU, but for any specific drive we are simply reading all the data off it. Normally, the danger people cite with larger drives is that they will likely encounter a URE during the rebuild, but that's fine with me if it just means a single sector will be corrupted.
– process91
Oct 28 '18 at 10:46
2
It's probability theory. With N (where it's # of drives) your chances to have failure are N times higher.
– BaronSamedi1958
Oct 28 '18 at 15:07
That's not quite how the calculation would work, you'd actually want to calculate 1- probability of not having a failure, but I understand that part. It seems I've incorrectly interpreted your statement as suggesting that the act of rebuilding a RAID5 is somehow more stressful on the disk itself (which I've read elsewhere) which therefore increases the chance of a URE, but if that's not what you're saying then I agree.
– process91
Oct 28 '18 at 16:25
add a comment |
If URE will happen you'll experience some data corruption over the block which is typically 256KB-1MB in size, but this doesn't mean ALL the data on your volume would be lost. What's not so great about RAID5 is a totally different thing: Rebuild itself is stressful and there're high chances you'll get second disk failure in a row. In such a case all the data would be lost.
If URE will happen you'll experience some data corruption over the block which is typically 256KB-1MB in size, but this doesn't mean ALL the data on your volume would be lost. What's not so great about RAID5 is a totally different thing: Rebuild itself is stressful and there're high chances you'll get second disk failure in a row. In such a case all the data would be lost.
answered Oct 28 '18 at 9:06
BaronSamedi1958BaronSamedi1958
7,45911128
7,45911128
How is a RAID5 rebuild more stressful on a single drive than a RAID1 rebuild? I see that it is more stressful on the CPU, but for any specific drive we are simply reading all the data off it. Normally, the danger people cite with larger drives is that they will likely encounter a URE during the rebuild, but that's fine with me if it just means a single sector will be corrupted.
– process91
Oct 28 '18 at 10:46
2
It's probability theory. With N (where it's # of drives) your chances to have failure are N times higher.
– BaronSamedi1958
Oct 28 '18 at 15:07
That's not quite how the calculation would work, you'd actually want to calculate 1- probability of not having a failure, but I understand that part. It seems I've incorrectly interpreted your statement as suggesting that the act of rebuilding a RAID5 is somehow more stressful on the disk itself (which I've read elsewhere) which therefore increases the chance of a URE, but if that's not what you're saying then I agree.
– process91
Oct 28 '18 at 16:25
add a comment |
How is a RAID5 rebuild more stressful on a single drive than a RAID1 rebuild? I see that it is more stressful on the CPU, but for any specific drive we are simply reading all the data off it. Normally, the danger people cite with larger drives is that they will likely encounter a URE during the rebuild, but that's fine with me if it just means a single sector will be corrupted.
– process91
Oct 28 '18 at 10:46
2
It's probability theory. With N (where it's # of drives) your chances to have failure are N times higher.
– BaronSamedi1958
Oct 28 '18 at 15:07
That's not quite how the calculation would work, you'd actually want to calculate 1- probability of not having a failure, but I understand that part. It seems I've incorrectly interpreted your statement as suggesting that the act of rebuilding a RAID5 is somehow more stressful on the disk itself (which I've read elsewhere) which therefore increases the chance of a URE, but if that's not what you're saying then I agree.
– process91
Oct 28 '18 at 16:25
How is a RAID5 rebuild more stressful on a single drive than a RAID1 rebuild? I see that it is more stressful on the CPU, but for any specific drive we are simply reading all the data off it. Normally, the danger people cite with larger drives is that they will likely encounter a URE during the rebuild, but that's fine with me if it just means a single sector will be corrupted.
– process91
Oct 28 '18 at 10:46
How is a RAID5 rebuild more stressful on a single drive than a RAID1 rebuild? I see that it is more stressful on the CPU, but for any specific drive we are simply reading all the data off it. Normally, the danger people cite with larger drives is that they will likely encounter a URE during the rebuild, but that's fine with me if it just means a single sector will be corrupted.
– process91
Oct 28 '18 at 10:46
2
2
It's probability theory. With N (where it's # of drives) your chances to have failure are N times higher.
– BaronSamedi1958
Oct 28 '18 at 15:07
It's probability theory. With N (where it's # of drives) your chances to have failure are N times higher.
– BaronSamedi1958
Oct 28 '18 at 15:07
That's not quite how the calculation would work, you'd actually want to calculate 1- probability of not having a failure, but I understand that part. It seems I've incorrectly interpreted your statement as suggesting that the act of rebuilding a RAID5 is somehow more stressful on the disk itself (which I've read elsewhere) which therefore increases the chance of a URE, but if that's not what you're saying then I agree.
– process91
Oct 28 '18 at 16:25
That's not quite how the calculation would work, you'd actually want to calculate 1- probability of not having a failure, but I understand that part. It seems I've incorrectly interpreted your statement as suggesting that the act of rebuilding a RAID5 is somehow more stressful on the disk itself (which I've read elsewhere) which therefore increases the chance of a URE, but if that's not what you're saying then I agree.
– process91
Oct 28 '18 at 16:25
add a comment |
I would explain it the other way around;
If the RAID controller don’t stop on URE, what could happen ?
I lived it on a server, the RAID never noticed the URE and after the rebuild a corruption started to build up on the entire RAID volume.
The disk started to get more bad sector after the rebuild and the data started to be corrupt.
The disk was never kicked off the RAID volume, the controller fail is job to protect the data integrity.
That example is wrote to make you think that a controller can’t thrust a volume with URE at all, its for the data integrity, as the volume is not meant to be a backup but a resiliance to a disk failure
1
I see the new moderators are all constantly checking the site, looking for things to do...
– Ward♦
Oct 28 '18 at 2:28
@Ward haha, yeah :)
– yagmoth555♦
Oct 28 '18 at 2:32
Why would a single URE build up corruption in the entire RAID volume?
– process91
Oct 28 '18 at 10:35
1
Sorry, I reread your answer. It sounds like you had a single bad URE during the rebuild, but this wasn't the problem. The problem was that sectors continued to go bad after the rebuild, and the drive never reported it. This seems like a separate issue, however, from whether or not the RAID controller notices a URE during a rebuild. The RAID controller could notice the URE during rebuild and alert you to it but still proceed to finish the rebuild. Some data would always be better than no data.
– process91
Oct 28 '18 at 10:54
1
I'm only interested in analyzing why RAID5 was deemed as "dead" in 2009, which rests on the likelihood of a single URE. My understanding now is that this analysis was both mathematically incorrect and doesn't really apply in the same way to, for example, ZFS.
– process91
Oct 28 '18 at 11:05
|
show 5 more comments
I would explain it the other way around;
If the RAID controller don’t stop on URE, what could happen ?
I lived it on a server, the RAID never noticed the URE and after the rebuild a corruption started to build up on the entire RAID volume.
The disk started to get more bad sector after the rebuild and the data started to be corrupt.
The disk was never kicked off the RAID volume, the controller fail is job to protect the data integrity.
That example is wrote to make you think that a controller can’t thrust a volume with URE at all, its for the data integrity, as the volume is not meant to be a backup but a resiliance to a disk failure
1
I see the new moderators are all constantly checking the site, looking for things to do...
– Ward♦
Oct 28 '18 at 2:28
@Ward haha, yeah :)
– yagmoth555♦
Oct 28 '18 at 2:32
Why would a single URE build up corruption in the entire RAID volume?
– process91
Oct 28 '18 at 10:35
1
Sorry, I reread your answer. It sounds like you had a single bad URE during the rebuild, but this wasn't the problem. The problem was that sectors continued to go bad after the rebuild, and the drive never reported it. This seems like a separate issue, however, from whether or not the RAID controller notices a URE during a rebuild. The RAID controller could notice the URE during rebuild and alert you to it but still proceed to finish the rebuild. Some data would always be better than no data.
– process91
Oct 28 '18 at 10:54
1
I'm only interested in analyzing why RAID5 was deemed as "dead" in 2009, which rests on the likelihood of a single URE. My understanding now is that this analysis was both mathematically incorrect and doesn't really apply in the same way to, for example, ZFS.
– process91
Oct 28 '18 at 11:05
|
show 5 more comments
I would explain it the other way around;
If the RAID controller don’t stop on URE, what could happen ?
I lived it on a server, the RAID never noticed the URE and after the rebuild a corruption started to build up on the entire RAID volume.
The disk started to get more bad sector after the rebuild and the data started to be corrupt.
The disk was never kicked off the RAID volume, the controller fail is job to protect the data integrity.
That example is wrote to make you think that a controller can’t thrust a volume with URE at all, its for the data integrity, as the volume is not meant to be a backup but a resiliance to a disk failure
I would explain it the other way around;
If the RAID controller don’t stop on URE, what could happen ?
I lived it on a server, the RAID never noticed the URE and after the rebuild a corruption started to build up on the entire RAID volume.
The disk started to get more bad sector after the rebuild and the data started to be corrupt.
The disk was never kicked off the RAID volume, the controller fail is job to protect the data integrity.
That example is wrote to make you think that a controller can’t thrust a volume with URE at all, its for the data integrity, as the volume is not meant to be a backup but a resiliance to a disk failure
answered Oct 28 '18 at 2:18
yagmoth555♦yagmoth555
12.4k31842
12.4k31842
1
I see the new moderators are all constantly checking the site, looking for things to do...
– Ward♦
Oct 28 '18 at 2:28
@Ward haha, yeah :)
– yagmoth555♦
Oct 28 '18 at 2:32
Why would a single URE build up corruption in the entire RAID volume?
– process91
Oct 28 '18 at 10:35
1
Sorry, I reread your answer. It sounds like you had a single bad URE during the rebuild, but this wasn't the problem. The problem was that sectors continued to go bad after the rebuild, and the drive never reported it. This seems like a separate issue, however, from whether or not the RAID controller notices a URE during a rebuild. The RAID controller could notice the URE during rebuild and alert you to it but still proceed to finish the rebuild. Some data would always be better than no data.
– process91
Oct 28 '18 at 10:54
1
I'm only interested in analyzing why RAID5 was deemed as "dead" in 2009, which rests on the likelihood of a single URE. My understanding now is that this analysis was both mathematically incorrect and doesn't really apply in the same way to, for example, ZFS.
– process91
Oct 28 '18 at 11:05
|
show 5 more comments
I'd suggest reading this question and answers for a bit more background. Then go and re-read the question you linked to again.
When someone says about this situation that "the RAID failed," it means you lost the benefit of the RAID - you lost the continuous access to data that was the reason you set up the RAID array in the first place.
You haven't lost all the data, but the most common way to recover from one dead drive plus (some) UREs on (some of) the remaining drives would be to completely rebuild the array from scratch, which will mean restoring all your data from backup.
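For concreteness, "completely rebuild the array from scratch ... restoring all your data from backup" might look roughly like this with Linux md software RAID; every device name, the filesystem, and the backup path are placeholders, and a hardware controller would use its own management tool for the array-creation step:

    import subprocess

    # Illustrative only: recreate a 4-disk RAID5, make a new filesystem, and
    # restore from backup. All device names and paths below are placeholders.
    def run(cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, check=True)

    run(["mdadm", "--create", "/dev/md0", "--level=5", "--raid-devices=4",
         "/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1"])
    run(["mkfs.ext4", "/dev/md0"])
    run(["mount", "/dev/md0", "/mnt/restore"])
    run(["rsync", "-aHAX", "/backup/latest/", "/mnt/restore/"])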
answered Oct 28 '18 at 2:28
Ward♦
Generally, you use RAID when your goal is to minimize downtime. Having the array keep going with unknown and unrepaired corruption is usually counter to that goal.
– David Schwartz
Oct 28 '18 at 3:19
Thanks, that first question you linked to was very informative. Why would I have lost continuous access to the data? The array would still be up during the rebuild, and if it encounters a URE during the rebuild then I would expect it to just keep going, albeit with this one sector of data now corrupted. Is this not the case?
– process91
Oct 28 '18 at 10:45
In general, when "likelihood of experiencing a URE during a rebuild" is discussed in the context of RAID5 risks, the implied assumption is that an earlier corruption has already occurred to cause the rebuild to be necessary. In other words, the "URE during rebuild" is the second URE, and indeed ALL data will be lost.
– Colt
Oct 28 '18 at 10:18
@Colt - I understand that's the implication, but what I don't understand is why a single URE (which, in the analysis of why RAID5 isn't recommended, seems to refer to a bad sector) would mean that all the data would be lost. In general, if I have lost 1 drive of a RAID5 array then I still have all the data. If I additionally lose a single sector from any of the remaining drives then it is possible that I lost data which was stored in that sector, but if that sector was (for example) free space then I don't care, and if that sector did have data on it then it may only impact a few files.
– process91
Oct 28 '18 at 13:54
@Colt - Based on the answers below, it seems like failing to rebuild the array in the presence of a single URE was a choice made by hardware RAID manufacturers. In my opinion, this was the wrong choice, but thankfully it seems ZFS does it differently.
– process91
Oct 28 '18 at 13:55
See @shodanshok's answer for the process. As to the why, RAID is for providing continuity of access to reliable data for other processes, applications, etc., and is not about backup. The reason that many (most?) hardware controllers abort once the URE occurs in rebuild is that the RAID can no longer do what it is supposed to do. At this point, the backups need to be used to have reliable data. Another way to use RAID is to not do any rebuild at all, but just use RAID to control timing of recovery from backup. Also, it allows time to make the final backup before recovery.
– Colt
Oct 28 '18 at 15:37
Note that “ZFS’ implementation of RAID5” is called “raidz” or “zraid” and is different from hardware RAID5. You’ll typically get better answers about “ZFS RAID5” asking about “raidz”
– Josh
Oct 28 '18 at 15:52
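Expanding slightly on the raidz point: part of the difference is that a ZFS scrub or resilver which hits an unreadable or checksum-failing block reports the affected files rather than abandoning the pool. A small sketch of how one might check that, assuming the ZFS utilities are installed and a placeholder pool name of "tank":

    import subprocess

    # Illustrative only: scrub a raidz pool and print its status.
    # "tank" is a placeholder pool name.
    subprocess.run(["zpool", "scrub", "tank"], check=True)
    status = subprocess.run(["zpool", "status", "-v", "tank"],
                            capture_output=True, text=True, check=True)
    # Output includes per-device read/write/checksum error counters and, with -v,
    # a list of files with permanent errors, if any.
    print(status.stdout)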