3Ware RAID6 array sometimes hanging. Undetected broken disk?
























We have a Debian server with a 3Ware 9650SE 8-drive RAID controller and a 5-disk RAID6 array, acting as a virtual machine host; everything runs Linux. Problems keep occurring and I suspect an undetected broken disk.



We have had several crashes now where both host and all guests are saying that the IO system blocked for 120 seconds or more. We suspected a faulty RAID controller, but we replaced it with an identical one with identical firmware, which didn't fix it. I didn't think it would, because a second RAID1 array kept working properly.
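The "blocked for 120 seconds" reports are the kernel's hung-task warnings, so counting them gives a rough timeline to compare against the verify schedule. A minimal sketch, assuming the warnings appear in the kernel log in their usual wording; the helper name `count_hung_tasks` is made up for illustration:

```shell
# Count hung-task warnings in kernel log text read from stdin.
# Assumes the usual wording:
#   "INFO: task NAME:PID blocked for more than 120 seconds."
count_hung_tasks() {
    # grep -c exits non-zero when the count is 0; keep the status clean.
    grep -c 'blocked for more than [0-9][0-9]* seconds' || true
}
```

Typical use would be `dmesg | count_hung_tasks` on the host and inside each guest.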



Almost a week ago (Sunday), when this was acting up, the auto verify was at 66%. Last night (Friday morning) it was at 67%. The percentage was the same before and after rebooting, and on both occasions we were experiencing problems. When I stopped the verify with tw_cli /c0/u0 stop verify, things became responsive again.



I suspect it got stuck on a disk fault at around 66%. An auto verify starts on Saturday:



# tw_cli /c0 show verify
/c0 basic verify weekly preferred start: Saturday, 12:00AM


and would normally be long done by Friday. Seeing as how Sunday was 66% and Friday was 67%, it's unlikely to be coincidence.
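To confirm that the verify really stalls rather than just running slowly, the progress could be logged at intervals. A minimal sketch, assuming the unit table printed by `tw_cli /c0 show` keeps the unit name in the first column and %V/I/M in the fifth; the helper name `verify_progress` and the sample "VERIFYING" line in the comment are illustrative assumptions:

```shell
# Print the %V/I/M column for one unit from `tw_cli /c0 show` output on stdin.
# Assumed layout: column 1 = unit name, column 5 = %V/I/M, for example
#   u0 RAID-6 VERIFYING - 66% 256K 5587.9 RiW OFF
verify_progress() {
    awk -v unit="$1" '$1 == unit { print $5 }'
}
```

Run from cron as something like `echo "$(date) $(tw_cli /c0 show | verify_progress u0)" >> verify.log`; a value that does not move for hours would support the suspicion that the verify is stuck.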



'smartctl -a -d 3ware,0 /dev/twa0' and 'smartctl -t long' (long SMART self test) on all the drives didn't reveal any errors. Neither does tw_cli /c0 show alarms.
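Checking every port by hand is error-prone, so the per-port commands can be generated instead. A minimal sketch that only prints the commands so they can be reviewed before running; the helper name `smart_cmds` is hypothetical, while `-d 3ware,N` and `/dev/twa0` are taken from the commands above:

```shell
# Print one smartctl command per controller port instead of running it.
# $1 = smartctl options, $2 = controller device, remaining args = port numbers.
smart_cmds() {
    opts=$1
    dev=$2
    shift 2
    for port in "$@"; do
        printf 'smartctl %s -d 3ware,%s %s\n' "$opts" "$port" "$dev"
    done
}
```

For the five RAID6 members this would be `smart_cmds -a /dev/twa0 0 1 2 3 4`, piped to `sh` once the list looks right. Restricting the port list to ports that exist also avoids the invalid-port diagnostics described in edit3.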



I suspected that a disk is broken in a way that is hard to detect, so I took each drive out of the array one by one, created a 'single' unit from it and filled it with zeros using dd. No disk showed errors.
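Zero-filling with dd exercises writes only, so a read-only pass over the region where the verify stalls might be a useful complement. A rough sketch, assuming (unverified) that verify progress maps more or less linearly onto each member disk; the device name `/dev/sdX` and the helper name `read_window_cmd` are placeholders:

```shell
# Print a read-only dd command covering a window that starts at a given
# percentage of a device. Nothing is executed, nothing is written to the disk.
# $1 = device (placeholder, e.g. /dev/sdX), $2 = device size in MiB,
# $3 = percentage where the window starts, $4 = window length in MiB.
read_window_cmd() {
    skip=$(( $2 * $3 / 100 ))
    printf 'dd if=%s of=/dev/null bs=1M skip=%s count=%s iflag=direct\n' \
        "$1" "$skip" "$4"
}
```

The size in MiB can be derived from `blockdev --getsize64` divided by 1048576. Reads that stall or crawl inside that window would point at the drive under test.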



Or any other advice?



Edit:



this is the layout:



# tw_cli /c0 show

Unit  UnitType  Status  %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    OK      -       -       256K    5587.9    RiW    OFF
u1    SPARE     OK      -       -       -       1863.01   -      OFF
u2    RAID-1    OK      -       -       -       1862.63   RiW    ON

VPort  Status  Unit  Size     Type  Phy  Encl-Slot  Model
------------------------------------------------------------------------------
p0     OK      u0    1.82 TB  SATA  0    -          ST32000542AS
p1     OK      u0    1.82 TB  SATA  1    -          ST32000542AS
p2     OK      u0    1.82 TB  SATA  2    -          ST32000542AS
p3     OK      u0    1.82 TB  SATA  3    -          ST32000542AS
p4     OK      u0    1.82 TB  SATA  4    -          ST32000542AS
p5     OK      u1    1.82 TB  SATA  5    -          WDC WD2002FYPS-02W3
p6     OK      u2    1.82 TB  SATA  6    -          WDC WD2002FYPS-02W3
p7     OK      u2    1.82 TB  SATA  7    -          WDC WD2002FYPS-02W3

Name  OnlineState  BBUReady  Status  Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           Yes       OK      OK    OK    0      xx-xxx-xxxx


The unit in question is u0.



edit2:



tw_cli /c0 show diag shows something interesting (edit3: this is harmless, I found out it's caused by calling smartctl -a -d 3ware,X /dev/twa0 where X is an invalid port):



QueueAtaPassthrough() called with invalid TargetHandle: 0x17, portHandle: 0xFF

Legacy opcode=0xB1 error=0x10E

E=010E T=14:15:51 : Invalid operation for specified port
E=010E T=14:15:51 U=0 : Return error status to host
Error, Unit 23: Invalid operation for specified port
(EC:0x10e, SK=0x05, ASC=0x24, ASCQ=0x00, SEV=01, Type=0x70)
No additional sense data
Error, Unit 23: 0x10E OVERRIDDEN due to invalid sense buffer descriptor
sense buffer: len=0, address=0x414ca2c7c
Send AEN (code, time): 0031h, 06/21/2013 14:26:16
Synchronize host/controller time
(EC:0x31, SK=0x00, ASC=0x00, ASCQ=0x00, SEV=04, Type=0x71)


I get tons of these. I have no idea what it means though. I can't even make out which unit or port it is. (edit3: I do know now, it's harmless).



Given my edit3, I'm back to square one. Nothing indicates a disk is broken, except that the verify hangs at 66% and causes the array to hang, which also sometimes happens randomly. I wish the verify would find the fault...










edited Jul 11 '13 at 14:53 by Halfgaar

asked Jun 21 '13 at 13:04 by Halfgaar












  • What HDDs? Are they officially supported?

    – grs
    Jun 21 '13 at 14:08












  • I added the layout. The disks are ST32000542AS. They are supported, and moreover, the server worked fine for 3 years.

    – Halfgaar
    Jun 21 '13 at 14:26











  • I had an issue with some WD drives becoming very slow at some point. Only hdparm (sadly not doable here) showed me a throughput of ~300 KB/s (yes, K!) instead of the usual 80-100 MB/s.

    – Benjamin Sonntag
    Jun 28 '14 at 8:30






  • The difference between enterprise and desktop drives is how they handle errors. If an enterprise drive encounters an error, it drops out of the RAID (companies that are sensitive to the risk of losing data are willing to pay for this). If a desktop drive hits a fault, it keeps trying until all of its timeouts have expired (its user has a single drive and needs to reach the data on it, and drives that dropped out immediately would hurt the manufacturer). Apparently the ST32000542AS is a quiet, economical desktop drive; see for example goo.gl/rWb5lj

    – Rainbow-
    Sep 8 '14 at 14:04











  • Actually, just recently, this server suddenly hung, differently and more severely than the original problem, and the logs showed a timeout on a RAID port. The timeout was on one of the enterprise drives (of which this server has more now).

    – Halfgaar
    Sep 9 '14 at 7:15



























4 Answers














This issue may be due to one of the disks encountering a read error and blocking the entire array until it either manages to reallocate the sector or the RAID controller assumes the drive is dead and boots it out of the array, marking it as "Degraded" (this is completely up to the controller in question). This may happen often if a disk is starting to die but still passes SMART. Most consumer disks will continue to attempt the read forever.



This issue is solved in some drives destined for RAID using something called Error recovery control. WD calls this TLER. From the site:



RAID-specific time-limited error recovery (TLER) - Pioneered by WD, this feature prevents drive fallout caused by the extended hard drive error-recovery processes common to desktop drives.



Basically, it tells the disk to give up after x seconds if it cannot read a sector. This is great in a RAID, since the data can be reconstructed from the other disks.



From what I've read, the ST32000542AS does not implement any form of ERC so any of them can block the entire array. The WD2002FYPS does in fact implement WD's TLER so they will not cause this issue.
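Whether a given drive supports error recovery control can usually be queried with smartctl's `-l scterc` option. A minimal sketch that only prints the per-port query commands; the helper name `scterc_cmds` is hypothetical, and whether the 3ware controller passes SCT commands through to the drives is an assumption that would need checking on the actual hardware:

```shell
# Print smartctl commands that query the SCT Error Recovery Control setting
# of each port; review them before running.
# $1 = controller device, remaining args = port numbers.
scterc_cmds() {
    dev=$1
    shift
    for port in "$@"; do
        printf 'smartctl -l scterc -d 3ware,%s %s\n' "$port" "$dev"
    done
}
```

For example, `scterc_cmds /dev/twa0 0 1 2 3 4 | sh`. On drives that support it, `-l scterc,70,70` sets a 7.0 second limit for reads and writes (the values are in tenths of a second).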




















Just to make sure, what is your firmware version?

I experienced an issue that sounds a lot like what you are describing. It occurred when the following conditions were met:

• 3ware 96xx series controller

• RAID 6

• 256k stripe size

• Firmware version < v4.10.00.021*

At the time there was no firmware fix available, so I migrated from a 256k to a 64k stripe size, which also solved the issue. You could try that as a workaround, though it will certainly take days to complete.

Later on I tried the new firmware (* 4.10.00.021, I think, had the fix) with 256k and it worked like a charm. 4.10.00.027 is the latest version.
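The firmware condition above can be checked mechanically. A minimal sketch, assuming GNU sort (for `-V`) and that the firmware version has already been read from the controller, for instance from the output of `tw_cli /c0 show firmware`; the helper name `fw_older_than` and the sample version 4.08.00.006 are illustrative only:

```shell
# Return success (0) if version $1 sorts strictly before version $2.
# Relies on GNU sort's -V (version sort), which is available on Debian.
fw_older_than() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}
```

For example, `fw_older_than 4.08.00.006 4.10.00.021 && echo "firmware predates the suspected fix"`.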































    • We don't have the problem anymore. Verification always succeeded. We did however get a complete server hang some months ago (after a long time of no problems). Dmesg said that disk x timed out. I don't know why the controller didn't kick it, but even though it wasn't explicitly marked as degraded, I replaced it. And, other disks have been replaced since then as well. So it's likely it was a disk issue.

      – Halfgaar
      Oct 25 '14 at 12:21
































Two things that have not been brought up so far:

1. Is this a SATA RAID controller? If so, SATA cables are prone to aging, and replacing them might solve such issues easily. This is worth trying whenever disk errors, lags or timeouts occur but the SMART values are all OK and the drive passes all self-tests. Unfortunately, finding a good SATA cable vendor is difficult.

2. 3Ware RAID controllers are old and unsupported these days. You will get neither firmware upgrades nor spare parts. If your controller dies, the RAID might be unrecoverable without the matching controller AND firmware, and an expensive data recovery is then needed.
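For the cable suspicion in the first point, SMART attribute 199 (UDMA_CRC_Error_Count) is the usual hint, since it counts transfer errors on the link rather than on the platters. A minimal sketch, assuming the standard ten-column attribute table of `smartctl -A` with the raw value in the last column; the helper name `crc_error_count` is made up:

```shell
# Print the raw value of SMART attribute 199 (UDMA_CRC_Error_Count)
# from `smartctl -A` output read on stdin.
# Assumes the standard ten-column attribute table, raw value in column 10.
crc_error_count() {
    awk '$1 == 199 { print $10 }'
}
```

Usage could look like `smartctl -A -d 3ware,0 /dev/twa0 | crc_error_count`; a count that keeps rising between runs would make the cable or backplane a more likely suspect than the disk.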



















I used to have issues with a 3ware controller and Seagate drives. There's a subtle firmware incompatibility. I switched to Samsung drives, problem solved.






        4 Answers

        This issue may be caused by one of the disks hitting a read error and blocking the entire array until the disk either manages to reallocate the sector or the RAID controller assumes the drive is dead and drops it from the array, which is then marked as "Degraded" (how long that takes is entirely up to the controller in question). This can happen often if a disk is starting to die but still passes SMART. Many consumer disks keep retrying the read for a very long time, far longer than a RAID controller is willing to wait.

        Some drives intended for RAID use solve this with a feature called Error Recovery Control (ERC). WD calls it TLER. From WD's site:

        "RAID-specific time-limited error recovery (TLER) - Pioneered by WD, this feature prevents drive fallout caused by the extended hard drive error-recovery processes common to desktop drives."

        Basically, it tells the disk to give up after x seconds if it cannot read a sector. This is great in a RAID, since the data can be recovered from the other disks.

        From what I've read, the ST32000542AS does not implement any form of ERC, so any one of them can block the entire array. The WD2002FYPS does implement WD's TLER, so those drives should not cause this issue.
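To check whether a given drive reports ERC, smartmontools can query it with `smartctl -l scterc`. The following is only a minimal Python sketch: the helper names are hypothetical, the sample text in the comments is an assumption about what typical smartctl output looks like, and whether a 3ware controller passes the SCT command through to the drive is not something this answer establishes.

```python
import re


def scterc_command(device, read_ds=None, write_ds=None, device_type=None):
    """Build a smartctl command line that queries (or sets) SCT ERC.

    Timeouts are given in deciseconds, so 70 means 7.0 seconds.
    device_type is handed to smartctl's -d option, for example
    "3ware,3" for port 3 behind a 3ware controller (example value).
    """
    log = "scterc"
    if read_ds is not None and write_ds is not None:
        log += f",{read_ds},{write_ds}"
    cmd = ["smartctl", "-l", log]
    if device_type:
        cmd += ["-d", device_type]
    return cmd + [device]


def parse_scterc(output):
    """Parse the text printed by `smartctl -l scterc`.

    Assumed output shape (not taken from the original post):
        SCT Error Recovery Control:
                   Read:     70 (7.0 seconds)
                  Write:     70 (7.0 seconds)
    Returns a dict with timeouts in seconds (None means the timer is
    disabled), or None if no Read/Write lines were found, e.g. when the
    drive does not support ERC.
    """
    result = {}
    for line in output.splitlines():
        match = re.match(r"\s*(Read|Write):\s+(Disabled|\d+)\b", line)
        if match:
            value = match.group(2)
            result[match.group(1).lower()] = (
                None if value == "Disabled" else int(value) / 10
            )
    return result or None
```

The command list from `scterc_command("/dev/twa0", device_type="3ware,3")` could be passed to `subprocess.run(..., capture_output=True, text=True)` and its stdout fed to `parse_scterc`. The device node `/dev/twa0` and the port number are examples only.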






            edited May 31 '17 at 17:00

























            answered May 31 '17 at 16:54









            succulent_headcrab

                Just to make sure, what is your firmware version?

                I experienced an issue that sounds a lot like what you are describing. It occurred when all of the following conditions were met:

                • 3ware 96xx series controller

                • RAID 6

                • 256k stripe size

                • Firmware version < v4.10.00.021*

                At the time there was no firmware fix available, so I migrated from a 256k to a 64k stripe size, which solved the issue. You could try that as a workaround, though the migration will certainly take days to complete.

                Later on I tried the new firmware (* I think 4.10.00.021 was the version with the fix) with a 256k stripe size and it worked like a charm. 4.10.00.027 is the latest version.
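The four conditions above can be checked mechanically. This is a rough Python sketch with hypothetical function names. It assumes the firmware is reported as a dotted version string, optionally with a prefix such as "FE9X", and it takes 4.10.00.021 as the fixed version, which the answer itself is not fully certain about.

```python
def parse_firmware(version):
    """Turn '4.10.00.021', 'v4.10.00.021' or 'FE9X 4.10.00.021'
    into a tuple of integers such as (4, 10, 0, 21)."""
    numeric = version.split()[-1].lstrip("vV")
    return tuple(int(part) for part in numeric.split("."))


# Version believed (not confirmed) to contain the fix.
FIXED_FIRMWARE = parse_firmware("4.10.00.021")


def matches_reported_hang(model, raid_level, stripe_kb, firmware):
    """True when a unit matches all four conditions listed in the answer."""
    return (
        model.upper().startswith("96")       # 3ware 96xx series controller
        and raid_level == 6                   # RAID 6
        and stripe_kb == 256                  # 256k stripe size
        and parse_firmware(firmware) < FIXED_FIRMWARE
    )
```

Comparing tuples of integers rather than raw strings avoids mis-ordering versions whose components differ in digit count. The model, RAID level, stripe size and firmware string would have to be read from the controller by hand or with the vendor CLI; that step is left out here.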






                edited Aug 22 '18 at 20:09









                longneck

                answered Oct 24 '14 at 11:26









                Acrklor

                • We don't have the problem anymore. Verification always succeeded. We did however get a complete server hang some months ago (after a long time of no problems). Dmesg said that disk x timed out. I don't know why the controller didn't kick it, but even though it wasn't explicitly marked as degraded, I replaced it. And, other disks have been replaced since then as well. So it's likely it was a disk issue.

                  – Halfgaar
                  Oct 25 '14 at 12:21

















                Two things that have not been brought up so far:

                1. Is this a SATA RAID controller? If so, SATA cables are prone to aging, and replacing them might solve such issues easily. This is worth trying when disk errors, lags, or timeouts occur but the SMART values are all OK and the drive passes all self-tests. Unfortunately, finding a good SATA cable vendor is difficult.

                2. 3Ware RAID controllers are old and unsupported these days. You will get neither firmware upgrades nor spare parts. If your controller dies, the RAID might be unrecoverable without a matching controller AND firmware, and an expensive data recovery would then be needed.





                    answered Oct 25 '18 at 21:32









                    flohack

                        I used to have issues with a 3ware controller and Seagate drives; there's a subtle firmware incompatibility between them. I switched to Samsung drives and the problem was solved.






                            answered Apr 7 at 19:15









                            Zdenek
