Parity check failing at same point after multiple tries

December 23, 2013

Hi all,

I've been running a 16-drive (15 + parity) array for almost a year without any issues.

Just had my first issue and hoping for some guidance so I don't lose any data (or minimize data loss).

I'm running unRAID 5.0.4 under ESXi on a Supermicro X9SCM-F-O.

There are two M1015s, both flashed to P15.

I was watching a movie when the stream froze. I checked dmesg on unRAID and found:

sd 2:0:2:0: [sdl] CDB:
cdb[0]=0x88: 88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00
scsi target2:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
scsi target2:0:2: enclosure_logical_id(0x500605b0022a1530), slot(1)
sd 2:0:2:0: task abort: SUCCESS scmd(db84c0c0)
sd 2:0:2:0: attempting task abort! scmd(db84c0c0)
sd 2:0:2:0: [sdl] CDB:
cdb[0]=0x88: 88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00
scsi target2:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
scsi target2:0:2: enclosure_logical_id(0x500605b0022a1530), slot(1)
sd 2:0:2:0: task abort: SUCCESS scmd(db84c0c0)
sd 2:0:2:0: attempting task abort! scmd(db84c0c0)
sd 2:0:2:0: [sdl] CDB:
cdb[0]=0x88: 88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00
scsi target2:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
scsi target2:0:2: enclosure_logical_id(0x500605b0022a1530), slot(1)
sd 2:0:2:0: task abort: SUCCESS scmd(db84c0c0)
sd 2:0:2:0: attempting task abort! scmd(db84c0c0)
sd 2:0:2:0: [sdl] CDB:
cdb[0]=0x88: 88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00
scsi target2:0:2: handle(0x000b), sas_address(0x4433221102000000), phy(2)
scsi target2:0:2: enclosure_logical_id(0x500605b0022a1530), slot(1)
sd 2:0:2:0: task abort: SUCCESS scmd(db84c0c0)
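
For anyone reading the log above: every abort repeats the same CDB, which suggests the drive keeps stalling on one request. Here is a minimal Python sketch that decodes it, assuming opcode 0x88 is a standard SCSI READ(16) (bytes 2-9 hold the big-endian LBA, bytes 10-13 the transfer length) and assuming 512-byte logical sectors. The function name `decode_read16` is made up for illustration.

```python
# CDB bytes copied verbatim from the dmesg excerpt.
CDB_HEX = "88 00 00 00 00 01 3c 46 29 c0 00 00 04 00 00 00"

SECTOR_BYTES = 512  # assumption: the drive reports 512-byte logical sectors


def decode_read16(cdb_hex: str) -> dict:
    """Decode a READ(16) CDB into its starting LBA and block count."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    return {
        "lba": int.from_bytes(cdb[2:10], "big"),      # bytes 2-9
        "blocks": int.from_bytes(cdb[10:14], "big"),  # bytes 10-13
    }


decoded = decode_read16(CDB_HEX)
offset_bytes = decoded["lba"] * SECTOR_BYTES
print(f"LBA {decoded['lba']}, {decoded['blocks']} blocks, "
      f"about {offset_bytes / 1e12:.2f} TB into the drive")
```

Under those assumptions the stalled read starts at LBA 5306198464, roughly 2.72 TB into the drive. That is only a hint about where the trouble is, not a diagnosis.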

I shut down cleanly and rebooted, and a parity check kicked off. It ground along happily at ~90MB/sec until it hit ~77%, then the same errors started appearing on the same drive. It then slowed to XXKB/sec, and the simplefeatures UI on port 80 and the stock UI on port 8080 both became _very_ unresponsive.

I had run a successful parity check 100 days prior with no issues and have also checked all cabling thoroughly.

The drives normally run ~30C. They are now up at 38-40C, because it is 35C outside here and they are grinding through parity.

I stopped the check, rebooted, and started another parity check, with the same result. I will let it plug along at 50KB/sec and hope it eventually speeds up, but the errors are continuing.

I have attached a syslog.

I tried to run smartctl to attach SMART info, but it hung and can't be killed with CTRL+C, so I have attached an older SMART output for the sdl drive instead. Running ps seems to indicate the system had already tried to run smartctl unsuccessfully:

root@Tower:~# ps ax | grep smart
2193 ?        S      0:00 sh -c smartctl -d ata -A /dev/sdl| grep -i temperature
2194 ?        D      0:00 smartctl -d ata -A /dev/sdl
8328 pts/1    D+     0:00 smartctl -a -A /dev/sdl
8848 ?        D      0:00 /usr/sbin/smartctl -n standby -A /dev/sdl
8849 ?        Z      0:00 [smartctl] <defunct>
8850 ?        Z      0:00 [smartctl] <defunct>
8851 ?        Z      0:00 [smartctl] <defunct>
9390 pts/2    S+     0:00 grep smart
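
The STAT column in that listing is worth a second look. On Linux, `D` means uninterruptible sleep (usually waiting on I/O) and `Z` means a zombie. A process in `D` state does not act on signals until its I/O returns, which would explain why CTRL+C did nothing. Below is a rough Python sketch that groups the PIDs by state. It assumes the five-column `ps ax` layout shown above, and `stuck_pids` is a hypothetical helper name.

```python
# Sample copied from the `ps ax | grep smart` output above.
PS_OUTPUT = """\
 2193 ?        S      0:00 sh -c smartctl -d ata -A /dev/sdl| grep -i temperature
 2194 ?        D      0:00 smartctl -d ata -A /dev/sdl
 8328 pts/1    D+     0:00 smartctl -a -A /dev/sdl
 8848 ?        D      0:00 /usr/sbin/smartctl -n standby -A /dev/sdl
 8849 ?        Z      0:00 [smartctl] <defunct>
 8850 ?        Z      0:00 [smartctl] <defunct>
 8851 ?        Z      0:00 [smartctl] <defunct>
 9390 pts/2    S+     0:00 grep smart
"""


def stuck_pids(ps_text: str) -> dict:
    """Group PIDs whose STAT begins with D (uninterruptible) or Z (zombie)."""
    groups = {"D": [], "Z": []}
    for line in ps_text.splitlines():
        fields = line.split(None, 4)  # PID, TTY, STAT, TIME, COMMAND
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        pid, stat = fields[0], fields[2]
        if stat[0] in groups:
            groups[stat[0]].append(int(pid))
    return groups


result = stuck_pids(PS_OUTPUT)
print(result)
```

For this sample, three smartctl processes are blocked in `D` state on /dev/sdl and three more are zombies. That fits a drive that has stopped answering commands, though it does not prove the drive itself is at fault.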

At this point I assume the drive is done, and I would just like guidance on the best way to proceed. I have an identical, brand new (in shrink wrap) 3TB WD Green ready to go.

A couple of drives have thrown ~100-200 errors during the failed parity checks, so I'm thinking I might have multiple drives dying at the same time.

Any suggestions on the best way to proceed to minimize data loss would be greatly appreciated.

Thanks so much in advance and happy holidays to all.

syslog_and_smart.zip

December 23, 2013

Author

Sorry to reply to my own post, but here is a new data point while it runs _very_ slowly through a parity check:

The system appears to be repeatedly running smartctl against the sdl drive.

root@Tower:/var/log# ps auwx | grep smart
root      2606  1.0  0.0   3552  1224 ?        D    22:43   0:00 /usr/sbin/smartctl -n standby -A /dev/sdl
root      2607  1.0  0.0      0     0 ?        Z    22:43   0:00 [smartctl] <defunct>
root      2608  1.0  0.0      0     0 ?        Z    22:43   0:00 [smartctl] <defunct>
root      2609  1.0  0.0      0     0 ?        Z    22:43   0:00 [smartctl] <defunct>
root      2611  0.0  0.0   2448   588 pts/2    R+   22:43   0:00 grep smart
root@Tower:/var/log# ps auwx | grep smart
root      2644  0.0  0.0   3552  1220 ?        D    22:43   0:00 /usr/sbin/smartctl -n standby -A /dev/sdl
root      2645  0.0  0.0      0     0 ?        Z    22:43   0:00 [smartctl] <defunct>
root      2646  0.0  0.0      0     0 ?        Z    22:43   0:00 [smartctl] <defunct>
root      2647  0.0  0.0      0     0 ?        Z    22:43   0:00 [smartctl] <defunct>
root      2649  0.0  0.0    448     4 pts/2    R+   22:43   0:00 grep smart
root@Tower:/var/log# ps auwx | grep smart
root      2709  0.0  0.0   2448   588 pts/2    S+   22:44   0:00 grep smart

Any help would be greatly appreciated. I've been down for days now and really just want to know whether this drive needs to be replaced and, if so, how to replace it.

The system has all drives as green at the moment.

December 23, 2013

I recall reading that if you have simplefeatures installed, keeping the webgui open will cause this.

December 24, 2013

* DaleWilliams shakes his fist

How do we get SimpleFeatures removed from this page:

WIKI: UNRAID5 Plugins

I, too, installed SF early on in 5.0.0Final and suffered through the debug and uninstall process.

Since SF doesn't work consistently, and is unsupported, shouldn't it be removed from the Wiki as being 'officially' compatible with 5.0?

I'm happy to help edit, but the WIKI makes it clear that the plugin author should do the editing.

December 24, 2013

I'm happy to help edit, but the WIKI makes it clear that the plugin author should do the editing.

So don't remove any info, just add a short paragraph near the top of the simplefeatures entry, stating that some people have had issues using simplefeatures in the 5.x release series, and include forum links to some examples. Don't use hyperbole, don't say how you really feel, just state the facts and give examples. Inform, and let people decide for themselves.

December 25, 2013

A lot of people are having issues with SF and version 5. SF is not compatible with version 5. Its intended replacement is the GitHub GUI, which has its own problems.
