Weird disk failure

March 28, 2016

Hello!

I seem to be having some issues with my array. I added a new disk to the array a few days ago and everything worked fine. However, I noticed some paging faults in the system log, which seemed to occur when I wrote to the array, though not every time. Regardless, everything seemed fine until today when, while I was watching a movie from the server, playback paused for a few seconds and then resumed. It never does that, so I got suspicious. A quick look at the array status showed one of the disks (the one the movie was on) marked with a red 'x'. So I checked the log. It shows several read errors, followed by several write errors (though I did not write to the array). After restarting the server (not the array, though), SMART was disabled on the drive, so I re-enabled it. I'm currently running a long SMART test on the drive, with the array stopped. I've attached the log file; could somebody please advise whether it's an actual disk error? Thanks in advance!

tower-syslog-20160328-1038.zip

March 28, 2016

Community Expert

Writes are normal after a read error: unRAID calculates the correct data by reading all the other disks plus parity and tries to write it back to the problem disk. If that write fails, the disk is disabled.

It's hard to say anything about the disk itself; post diagnostics or wait for the SMART test to finish.
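
As a rough illustration of the reconstruction described above, here is a minimal sketch assuming single XOR parity. The disk contents and the `xor_blocks` helper are made up for the example and are not unRAID code.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte position by byte position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three hypothetical data disks, one 4-byte block each.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x0F, 0x0E, 0x0D, 0x0C])

# Parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# If disk2 returns a read error, its block can be rebuilt from the
# remaining data disks plus parity, then written back to disk2.
rebuilt_disk2 = xor_blocks([disk1, disk3, parity])
```

The write-back of `rebuilt_disk2` is the step that explains why write errors can appear in the log even though nothing was deliberately written to the array.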

March 28, 2016

Community Expert

For V6, always post complete diagnostics zip instead of the syslog.

Do you not have notifications set up? That is one of the best new features of v6, since it would have notified you of the problem when it happened instead of leaving it until whenever you might actually have looked. Without notifications you could get into a state where you have multiple issues, which would make recovery difficult.

Regardless, you will have to rebuild the disk. If you had posted your diagnostics I would have been able to advise whether it was OK to try the rebuild onto the same disk.

See: What do I do if I get a red X next to a hard disk?

March 28, 2016

Author

Thanks for the replies! I've downloaded the diagnostics; I hope they contain some useful info for you. Thank you for your help!

tower-diagnostics-20160328-1757.zip

March 28, 2016

Community Expert

SMART for Disk6 (and all your other disks) looks fine. This usually means the disk is good, though it's not certain.

What I usually do in these cases is replace the disk cables or swap the enclosure, depending on your setup, and rebuild to the same disk. If it fails again, then it's probably a bad disk.

Unrelated to your problem, but there are several entries in your syslog that look to me like memory errors:

Mar 27 17:27:12 Tower kernel: warn_alloc_failed: 199 callbacks suppressed
Mar 27 17:27:12 Tower kernel: swapper/0: page allocation failure: order:0, mode:0x20
Mar 27 17:27:12 Tower kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.18-unRAID #1
Mar 27 17:27:12 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./FM2A88M-HD+, BIOS P1.40 10/14/2013
Mar 27 17:27:12 Tower kernel: 0000000000000020 ffff880057a03bd8 ffffffff815f2403 0000000000000000
Mar 27 17:27:12 Tower kernel: 0000000000000296 0000000000000400 0000000000000000 ffff880057a03c68
Mar 27 17:27:12 Tower kernel: ffffffff810b5820 ffff880057a196e8 00000000ffffffff 0000000000000246
Mar 27 17:27:12 Tower kernel: Call Trace:
Mar 27 17:27:12 Tower kernel: <IRQ>  [<ffffffff815f2403>] dump_stack+0x65/0x85
Mar 27 17:27:12 Tower kernel: [<ffffffff810b5820>] warn_alloc_failed+0x102/0x116
Mar 27 17:27:12 Tower kernel: [<ffffffff810b84ca>] __alloc_pages_nodemask+0x66c/0x7b6
Mar 27 17:27:12 Tower kernel: [<ffffffff815029bc>] __alloc_page_frag+0xa4/0x119
Mar 27 17:27:12 Tower kernel: [<ffffffff81507980>] __alloc_rx_skb+0x4c/0xe1
Mar 27 17:27:12 Tower kernel: [<ffffffff81507a64>] __napi_alloc_skb+0x1b/0x3c
Mar 27 17:27:12 Tower kernel: [<ffffffffa003574b>] rtl8169_poll+0x23d/0x4ba [r8169]
Mar 27 17:27:12 Tower kernel: [<ffffffff81511c96>] net_rx_action+0xe0/0x230
Mar 27 17:27:12 Tower kernel: [<ffffffff8104a589>] __do_softirq+0xc9/0x1be
Mar 27 17:27:12 Tower kernel: [<ffffffff8104a80f>] irq_exit+0x3d/0x82
Mar 27 17:27:12 Tower kernel: [<ffffffff8100ca7a>] do_IRQ+0xb3/0xcd
Mar 27 17:27:12 Tower kernel: [<ffffffff815f862e>] common_interrupt+0x6e/0x6e
Mar 27 17:27:12 Tower kernel: <EOI>  [<ffffffff814ddeee>] ? cpuidle_enter_state+0xb6/0x114
Mar 27 17:27:12 Tower kernel: [<ffffffff814dde87>] ? cpuidle_enter_state+0x4f/0x114
Mar 27 17:27:12 Tower kernel: [<ffffffff814ddf6e>] cpuidle_enter+0x12/0x14
Mar 27 17:27:12 Tower kernel: [<ffffffff810728dc>] cpu_startup_entry+0x1e2/0x2b2
Mar 27 17:27:12 Tower kernel: [<ffffffff815e56b1>] rest_init+0x85/0x89
Mar 27 17:27:12 Tower kernel: [<ffffffff818a7ed4>] start_kernel+0x415/0x422
Mar 27 17:27:12 Tower kernel: [<ffffffff818a78b5>] ? set_init_arg+0x56/0x56
Mar 27 17:27:12 Tower kernel: [<ffffffff818a7120>] ? early_idt_handler_array+0x120/0x120
Mar 27 17:27:12 Tower kernel: [<ffffffff818a74c6>] x86_64_start_reservations+0x2a/0x2c
Mar 27 17:27:12 Tower kernel: [<ffffffff818a75ae>] x86_64_start_kernel+0xe6/0xf5
Mar 27 17:27:12 Tower kernel: Mem-Info:
Mar 27 17:27:12 Tower kernel: active_anon:68913 inactive_anon:5972 isolated_anon:0
Mar 27 17:27:12 Tower kernel: active_file:39498 inactive_file:207779 isolated_file:31
Mar 27 17:27:12 Tower kernel: unevictable:0 dirty:39087 writeback:6611 unstable:0
Mar 27 17:27:12 Tower kernel: slab_reclaimable:12524 slab_unreclaimable:6615
Mar 27 17:27:12 Tower kernel: mapped:5438 shmem:70082 pagetables:1019 bounce:0
Mar 27 17:27:12 Tower kernel: free:1842 free_pcp:46 free_cma:0
Mar 27 17:27:12 Tower kernel: Node 0 DMA free:5644kB min:48kB low:60kB high:72kB active_anon:3052kB inactive_anon:108kB active_file:0kB inactive_file:1524kB unevictable:0kB isolated(anon):0kB isolated(file):124kB present:15988kB managed:15904kB mlocked:0kB dirty:764kB writeback:220kB mapped:208kB shmem:2916kB slab_reclaimable:3360kB slab_unreclaimable:604kB kernel_stack:64kB pagetables:24kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Mar 27 17:27:12 Tower kernel: lowmem_reserve[]: 0 1407 1407 1407
Mar 27 17:27:12 Tower kernel: Node 0 DMA32 free:1724kB min:4596kB low:5744kB high:6892kB active_anon:272600kB inactive_anon:23780kB active_file:157992kB inactive_file:829592kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1477344kB managed:1442568kB mlocked:0kB dirty:155584kB writeback:26224kB mapped:21544kB shmem:277412kB slab_reclaimable:46736kB slab_unreclaimable:25856kB kernel_stack:2848kB pagetables:4052kB unstable:0kB bounce:0kB free_pcp:184kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Mar 27 17:27:12 Tower kernel: lowmem_reserve[]: 0 0 0 0
Mar 27 17:27:12 Tower kernel: Node 0 DMA: 54*4kB (U) 54*8kB (UE) 58*16kB (UM) 1*32kB (R) 1*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 5640kB
Mar 27 17:27:12 Tower kernel: Node 0 DMA32: 3*4kB (UR) 24*8kB (MR) 63*16kB (R) 0*32kB 0*64kB 0*128kB 0*256kB 1*512kB (R) 0*1024kB 0*2048kB 0*4096kB = 1724kB
Mar 27 17:27:12 Tower kernel: 317398 total pagecache pages
Mar 27 17:27:12 Tower kernel: 0 pages in swap cache
Mar 27 17:27:12 Tower kernel: Swap cache stats: add 0, delete 0, find 0/0
Mar 27 17:27:12 Tower kernel: Free swap  = 0kB
Mar 27 17:27:12 Tower kernel: Total swap = 0kB
Mar 27 17:27:12 Tower kernel: 373333 pages RAM
Mar 27 17:27:12 Tower kernel: 0 pages HighMem/MovableOnly
Mar 27 17:27:12 Tower kernel: 8715 pages reserved
Mar 27 17:27:12 Tower kernel: swapper/0: page allocation failure: order:0, mode:0x20
Mar 27 17:27:12 Tower kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.1.18-unRAID #1

Maybe someone can chime in on the possible cause of these, but it's not a good sign.
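
One thing that can be read straight from the Mem-Info dump above is each zone's free memory against its `min` watermark. The sketch below does that for abbreviated copies of the two zone lines; the `zone_watermarks` helper is hypothetical. A zone sitting below `min` would suggest the kernel was briefly short of free pages when the network driver asked for one, which is a different thing from faulty RAM, though that reading is an interpretation and not something confirmed in this thread.

```python
import re

def zone_watermarks(line):
    """Return (zone, free_kb, min_kb) from a 'Node N <zone> free:...' line, or None."""
    m = re.search(r"Node \d+ (\w+) free:(\d+)kB min:(\d+)kB", line)
    if m is None:
        return None
    return m.group(1), int(m.group(2)), int(m.group(3))

# Abbreviated copies of the two zone lines from the log above.
lines = [
    "Mar 27 17:27:12 Tower kernel: Node 0 DMA free:5644kB min:48kB low:60kB high:72kB",
    "Mar 27 17:27:12 Tower kernel: Node 0 DMA32 free:1724kB min:4596kB low:5744kB high:6892kB",
]

# Collect the zones whose free memory has dropped below the min watermark.
below_min = []
for line in lines:
    parsed = zone_watermarks(line)
    if parsed and parsed[1] < parsed[2]:
        below_min.append(parsed[0])

print(below_min)
```

Here only DMA32 is below its watermark (1724kB free against a 4596kB minimum). A Memtest run is still the direct way to rule the RAM itself in or out.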

March 28, 2016

Author

Thanks! I'll wait for the SMART test to finish (it's at 50% now), and if it shows no problems I'll attempt a rebuild onto the same disk.

The memory errors worry me too. They started after I put in the new disk. My RAM modules are pretty decent, but of course they can fail as well. Could it be PSU related? An additional disk is certainly an extra load.

March 28, 2016

No need to worry, TEST! When you can, reboot and select the Memtest option on the unRAID boot screen. Let it run for several passes.

March 28, 2016

Author

Will do! Thanks for the advice!

March 28, 2016

Author

So, the SMART long test completed without error (attached). While I'm relieved that the disk is OK, I'm a bit concerned about what could have caused the problem in the first place. I've re-seated all the cables (both data and power) on all disks, so hopefully it was a one-off problem. I've started a rebuild onto the same disk. I've also run a couple of passes of Memtest and there were no errors. Thanks for all your help!

tower-smart-20160328-2216.zip

March 29, 2016

Author

Well, that didn't last long... I successfully rebuilt the disk (after first replacing its SATA data cable), then moved the rig back to its original place (the only difference being that it's now connected to a UPS rather than directly to a wall outlet). Then, after I started the array, the disk was once again marked with a red 'x'. A quick look at the syslog shows SATA interface errors, then the usual read/write errors. Could it be a problem with the actual SATA port on the motherboard? How can I check? Sadly, I have no spare ports: all 8 on the motherboard and 2 on an add-on board are in use. I've attached the diagnostics; could someone please have a look?

tower-diagnostics-20160329-0714.zip

March 29, 2016

Author

I may have found the problem. I wanted to check that the cables were still properly connected after moving the rig back to its original place. My PSU does not have enough SATA power connectors, so I'm using a couple of 2-in-1 Y cables to attach two drives to one Molex connector. The disk in question is connected to one end of such a Y cable. When I touched the power cable on the disk that gave the error, I heard a sound as if a drive were spinning up. So I checked the array info, and now the other disk, the one connected to the other end of the same Y cable, has logged a very weird (very large) number of reads. So it seems that the power cable is to blame. If this turns out to be the case, should I use the 'trust my array' procedure once I've replaced the cable? I can't post logs now as I had to leave for work.

March 29, 2016

Community Expert

That's a common source of problems, which is why I suggested earlier that you replace the cables. I meant both, not just the SATA cable.

If you haven't written anything to the array while the disk was disabled, you can do a New Config, trust parity and then run a parity check.
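
The "haven't written anything" condition matters because writes to a disabled disk go to the emulated disk, which updates parity while the physical disk stays unchanged. The following is a minimal sketch of that reasoning, assuming single XOR parity and a made-up two-disk array; `xor_bytes` is a hypothetical helper.

```python
def xor_bytes(*blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)

# Hypothetical two-data-disk array; disk2 is the one that got disabled.
disk1 = b"\x01\x02"
disk2_physical = b"\x10\x20"
parity = xor_bytes(disk1, disk2_physical)

# Case 1: no writes while disk2 was disabled.
# Parity still matches what is physically on disk2, so it can be trusted.
valid_without_writes = xor_bytes(disk1, disk2_physical) == parity

# Case 2: a write lands on the emulated disk2. Parity is updated,
# but the physical disk2 still holds the old data.
disk2_emulated = b"\x10\x99"
parity_after_write = xor_bytes(disk1, disk2_emulated)
valid_after_write = xor_bytes(disk1, disk2_physical) == parity_after_write
```

In the second case trusting parity would be wrong, and a rebuild is the safe route; the parity check after a New Config is what would surface any such mismatch.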

March 29, 2016

Author

Thank you for your help! I haven't written anything to the array since the initial issue, so I'll do that. You're right about the cables, of course, but I had no spare power cables, so I thought I'd give it a try anyway, and it appeared to work. I was wrong...

March 29, 2016

Author

A question regarding the 'Trust array' procedure. I've read the Wiki entry on the subject, but it does not mention V6, only that there's a different procedure for V5, and following the link there did not really clarify the correct procedure for me.

So (with the array stopped) I have to issue the 'initconfig' command from a console window, then do I need the 'mdcmd set invalidslot 99' and 'mdcmd check NOCORRECT' after that? Or do I need to do something on the webGui? Or is there a different method for V6?

March 29, 2016

Community Expert

- Take a screenshot of your current array assignments.

- Stop the array, go to Tools and click New Config.

- Reassign all disks, and double-check that the parity disk is in the parity slot.

- Check the box "parity is already valid".

- Start the array.

Then do a parity check.

March 29, 2016

Author

Thank you very much!

March 29, 2016

That wiki page has been one of the most embarrassing pages, so out of date! So I added a quick section for v6 at the top, based on the instructions above. It's just a stopgap for now; the whole page really needs a rewrite.

March 29, 2016

Author

Thanks!

Regarding my issue, it seems that it's the SATA port that is faulty. After I changed the power Y-cable and made a new configuration, the drive was once again logged as faulty. I swapped ports with another drive (I have no free ports, so I had no other way to test), and after another new configuration it's now the other drive that's listed as faulty. It's always ata4.00. Looks like I have to find a SATA card. The problem is that all the cards available in my country seem to use the SiL3132 chipset, which I've read is not recommended... I'm thinking of getting the AOC-SASLP-MV8. Is that a good option? Can I use it in the x16 slot that is normally used for a video card?
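
For anyone repeating the port-swap test, a quick tally of error lines per ATA port makes the "always ata4.00" pattern easy to spot. This is a minimal sketch: the sample lines are illustrative only and are not taken from the attached diagnostics, and the keyword list and `count_ata_errors` helper are assumptions made for the example.

```python
import re
from collections import Counter

# Illustrative lines only; they are NOT copied from the attached diagnostics.
sample_syslog = """\
Mar 29 07:01:10 Tower kernel: ata4.00: exception Emask 0x10 SAct 0x0 SErr 0x4090000 action 0xe frozen
Mar 29 07:01:10 Tower kernel: ata4: hard resetting link
Mar 29 07:01:16 Tower kernel: ata4.00: failed command: READ DMA EXT
Mar 29 07:01:20 Tower kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
"""

# Assumed set of phrases that mark a line as error-related.
ERROR_WORDS = ("exception", "failed command", "hard resetting link", "error")

def count_ata_errors(text):
    """Count error-looking kernel lines per ATA port (ata4.00 and ata4 both count as ata4)."""
    counts = Counter()
    for line in text.splitlines():
        m = re.search(r"kernel: (ata\d+)(?:\.\d+)?: (.*)", line)
        if m and any(word in m.group(2) for word in ERROR_WORDS):
            counts[m.group(1)] += 1
    return counts

errors = count_ata_errors(sample_syslog)
print(errors.most_common())
```

If the errors stay on the same `ataN` after the drives have swapped ports, as happened here, the port (or its cabling) is the more likely culprit than either drive.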

March 29, 2016

I use a Syba Si3132 card, have for years. In my opinion, there's no problem if you make sure it's from a reputable company, AND use only one. I think the problem was with certain knockoffs, especially if there were 2 in the same system. But they're older tech, check out the ASM1061 cards, like this one. They are faster.

March 29, 2016

Author

Thanks for the info. That ASM1061 card sounds good, but it's not available here, and I'm not sure I want to wait a few weeks with my array unprotected... The cards available here are almost exclusively Delock branded. I don't know if it's a reputable company, though... I've found this card with Marvell 9128. Could it be a better choice than the SiL3132s?

March 29, 2016

Lately, Marvell cards are a gamble, you should take a look here -> Marvell disk controller chipsets and virtualization

If you aren't interested in VMs and hardware passthrough, then that card is an option. If it were me, and I wanted something quick and cheap, I'd get the Sil3132 for now and keep an eye out for something better in the future.

March 30, 2016

Author

Thanks for the advice! Although I'm not interested in VMs, I'll get a SiL3132 card instead. That's the quickest option now anyway.
