Partial crash every day?

May 4, 2012

I have been running unRAID 5.0beta12 for many months without any issues. I have unmenu installed, with only the powerdown on overtemp, clean power down, monthly parity check and cleanup .DS_* packages installed.

Recently (past week) I came home to find that I could not access my file shares.

It looked like emhttpd and unmenu were down, but I could SSH in.

The same symptoms have occurred pretty much every day for the past week. I have tried the following things to get unRAID to shut down cleanly, and none of them has worked. I always end up having to hard reset the machine, which initiates a correcting parity check when I start the array via emhttpd.

I tried restarting emhttpd and unmenu, but nothing happened. Still could not access either.

I tried ctrl-alt-delete from the console, and it seemed to start the shutdown process, but never actually shut down.

I tried executing /sbin/powerdown, which seemed to start the shutdown process, but never actually shut down. Sometimes it said "Powerdown already active, this one is exiting".

I tried executing /sbin/reboot, which seemed to start the shutdown process, but never actually shut down.

Sometimes I was unable to SSH in (but network was still up), or even log in via console (even though it accepted my username and password, it just never got to the prompt -- if I put in a wrong password I got a wrong password error).

After I inevitably hard reboot and start the array, I am able to access my files again, and since the parity check takes so long, I leave it running overnight. The next morning, I always forget to check it before I go to work, but by the time I get home every night unRAID is unresponsive again, and I am forced to hard reset.

I have attached a copy of the syslog from tonight, the output from `ps aux`, and `dmesg`.

The trouble appears to begin at 03:51:58 with:

May  4 03:51:58 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x199800. (System)
May  4 03:51:58 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x80000 (System)
May  4 03:51:58 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x199800. (System)
May  4 03:51:58 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x80000 (System)
May  4 03:51:58 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x199800. (System)
May  4 03:51:58 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x80000 (System)
May  4 03:52:25 unraid kernel: sas: command 0xf4b003c0, task 0xf7162000, timed out: BLK_EH_NOT_HANDLED (Drive related)
May  4 03:52:28 unraid kernel: sas: command 0xf76e2840, task 0xf71623c0, timed out: BLK_EH_NOT_HANDLED (Drive related)
May  4 03:52:28 unraid kernel: sas: Enter sas_scsi_recover_host (Drive related)
May  4 03:52:28 unraid kernel: sas: trying to find task 0xf7162000 (Drive related)
May  4 03:52:28 unraid kernel: sas: sas_scsi_find_task: aborting task 0xf7162000 (Drive related)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1818:<7>mv_abort_task() mvi=f7760000 task=f7162000 slot=f7771640 slot_idx=x2 (System)
May  4 03:52:28 unraid kernel: sas: sas_scsi_find_task: querying task 0xf7162000 (Drive related)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5 (System)
May  4 03:52:28 unraid kernel: sas: sas_scsi_find_task: task 0xf7162000 failed to abort (Minor Issues)
May  4 03:52:28 unraid kernel: sas: task 0xf7162000 is not at LU: I_T recover (Drive related)
May  4 03:52:28 unraid kernel: sas: I_T nexus reset for dev 0300000000000000 (Drive related)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 3 ctrl sts=0x89800. (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 3 irq sts = 0x1001 (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2226:phy3 Unplug Notice (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 3 ctrl sts=0x199800. (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 3 irq sts = 0x1081 (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 3 ctrl sts=0x199800. (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 3 irq sts = 0x10000 (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2253:notify plug in on phy[3] (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1338:port 3 attach dev info is 2020003 (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1340:port 3 attach sas addr is 3 (System)
May  4 03:52:28 unraid kernel: drivers/scsi/mvsas/mv_sas.c 379:phy 3 byte dmaded. (System)
May  4 03:52:28 unraid kernel: sas: sas_form_port: phy3 belongs to port2 already(1)! (Drive related)
May  4 03:52:30 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[2]:rc= 0 (System)
May  4 03:52:30 unraid kernel: sas: I_T 0300000000000000 recovered (Drive related)
May  4 03:52:30 unraid kernel: sas: sas_ata_task_done: SAS error 8d (Errors)
May  4 03:52:30 unraid kernel: sas: trying to find task 0xf71623c0 (Drive related)
May  4 03:52:30 unraid kernel: sas: sas_scsi_find_task: aborting task 0xf71623c0 (Drive related)
May  4 03:52:30 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1818:<7>mv_abort_task() mvi=f7760000 task=f71623c0 slot=f77715d8 slot_idx=x0 (System)
May  4 03:52:30 unraid kernel: sas: sas_scsi_find_task: querying task 0xf71623c0 (Drive related)
May  4 03:52:30 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1747:mvs_query_task:rc= 5 (System)
May  4 03:52:30 unraid kernel: sas: sas_scsi_find_task: task 0xf71623c0 failed to abort (Minor Issues)
May  4 03:52:30 unraid kernel: sas: task 0xf71623c0 is not at LU: I_T recover (Drive related)
May  4 03:52:30 unraid kernel: sas: I_T nexus reset for dev 0200000000000000 (Drive related)
May  4 03:52:30 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x89800. (System)
May  4 03:52:30 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x1001 (System)
May  4 03:52:30 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2226:phy2 Unplug Notice (System)
May  4 03:52:30 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x199800. (System)
May  4 03:52:30 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x1081 (System)
May  4 03:52:31 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x199800. (System)
May  4 03:52:31 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x10000 (System)
May  4 03:52:31 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2253:notify plug in on phy[2] (System)
May  4 03:52:31 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1338:port 2 attach dev info is 42000000 (System)
May  4 03:52:31 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1340:port 2 attach sas addr is 2 (System)
May  4 03:52:31 unraid kernel: drivers/scsi/mvsas/mv_sas.c 379:phy 2 byte dmaded. (System)
May  4 03:52:31 unraid kernel: sas: sas_form_port: phy2 belongs to port1 already(1)! (Drive related)
May  4 03:52:33 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[1]:rc= 0 (System)
May  4 03:52:33 unraid kernel: sas: I_T 0200000000000000 recovered (Drive related)
May  4 03:52:33 unraid kernel: sas: sas_ata_task_done: SAS error 8d (Errors)
May  4 03:52:33 unraid kernel: ata1: sas eh calling libata port error handler (Errors)
May  4 03:52:33 unraid kernel: ata2: sas eh calling libata port error handler (Errors)
May  4 03:52:33 unraid kernel: sas: sas_ata_task_done: SAS error 2 (Errors)
May  4 03:52:33 unraid kernel: ata2: failed to read log page 10h (errno=-5) (Minor Issues)
May  4 03:52:33 unraid kernel: ata2.00: exception Emask 0x1 SAct 0x1 SErr 0x0 action 0x6 t0 (Errors)
May  4 03:52:33 unraid kernel: ata2.00: failed command: READ FPDMA QUEUED (Minor Issues)
May  4 03:52:33 unraid kernel: ata2.00: cmd 60/08:00:d8:a8:04/00:00:4b:00:00/40 tag 0 ncq 4096 in (Drive related)
May  4 03:52:33 unraid kernel:          res 01/04:04:c0:c2:b2/00:00:5a:00:00/40 Emask 0x3 (HSM violation) (Errors)
May  4 03:52:33 unraid kernel: ata2.00: status: { ERR } (Drive related)
May  4 03:52:33 unraid kernel: ata2.00: error: { ABRT } (Errors)
May  4 03:52:33 unraid kernel: ata2: hard resetting link (Minor Issues)
May  4 03:52:33 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x89800. (System)
May  4 03:52:33 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x1001 (System)
May  4 03:52:33 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2226:phy2 Unplug Notice (System)
May  4 03:52:33 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x199800. (System)
May  4 03:52:33 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x11081 (System)
May  4 03:52:33 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2253:notify plug in on phy[2] (System)
May  4 03:52:33 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2278:plugin interrupt but phy2 is gone (System)
May  4 03:52:35 unraid kernel: mvsas 0000:03:00.0: Phy2 : No sig fis (Drive related)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2139:phy2 Attached Device (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x89800. (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x1001 (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2226:phy2 Unplug Notice (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x199800. (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x81 (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2198:port 2 ctrl sts=0x199800. (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2200:Port 2 irq sts = 0x10000 (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 2253:notify plug in on phy[2] (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1338:port 2 attach dev info is 42000000 (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1340:port 2 attach sas addr is 2 (System)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 379:phy 2 byte dmaded. (System)
May  4 03:52:35 unraid kernel: sas: sas_form_port: phy2 belongs to port1 already(1)! (Drive related)
May  4 03:52:35 unraid kernel: drivers/scsi/mvsas/mv_sas.c 1701:mvs_I_T_nexus_reset for device[1]:rc= 0 (System)
May  4 03:52:35 unraid kernel: sas: sas_ata_hard_reset: Found ATA device. (Drive related)
May  4 03:52:35 unraid kernel: ata2.00: configured for UDMA/133 (Drive related)
May  4 03:52:35 unraid kernel: ata2: EH complete (Drive related)
May  4 03:52:35 unraid kernel: ata3: sas eh calling libata port error handler (Errors)
May  4 03:52:35 unraid kernel: ata3.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0 t0 (Errors)
May  4 03:52:35 unraid kernel: ata3.00: failed command: READ FPDMA QUEUED (Minor Issues)
May  4 03:52:35 unraid kernel: ata3.00: cmd 60/00:00:00:33:58/02:00:d5:00:00/40 tag 0 ncq 262144 in (Drive related)
May  4 03:52:35 unraid kernel:          res 41/40:00:68:33:58/00:00:d5:00:00/40 Emask 0x409 (media error) <F> (Errors)
May  4 03:52:35 unraid kernel: ata3.00: status: { DRDY ERR } (Drive related)
May  4 03:52:35 unraid kernel: ata3.00: error: { UNC } (Errors)
May  4 03:52:35 unraid kernel: ata3.00: configured for UDMA/133 (Drive related)
May  4 03:52:35 unraid kernel: ata3: EH complete (Drive related)
May  4 03:52:35 unraid kernel: ata4: sas eh calling libata port error handler (Errors)
May  4 03:52:35 unraid kernel: ata5: sas eh calling libata port error handler (Errors)
May  4 03:52:35 unraid kernel: ata6: sas eh calling libata port error handler (Errors)
May  4 03:52:35 unraid kernel: sas: --- Exit sas_scsi_recover_host (Drive related)
May  4 03:52:35 unraid kernel: sas: sas_ata_task_done: SAS error 2 (Errors)

After which I can only assume that the system is toast in an unrecoverable way, because nothing else interesting appears in the log.

I have no idea what any of the above means. Is one or more than one of my drives failing?
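For anyone reading along, one rough way to see which ATA ports are reporting errors is to tally the libata `error: { ... }` lines per port. The following is only a minimal Python sketch, not part of unRAID or unmenu; the function name `tally_ata_errors` is hypothetical, and the sample lines are copied from the excerpt above.

```python
import re
from collections import defaultdict

# Matches libata lines such as "kernel: ata3.00: error: { UNC }".
ERROR_RE = re.compile(r"kernel: (ata\d+)\.\d+: error: \{ ([A-Z ]+) \}")


def tally_ata_errors(lines):
    """Return {port: {error_flag: count}} for libata 'error: { ... }' lines."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            port, flags = match.groups()
            for flag in flags.split():
                counts[port][flag] += 1
    # Convert nested defaultdicts to plain dicts for easy printing.
    return {port: dict(flags) for port, flags in counts.items()}


# Sample lines taken from the syslog excerpt in this post.
sample = [
    "May  4 03:52:33 unraid kernel: ata2.00: error: { ABRT } (Errors)",
    "May  4 03:52:35 unraid kernel: ata3.00: status: { DRDY ERR } (Drive related)",
    "May  4 03:52:35 unraid kernel: ata3.00: error: { UNC } (Errors)",
]
print(tally_ata_errors(sample))
```

On these lines it reports one ABRT on ata2 and one UNC on ata3. Which /dev/sdX device each ataN port corresponds to is not shown in the excerpt and would have to be looked up elsewhere in the syslog.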

Is it a temperature issue? We are getting into the colder months now, and I didn't have any problems through the height of summer.

My power supply is a Seasonic X-560 80Plus Gold 560W, which I think should be enough to power 5 WD green drives and an SSD.

I run the simplest setup I can inside ESXi 5.0, with my PCIe controller passed directly through. This has worked fine for many months until now.

How do I stop this cycle?

Thanks for any help!

unraid-dmesg-ps-syslog.zip

May 4, 2012

Author

This is what happens when I run powerdown from SSH:

root@unraid:~# powerdown 
Capturing information to syslog. Please wait...
version[7207]: Linux version 3.0.3-unRAID (root@unraid) (gcc version 4.4.4 (GCC) ) #1 SMP Mon Oct 10 11:59:41 EST 2011
ls: cannot access /dev/hd[a-z]: No such file or directory

Syslog:

May  4 20:10:11 unraid root: Powerdown initiated
May  4 20:10:11 unraid rc.unRAID[7206]: Stopping unRAID.
May  4 20:10:11 unraid version[7207]: Linux version 3.0.3-unRAID (root@unraid) (gcc version 4.4.4 (GCC) ) #1 SMP Mon Oct 10 11:59:41 EST 2011
May  4 20:10:11 unraid cmdline[7208]: initrd=bzroot BOOT_IMAGE=bzimage 
May  4 20:10:11 unraid meminfo[7209]: MemTotal:        1033928 kB
May  4 20:10:11 unraid meminfo[7209]: MemFree:          190256 kB
May  4 20:10:11 unraid meminfo[7209]: Buffers:          155460 kB
May  4 20:10:11 unraid meminfo[7209]: Cached:           553260 kB
May  4 20:10:11 unraid meminfo[7209]: SwapCached:            0 kB
May  4 20:10:11 unraid meminfo[7209]: Active:           147468 kB
May  4 20:10:11 unraid meminfo[7209]: Inactive:         265364 kB
May  4 20:10:11 unraid meminfo[7209]: Active(anon):      33324 kB
May  4 20:10:11 unraid meminfo[7209]: Inactive(anon):      108 kB
May  4 20:10:11 unraid meminfo[7209]: Active(file):     114144 kB
May  4 20:10:11 unraid meminfo[7209]: Inactive(file):   265256 kB
May  4 20:10:11 unraid meminfo[7209]: Unevictable:      329184 kB
May  4 20:10:11 unraid meminfo[7209]: Mlocked:               0 kB
May  4 20:10:11 unraid meminfo[7209]: HighTotal:        135112 kB
May  4 20:10:11 unraid meminfo[7209]: HighFree:              0 kB
May  4 20:10:11 unraid meminfo[7209]: LowTotal:         898816 kB
May  4 20:10:11 unraid meminfo[7209]: LowFree:          190256 kB
May  4 20:10:11 unraid meminfo[7209]: SwapTotal:             0 kB
May  4 20:10:11 unraid meminfo[7209]: SwapFree:              0 kB
May  4 20:10:11 unraid meminfo[7209]: Dirty:                 0 kB
May  4 20:10:11 unraid meminfo[7209]: Writeback:             0 kB
May  4 20:10:11 unraid meminfo[7209]: AnonPages:         33288 kB
May  4 20:10:11 unraid meminfo[7209]: Mapped:            12352 kB
May  4 20:10:11 unraid meminfo[7209]: Shmem:               152 kB
May  4 20:10:11 unraid meminfo[7209]: Slab:              57728 kB
May  4 20:10:11 unraid meminfo[7209]: SReclaimable:      47080 kB
May  4 20:10:11 unraid meminfo[7209]: SUnreclaim:        10648 kB
May  4 20:10:11 unraid meminfo[7209]: KernelStack:        1120 kB
May  4 20:10:11 unraid meminfo[7209]: PageTables:         1840 kB
May  4 20:10:11 unraid meminfo[7209]: NFS_Unstable:          0 kB
May  4 20:10:11 unraid meminfo[7209]: Bounce:                0 kB
May  4 20:10:11 unraid meminfo[7209]: WritebackTmp:          0 kB
May  4 20:10:11 unraid meminfo[7209]: CommitLimit:      516964 kB
May  4 20:10:11 unraid meminfo[7209]: Committed_AS:     185872 kB
May  4 20:10:11 unraid meminfo[7209]: VmallocTotal:     122880 kB
May  4 20:10:11 unraid meminfo[7209]: VmallocUsed:        6064 kB
May  4 20:10:11 unraid meminfo[7209]: VmallocChunk:     115552 kB
May  4 20:10:11 unraid meminfo[7209]: DirectMap4k:        6136 kB
May  4 20:10:11 unraid meminfo[7209]: DirectMap2M:      907264 kB
May  4 20:10:11 unraid devices[7210]: Character devices:
May  4 20:10:11 unraid devices[7210]:   1 mem
May  4 20:10:11 unraid devices[7210]:   2 pty
May  4 20:10:11 unraid devices[7210]:   3 ttyp
May  4 20:10:11 unraid devices[7210]:   4 /dev/vc/0
May  4 20:10:11 unraid devices[7210]:   4 tty
May  4 20:10:11 unraid devices[7210]:   4 ttyS
May  4 20:10:11 unraid devices[7210]:   5 /dev/tty
May  4 20:10:11 unraid devices[7210]:   5 /dev/console
May  4 20:10:11 unraid devices[7210]:   5 /dev/ptmx
May  4 20:10:11 unraid devices[7210]:   6 lp
May  4 20:10:11 unraid devices[7210]:   7 vcs
May  4 20:10:11 unraid devices[7210]:  10 misc
May  4 20:10:11 unraid devices[7210]:  13 input
May  4 20:10:11 unraid devices[7210]: 128 ptm
May  4 20:10:11 unraid devices[7210]: 136 pts
May  4 20:10:11 unraid devices[7210]: 180 usb
May  4 20:10:11 unraid devices[7210]: 189 usb_device
May  4 20:10:11 unraid devices[7210]: 202 cpu/msr
May  4 20:10:11 unraid devices[7210]: 203 cpu/cpuid
May  4 20:10:11 unraid devices[7210]: 252 hidraw
May  4 20:10:11 unraid devices[7210]: 253 uio
May  4 20:10:11 unraid devices[7210]: 254 bsg
May  4 20:10:11 unraid devices[7210]: 
May  4 20:10:11 unraid devices[7210]: Block devices:
May  4 20:10:11 unraid devices[7210]:   2 fd
May  4 20:10:11 unraid devices[7210]:   3 ide0
May  4 20:10:11 unraid devices[7210]: 259 blkext
May  4 20:10:11 unraid devices[7210]:   7 loop
May  4 20:10:11 unraid devices[7210]:   8 sd
May  4 20:10:11 unraid devices[7210]:   9 md
May  4 20:10:11 unraid devices[7210]:  22 ide1
May  4 20:10:11 unraid devices[7210]:  65 sd
May  4 20:10:11 unraid devices[7210]:  66 sd
May  4 20:10:11 unraid devices[7210]:  67 sd
May  4 20:10:11 unraid devices[7210]:  68 sd
May  4 20:10:11 unraid devices[7210]:  69 sd
May  4 20:10:11 unraid devices[7210]:  70 sd
May  4 20:10:11 unraid devices[7210]:  71 sd
May  4 20:10:11 unraid devices[7210]: 128 sd
May  4 20:10:11 unraid devices[7210]: 129 sd
May  4 20:10:11 unraid devices[7210]: 130 sd
May  4 20:10:11 unraid devices[7210]: 131 sd
May  4 20:10:11 unraid devices[7210]: 132 sd
May  4 20:10:11 unraid devices[7210]: 133 sd
May  4 20:10:11 unraid devices[7210]: 134 sd
May  4 20:10:11 unraid devices[7210]: 135 sd
May  4 20:10:11 unraid interrupts[7211]:            CPU0       
May  4 20:10:11 unraid interrupts[7211]:   0:         31   IO-APIC-edge      timer
May  4 20:10:11 unraid interrupts[7211]:   1:         68   IO-APIC-edge      i8042
May  4 20:10:11 unraid interrupts[7211]:   6:          3   IO-APIC-edge      floppy
May  4 20:10:11 unraid interrupts[7211]:   7:          0   IO-APIC-edge      parport0
May  4 20:10:11 unraid interrupts[7211]:   9:          0   IO-APIC-fasteoi   acpi
May  4 20:10:11 unraid interrupts[7211]:  12:          4   IO-APIC-edge      i8042
May  4 20:10:11 unraid interrupts[7211]:  14:          0   IO-APIC-edge      ide0
May  4 20:10:11 unraid interrupts[7211]:  15:          0   IO-APIC-edge      ide1
May  4 20:10:11 unraid interrupts[7211]:  16:       1506   IO-APIC-fasteoi   ehci_hcd:usb1
May  4 20:10:11 unraid interrupts[7211]:  18:   39032320   IO-APIC-fasteoi   uhci_hcd:usb2, mvsas
May  4 20:10:11 unraid interrupts[7211]:  19:    2819315   IO-APIC-fasteoi   eth0
May  4 20:10:11 unraid interrupts[7211]:  40:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  41:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  42:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  43:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  44:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  45:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  46:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  47:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  48:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  49:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  50:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  51:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  52:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  53:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  54:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  55:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  56:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  57:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  58:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  59:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  60:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  61:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  62:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  63:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  64:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  65:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  66:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  67:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  68:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  69:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  70:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  71:          0   PCI-MSI-edge      PCIe PME
May  4 20:10:11 unraid interrupts[7211]:  72:          0   PCI-MSI-edge      vmci
May  4 20:10:11 unraid interrupts[7211]:  73:          0   PCI-MSI-edge      vmci
May  4 20:10:11 unraid interrupts[7211]: NMI:          0   Non-maskable interrupts
May  4 20:10:11 unraid interrupts[7211]: LOC:    8141084   Local timer interrupts
May  4 20:10:11 unraid interrupts[7211]: SPU:          0   Spurious interrupts
May  4 20:10:11 unraid interrupts[7211]: PMI:          0   Performance monitoring interrupts
May  4 20:10:11 unraid interrupts[7211]: IWI:          0   IRQ work interrupts
May  4 20:10:11 unraid interrupts[7211]: RES:          0   Rescheduling interrupts
May  4 20:10:11 unraid interrupts[7211]: CAL:          0   Function call interrupts
May  4 20:10:11 unraid interrupts[7211]: TLB:          0   TLB shootdowns
May  4 20:10:11 unraid interrupts[7211]: TRM:          0   Thermal event interrupts
May  4 20:10:11 unraid interrupts[7211]: THR:          0   Threshold APIC interrupts
May  4 20:10:11 unraid interrupts[7211]: MCE:          0   Machine check exceptions
May  4 20:10:11 unraid interrupts[7211]: MCP:        272   Machine check polls
May  4 20:10:11 unraid interrupts[7211]: ERR:          0
May  4 20:10:11 unraid interrupts[7211]: MIS:          0
May  4 20:10:11 unraid ioports[7212]: 0000-0cf7 : PCI Bus 0000:00
May  4 20:10:11 unraid ioports[7212]:   0000-001f : dma1
May  4 20:10:11 unraid ioports[7212]:   0020-0021 : pic1
May  4 20:10:11 unraid ioports[7212]:   0040-0043 : timer0
May  4 20:10:11 unraid ioports[7212]:   0050-0053 : timer1
May  4 20:10:11 unraid ioports[7212]:   0060-0060 : keyboard
May  4 20:10:11 unraid ioports[7212]:   0064-0064 : keyboard
May  4 20:10:11 unraid ioports[7212]:   0070-0077 : rtc
May  4 20:10:11 unraid ioports[7212]:   0080-008f : dma page reg
May  4 20:10:11 unraid ioports[7212]:   00a0-00a1 : pic2
May  4 20:10:11 unraid ioports[7212]:   00c0-00df : dma2
May  4 20:10:11 unraid ioports[7212]:   00f0-00ff : fpu
May  4 20:10:11 unraid ioports[7212]:   0170-0177 : 0000:00:07.1
May  4 20:10:11 unraid ioports[7212]:     0170-0177 : piix
May  4 20:10:11 unraid ioports[7212]:   01f0-01f7 : 0000:00:07.1
May  4 20:10:11 unraid ioports[7212]:     01f0-01f7 : piix
May  4 20:10:11 unraid ioports[7212]:   0376-0376 : 0000:00:07.1
May  4 20:10:11 unraid ioports[7212]:     0376-0376 : piix
May  4 20:10:11 unraid ioports[7212]:   0378-037a : parport0
May  4 20:10:11 unraid ioports[7212]:   03c0-03df : vga+
May  4 20:10:11 unraid ioports[7212]:   03f2-03f2 : floppy
May  4 20:10:11 unraid ioports[7212]:   03f4-03f5 : floppy
May  4 20:10:11 unraid ioports[7212]:   03f6-03f6 : 0000:00:07.1
May  4 20:10:11 unraid ioports[7212]:     03f6-03f6 : piix
May  4 20:10:11 unraid ioports[7212]:   03f7-03f7 : floppy
May  4 20:10:11 unraid ioports[7212]:   03f8-03ff : serial
May  4 20:10:11 unraid ioports[7212]:   0cf0-0cf1 : pnp 00:01
May  4 20:10:11 unraid ioports[7212]: 0cf8-0cff : PCI conf1
May  4 20:10:11 unraid ioports[7212]: 0d00-feff : PCI Bus 0000:00
May  4 20:10:11 unraid ioports[7212]:   1000-103f : 0000:00:07.3
May  4 20:10:11 unraid ioports[7212]:     1000-103f : pnp 00:01
May  4 20:10:11 unraid ioports[7212]:       1000-1003 : ACPI PM1a_EVT_BLK
May  4 20:10:11 unraid ioports[7212]:       1004-1005 : ACPI PM1a_CNT_BLK
May  4 20:10:11 unraid ioports[7212]:       1008-100b : ACPI PM_TMR
May  4 20:10:11 unraid ioports[7212]:       100c-100f : ACPI GPE0_BLK
May  4 20:10:11 unraid ioports[7212]:       1010-1015 : ACPI CPU throttle
May  4 20:10:11 unraid ioports[7212]:   1040-104f : 0000:00:07.3
May  4 20:10:11 unraid ioports[7212]:     1040-104f : pnp 00:01
May  4 20:10:11 unraid ioports[7212]:   1060-107f : pnp 00:0d
May  4 20:10:11 unraid ioports[7212]:   1080-10bf : 0000:00:07.7
May  4 20:10:11 unraid ioports[7212]:     1080-10bf : vmci
May  4 20:10:11 unraid ioports[7212]:   10c0-10cf : 0000:00:07.1
May  4 20:10:11 unraid ioports[7212]:     10c0-10cf : piix
May  4 20:10:11 unraid ioports[7212]:   10d0-10df : 0000:00:0f.0
May  4 20:10:11 unraid ioports[7212]:   2000-3fff : PCI Bus 0000:02
May  4 20:10:11 unraid ioports[7212]:     2000-203f : 0000:02:01.0
May  4 20:10:11 unraid ioports[7212]:       2000-203f : e1000
May  4 20:10:11 unraid ioports[7212]:     2040-205f : 0000:02:00.0
May  4 20:10:11 unraid ioports[7212]:       2040-205f : uhci_hcd
May  4 20:10:11 unraid ioports[7212]:   4000-4fff : PCI Bus 0000:03
May  4 20:10:11 unraid ioports[7212]:     4000-407f : 0000:03:00.0
May  4 20:10:11 unraid ioports[7212]:       4000-407f : mvsas
May  4 20:10:11 unraid ioports[7212]:   5000-5fff : PCI Bus 0000:0b
May  4 20:10:11 unraid ioports[7212]:   6000-6fff : PCI Bus 0000:13
May  4 20:10:11 unraid ioports[7212]:   7000-7fff : PCI Bus 0000:1b
May  4 20:10:11 unraid ioports[7212]:   8000-8fff : PCI Bus 0000:04
May  4 20:10:11 unraid ioports[7212]:   9000-9fff : PCI Bus 0000:0c
May  4 20:10:11 unraid ioports[7212]:   a000-afff : PCI Bus 0000:14
May  4 20:10:11 unraid ioports[7212]:   b000-bfff : PCI Bus 0000:1c
May  4 20:10:11 unraid ioports[7212]:   c000-cfff : PCI Bus 0000:05
May  4 20:10:11 unraid ioports[7212]:   d000-dfff : PCI Bus 0000:0d
May  4 20:10:11 unraid ioports[7212]:   e000-efff : PCI Bus 0000:15
May  4 20:10:11 unraid dma[7213]:  2: floppy
May  4 20:10:11 unraid dma[7213]:  4: cascade
May  4 20:10:11 unraid mounts[7214]: rootfs / rootfs rw,relatime 0 0
May  4 20:10:11 unraid mounts[7214]: proc /proc proc rw,relatime 0 0
May  4 20:10:11 unraid mounts[7214]: sysfs /sys sysfs rw,relatime 0 0
May  4 20:10:11 unraid mounts[7214]: tmpfs /dev tmpfs rw,relatime,mode=755 0 0
May  4 20:10:11 unraid mounts[7214]: devpts /dev/pts devpts rw,relatime,gid=5,mode=620 0 0
May  4 20:10:11 unraid mounts[7214]: fusectl /sys/fs/fuse/connections fusectl rw,relatime 0 0
May  4 20:10:11 unraid mounts[7214]: /dev/sda1 /boot vfat rw,noatime,nodiratime,fmask=0000,dmask=0000,allow_utime=0022,codepage=cp437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro 0 0
May  4 20:10:11 unraid mounts[7214]: /dev/md4 /mnt/disk4 reiserfs rw,noatime,nodiratime,user_xattr,acl 0 0
May  4 20:10:11 unraid mounts[7214]: /dev/md2 /mnt/disk2 reiserfs rw,noatime,nodiratime,user_xattr,acl 0 0
May  4 20:10:11 unraid mounts[7214]: /dev/md5 /mnt/disk5 reiserfs rw,noatime,nodiratime,user_xattr,acl 0 0
May  4 20:10:11 unraid mounts[7214]: /dev/md3 /mnt/disk3 reiserfs rw,noatime,nodiratime,user_xattr,acl 0 0
May  4 20:10:11 unraid mounts[7214]: /dev/md1 /mnt/disk1 reiserfs rw,noatime,nodiratime,user_xattr,acl 0 0
May  4 20:10:11 unraid mounts[7214]: shfs /mnt/user fuse.shfs rw,nosuid,nodev,noatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
May  4 20:10:11 unraid diskstats[7215]:    2       0 fd0 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    7       0 loop0 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    7       1 loop1 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    7       2 loop2 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    7       3 loop3 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    7       4 loop4 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    7       5 loop5 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    7       6 loop6 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    7       7 loop7 0 0 0 0 0 0 0 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    8       0 sda 422 7632 12919 7970 19 0 45 8040 0 9120 16010
May  4 20:10:11 unraid diskstats[7215]:    8       1 sda1 415 7629 12839 7960 19 0 45 8040 0 9120 16000
May  4 20:10:11 unraid diskstats[7215]:    8      16 sdb 7088371 440329409 3579341232 24182740 24 24 384 220 2 69083830 140921620
May  4 20:10:11 unraid diskstats[7215]:    8      17 sdb1 7088360 440329397 3579341048 24182340 24 24 384 220 2 69083430 140921200
May  4 20:10:11 unraid diskstats[7215]:    8      32 sdc 7148960 442562283 3597689360 48958880 52764 1472743 12204056 12406600 4 71324910 236664230
May  4 20:10:11 unraid diskstats[7215]:    8      33 sdc1 7148949 442562271 3597689176 48958440 52764 1472743 12204056 12406600 4 71324470 236663770
May  4 20:10:11 unraid diskstats[7215]:    8      48 sdd 7159457 442156089 3594522528 32636890 64232 1837254 15211896 9428250 5 70728250 334033500
May  4 20:10:11 unraid diskstats[7215]:    8      49 sdd1 7159446 442156077 3594522344 32636450 64232 1837254 15211896 9428250 5 70727810 334033070
May  4 20:10:11 unraid diskstats[7215]:    8      64 sde 7081552 440698101 3582237616 31645990 14 1 120 20 1 73349650 90014730
May  4 20:10:11 unraid diskstats[7215]:    8      65 sde1 7081541 440698089 3582237432 31645570 14 1 120 20 1 73349230 90014310
May  4 20:10:11 unraid diskstats[7215]:    8      80 sdf 7104503 440561977 3581330888 23866000 4615 692 42456 539870 2 74724530 141144810
May  4 20:10:11 unraid diskstats[7215]:    8      81 sdf1 7104492 440561965 3581330704 23865570 4615 692 42456 539870 2 74724090 141144370
May  4 20:10:11 unraid diskstats[7215]:    8      96 sdg 7122695 441770958 3591148328 49170820 6794 363824 2964944 438800 3 76261020 169512530
May  4 20:10:11 unraid diskstats[7215]:    8      97 sdg1 7122684 441770946 3591148144 49170420 6794 363824 2964944 438800 3 76260610 169512080
May  4 20:10:11 unraid diskstats[7215]:    9       1 md1 56401 0 6169264 0 1525507 0 12204056 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    9       2 md2 1254 0 10904 0 48 0 384 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    9       3 md3 14826 0 1962592 0 5307 0 42456 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    9       4 md4 13570 0 2907024 0 15 0 120 0 0 0 0
May  4 20:10:11 unraid diskstats[7215]:    9       5 md5 39028 0 8853624 0 370618 0 2964944 0 0 0 0
May  4 20:10:11 unraid hdparm[7219]:  HDIO_DRIVE_CMD(identify) failed: Invalid exchange
May  4 20:10:11 unraid hdparm[7219]:  HDIO_GET_IDENTITY failed: Invalid argument
May  4 20:10:11 unraid hdparm[7219]: 
May  4 20:10:11 unraid hdparm[7219]: /dev/sda:
May  4 20:10:32 unraid sshd[7375]: Accepted publickey for root from 192.168.79.157 port 49827 ssh2
May  4 20:10:32 unraid sshd[7379]: lastlog_openseek: Couldn't stat /var/log/lastlog: No such file or directory

May 4, 2012

Don't worry about this:

ls: cannot access /dev/hd[a-z]: No such file or directory

as you probably don't have any IDE disks, and only have SATA disks on your server. (They are all /dev/sd[a-z])

Other than that, the errors you listed show some UNC media errors (uncorrectable read errors on disks), indicating that one of your disks probably has sectors pending re-allocation.

Please post smart reports for each of your disks.

If you have unmenu, they can be obtained from the Disk Management page in it.

Joe L.
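For anyone without unmenu, the same information can be pulled from the command line. The following is a minimal sketch, assuming `smartctl` (from smartmontools) is installed and that the disks appear as `/dev/sd?`; the helper names `parse_smart_attributes` and `collect_reports` are made up for illustration. It only picks out the attributes that usually matter when sectors are pending re-allocation.

```python
import glob
import subprocess

# Attributes that are usually the most telling for a disk with unreadable sectors.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_smart_attributes(report):
    """Return {attribute_name: raw_value} for the watched attributes in `smartctl -A` output."""
    found = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # column 2 is the name and column 10 is the start of the raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            found[fields[1]] = fields[9]
    return found


def collect_reports(pattern="/dev/sd?"):
    """Run `smartctl -A` on every matching device (needs root and smartmontools)."""
    results = {}
    for device in sorted(glob.glob(pattern)):
        proc = subprocess.run(["smartctl", "-A", device], capture_output=True, text=True)
        results[device] = parse_smart_attributes(proc.stdout)
    return results
```

Calling `collect_reports()` as root and printing the result gives one small dictionary per disk, which makes it easier to see at a glance which drive has a non-zero pending or reallocated count.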

May 4, 2012

Author

Of course, I should have thought of that already. Here are the reports. One disk seems to have a lot of raw read errors. Another has a handful. The disks are about 1 year old. Surely a read error should not lock up the whole system for so long and make it impossible to shut down?

Shouldn't unraid just report these errors so I can replace the faulty disk (if it is a faulty disk) and rebuild from parity?

Thanks.

UPDATE: Looks like the two disks with any raw read errors are older EARS disks, and the other four are newer EARX disks. I think the EARS disks could be 2 years old compared to the 1 year old EARX disks.

Should I move data off one or both of these disks, and replace one or both of them as soon as possible?

unraid-smart-reports.txt

May 4, 2012

Yes. Copy the data off the drives. Enter "initconfig" or click "New Config" in version 5 to clear parity and reset the array. Then unassign the drives and rebuild parity. Run pre-clear on the drives and report the "current pending" and reallocated sector counts.

May 5, 2012

Author

I can't seem to copy data off the disks. At some point I get the same symptoms, where the unRAID and unmenu web interfaces become unresponsive and I can't shut down cleanly.

unRAID tells me that parity is valid, but that's after doing a hard reset and starting the array in maintenance mode. If I start normally, it starts a correcting parity check automatically and eventually hangs at some point every time.

How is unRAID going to help me not lose data in this case? Am I supposed to just remove one of the old disks (the one with 1400 raw read errors, not the one with 6), put a new disk in, and rebuild from parity?

If the parity was wrong, at least I could still connect the old disk to any linux system and try to recover data off it?

Can I just remove that disk now and run the array in a degraded state (without any protection against another failure), and still access the reconstructed data, while I wait for my new disk to arrive? (I don't have a spare disk lying around)

Thanks.

May 5, 2012

Author

How can I find out for sure which disk(s) are having a problem? Looking at SMART reports again, the parity disk has raw read errors up to 8 now from 6 yesterday. The other disk still has the same value at 1523.

May 5, 2012

Author

So, I accidentally pulled the wrong disk (didn't double check the serial numbers against the SMART report), and saw that I could "start" the array (unprotected), which I did.

When I realised, I went to put that disk back and pull the correct one.

Now the disk I previously pulled is unassigned, and if I reassign it, it has a blue ball, which means new disk?

This is not a new disk, and as far as I can tell this disk has had no read errors.

Is it now impossible for me to rebuild the disk which does have read errors from parity, because I have "too many missing or invalid" disks?

How can I get unRAID to re-recognise the disks I already have? I am sure there were no writes to the array when I started it with a missing disk.

Alternatively, can I mount all my disks in read-only mode as regular disks (not part of an array) and try to copy all my data off?

May 5, 2012

How can I find out for sure which disk(s) are having a problem? Looking at SMART reports again, the parity disk has raw read errors up to 8 now from 6 yesterday. The other disk still has the same value at 1523.

Raw read errors are meaningless to anybody except the manufacturer. All disks have them. Some disks report them, some do not. Look at the normalized value for that parameter... It is probably unchanged from its starting value.
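To make the point about normalized values concrete, here is a small sketch that classifies one attribute row from `smartctl -A` output by its VALUE, WORST and THRESH columns and deliberately ignores the raw value. The function names are hypothetical, and the rule used (an attribute counts as failed when its normalized value is at or below a non-zero threshold) is the general SMART convention rather than anything specific to unRAID.

```python
def parse_attribute_row(line):
    """Split one `smartctl -A` attribute row into its columns, or return None for other lines."""
    fields = line.split()
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    return {
        "name": fields[1],
        "value": int(fields[3]),   # current normalized value
        "worst": int(fields[4]),   # lowest normalized value ever recorded
        "thresh": int(fields[5]),  # failure threshold set by the manufacturer
        "raw": fields[9],          # vendor-specific, not comparable between models
    }


def normalized_status(value, worst, thresh):
    """Classify an attribute from its normalized columns only."""
    # A threshold of 0 means the attribute can never be flagged as failed.
    if thresh > 0 and value <= thresh:
        return "failing now"
    if thresh > 0 and worst <= thresh:
        return "failed in the past"
    return "ok"
```

A disk whose raw read error count keeps climbing while `normalized_status` still returns "ok" matches the situation described above: the raw number alone says little about the health of the drive.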

May 5, 2012

So, I accidentally pulled the wrong disk (didn't double check the serial numbers against the SMART report), and saw that I could "start" the array (unprotected), which I did.

When I realised, I went to put that disk back and pull the correct one.

Now the disk I previously pulled is unassigned, and if I reassign it, it has a blue ball, which means new disk?

This is not a new disk, and as far as I can tell this disk has had no read errors.

Is it now impossible for me to rebuild the disk which does have read errors from parity, because I have "too many missing or invalid" disks?

How can I get unRAID to re-recognise the disks I already have? I am sure there were no writes to the array when I started it with a missing disk.

Alternatively, can I mount all my disks in read-only mode as regular disks (not part of an array) and try to copy all my data off?

In earlier 4.7 (and prior) versions of unRAID it was possible to force the disks to all be valid using some command line commands. In the later 5.0beta series that changed, as the "md" device driver is re-loaded every time the web-interface is refreshed, un-doing the command you use to force a specific disk invalid. Therefore, the technique no longer works as originally described.

Seek help from lime-technology. Yes, you can force a new disk configuration, but that immediately invalidates parity, preventing any re-construction of a failed disk.

Joe L.

May 5, 2012

Author

So it sounds like I am pretty much screwed. I have no idea which disk could be having an issue, or even whether a disk issue is actually what is causing this problem every day. I can (with help from Tom) force my config to be valid again, but then I lose parity and any chance of rebuilding a failing disk, if there actually is one.

Would my best bet to recover data be to simply mount all my disks as regular disks in read-only mode, run reiserfsck --check on each one, try to copy all the data off each one, hopefully identify the faulty disk (if there is one), and then just try to recover data from that disk with reiserfsck?
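The approach described in this question can be sketched as a dry run that only prints the commands it would use, so nothing is touched until the list has been reviewed. This assumes the array is stopped, that each data disk holds a single ReiserFS partition (for example `/dev/sdd1`), and that `/mnt/recover` is a free mount point; `recovery_commands` is a hypothetical helper, and `/dev/sdX1` is a placeholder, not a real device.

```python
import shlex


def recovery_commands(partition, mount_point):
    """Build (but do not run) the commands to check a partition and mount it read-only."""
    return [
        # Consistency check only; --check reports problems without repairing them.
        ["reiserfsck", "--check", partition],
        ["mkdir", "-p", mount_point],
        # Read-only mount so that copying data off cannot modify the disk contents.
        ["mount", "-t", "reiserfs", "-o", "ro", partition, mount_point],
    ]


# Print the commands for review instead of executing them.
for command in recovery_commands("/dev/sdX1", "/mnt/recover"):
    print(shlex.join(command))
```

Printing rather than executing is a deliberate choice here: with a disk that may be failing, it is safer to read each command and run it by hand than to let a script work through every drive unattended.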

May 5, 2012

So it sounds like I am pretty much screwed. I have no idea which disk could be having an issue, or even whether a disk issue is actually what is causing this problem every day. I can (with help from Tom) force my config to be valid again, but then I lose parity and any chance of rebuilding a failing disk, if there actually is one.

Would my best bet to recover data be to simply mount all my disks as regular disks in read-only mode, run reiserfsck --check on each one, try to copy all the data off each one, hopefully identify the faulty disk (if there is one), and then just try to recover data from that disk with reiserfsck?

Plug all your disks back in, where they originally were. Assuming you did not harm any in plugging/un-plugging them, and they really are not bad other than the few un-readable sectors that you might have, they should all work. Perhaps unRAID will then let you start the array.

Oh yes.... Label your disks with their serial numbers where it is easily visible. Always go by the serial number.
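One way to get the serial numbers without opening the case is to read the udev symlinks, whose names normally embed the model and serial. This is a sketch that assumes the system populates `/dev/disk/by-id` (common on Linux, not verified here for every unRAID release); `devices_by_id` is a made-up helper name.

```python
import os


def devices_by_id(by_id_dir="/dev/disk/by-id"):
    """Map each whole-disk device node to the by-id names that point at it."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        # Skip partition links such as "...-part1"; only whole disks are of interest.
        if "-part" in name:
            continue
        target = os.path.realpath(os.path.join(by_id_dir, name))
        mapping.setdefault(target, []).append(name)
    return mapping
```

Printing `devices_by_id()` before touching any cables gives a device-to-serial table that can be compared against the SMART reports and the labels on the drives.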

Always stop the array and SAVE A COPY OF THE CONFIG DIRECTORY before making any changes... That way, you can restore a disk configuration by replacing the old CONFIG directory files.
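Saving the config directory can be as simple as a timestamped copy. The sketch below assumes the flash drive is mounted at `/boot` so that the config directory is `/boot/config`, and it writes backups to a hypothetical `/boot/config-backups` folder; both paths are assumptions to adjust, and `backup_config` is an illustrative name.

```python
import shutil
import time
from pathlib import Path


def backup_config(config_dir="/boot/config", dest_root="/boot/config-backups"):
    """Copy the whole config directory to a new timestamped folder and return its path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_root) / f"config-{stamp}"
    # copytree creates the destination (and missing parents) and fails if it already exists,
    # so an earlier backup is never silently overwritten.
    shutil.copytree(config_dir, dest)
    return dest
```

Restoring is then a matter of stopping the array and copying the files from the chosen backup folder back over the config directory.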

You could use the technique you described... but as you said, you do not have parity to assist you in re-constructing a truly failed drive.

Joe L.

May 5, 2012

Author

I put all the disks back in as they were, but it still shows the one I originally removed as a blue ball, which I think means new disk? It gives me an option to start the array, and initiate a rebuild. But I can't do that until I know which (if any) disks are faulty. Every day for the past week it seems unRAID has failed to complete a parity check, so I doubt a rebuild would work, especially if I am rebuilding the wrong disk.

There is also an option to start the array in maintenance mode, but will this also initiate a data rebuild?

Must I contact Tom for secret instructions on how to force my configuration to be valid again, without invalidating parity? If I can only do it by invalidating parity, then there is nothing to be gained over mounting the disks directly and trying to copy data off and check/repair with reiserfsck?

Thanks.

May 5, 2012

This looks like the problem disk: WD-WMAZA3547704. It is the only one with pending sectors.

May 5, 2012

Author

Is that the Current_Pending_Sector stat from the SMART report? If so, that is 8 for WDC_WD20EARS-00MVWB0_WD-WMAZA3492132 (which is my parity disk) and 6 for WDC_WD20EARS-00MVWB0_WD-WMAZA3547704.

What should I do next?

Why does a pending re-allocation condition cause unRAID to become unresponsive in this way?

Shouldn't the disk(s) just mark those sectors as bad if it truly can't read from them, which would result in a parity sync error and either I can rebuild from parity, or replace the parity disk depending on which is faulty?

Thanks.

UPDATE: It's actually 7 and 5 for Current_Pending_Sector now, respectively, but it was 8 and 6 when the original SMART report was taken.

May 6, 2012

Author

Well, I actually managed to copy all my data off disks 2-5 onto other machines. One of those disks (disk 2) was one of the EARS drives with a Current_Pending_Sector of 5. Disk 1 had data I don't care about (and don't have enough space on other machines to store temporarily). The parity disk is the other EARS drive and it still has a Current_Pending_Sector of 7.

I emailed support about finding a way to force my configuration to valid again, but heard no response.

So ATM my array is unprotected (but at least I copied the data I care about to another machine), and I can't start it without initiating a rebuild of disk 3 (which I have no reason to believe is faulty, although the parity disk or disk 2 might be, even though I could read all the data off disk 2).

So what is my next step here?

How do I determine which (if any) disk is faulty? I don't really want to run a 24-plus-hour pre-clear on all my drives again, and in the meantime all my data is unprotected on another machine.

How do I get unRAID running again, without becoming unresponsive and initiating a parity check every day that never finishes before becoming unresponsive again?

May 6, 2012

The disks with pending sectors were the initial cause of the problem. They both needed to be rebuilt. Since two drives have unreadable sectors, there is no way to successfully rebuild either of them. The drives with pending sectors should be pre-cleared.

1. Enter "initconfig" or click "New Config" in version 5 to clear parity and reset the array.

2. Unassign the drives with pending sectors.

3. Start the array with the remaining disks assigned and rebuild parity. (Only the data on the 2 unassigned disks will be missing.)

4. Pre-clear the unassigned drives and add them back to the array. (Make sure the pre-clear results look ok.)

May 7, 2012

Author

Since I managed to copy off all my data, I did initconfig and unassigned all drives. I ran the latest preclear on all 6 drives (different consoles). Again, after roughly 7 hours, ALL disk activity seems to have locked up.

I can still switch between each console, and see that all 6 disks are between 91% and 94% of the pre-read phase. They have been going for at least 15 hours, but all 6 consoles show between 7:05 and 7:07 elapsed time (because it took me a couple minutes to start them all individually).
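A stall like this can be confirmed from the kernel's own counters rather than from the preclear display. The sketch below parses `/proc/diskstats` (the same numbers the diskstats lines in the syslog show) and flags devices whose sector counters did not move between two samples while requests were still in flight. The function names are hypothetical, and treating "requests in flight but no progress" as a stall is a heuristic chosen for this example, since an idle disk also shows unchanged counters.

```python
def parse_diskstats(text):
    """Parse /proc/diskstats content into {device: counters} for the fields of interest."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Expect: major, minor, name, then at least 11 counters.
        if len(fields) < 14:
            continue
        stats[fields[2]] = {
            "sectors_read": int(fields[5]),
            "sectors_written": int(fields[9]),
            "in_flight": int(fields[11]),  # I/O requests currently outstanding
        }
    return stats


def sample(path="/proc/diskstats"):
    """Take one snapshot of the counters."""
    with open(path) as handle:
        return parse_diskstats(handle.read())


def stalled_devices(before, after):
    """Devices with outstanding requests whose sector counters did not change."""
    stalled = []
    for name, now in after.items():
        prev = before.get(name)
        if prev is None:
            continue
        unchanged = (
            now["sectors_read"] == prev["sectors_read"]
            and now["sectors_written"] == prev["sectors_written"]
        )
        if unchanged and now["in_flight"] > 0:
            stalled.append(name)
    return sorted(stalled)
```

Taking one `sample()`, waiting a minute or so, taking another and passing both to `stalled_devices` shows which drives have requests queued that never complete. If every disk behind the same controller shows up, the controller or driver becomes as much a suspect as any single drive.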

Is it just me, or should unRAID not do this, even if one or even two disks are bad?

unRAID has never actually reported any errors to me. It just kept locking up every day. I had to rely on a 3rd party tool like unmenu just to view SMART status and take a guess at which drives might be causing trouble, but the only way to find out is to actually wipe the suspected drives and hope for another lockup?

At this point I'm pretty disappointed with unRAID's behaviour. I've still heard nothing from Lime Tech after emailing for some official support, too.

May 7, 2012

Author

I reset and tried to pre-clear just one of the disks, and it froze at 91% again. Then I opened a second console and tried to pre-clear the other disk, and I got this error:

Sorry: /dev//dev/sdb does not exist as a block device
Clearing will NOT be performed

But preclear_disk.sh -l does list /dev/sdb as my EARS *704 drive.

Attached the syslog.

Thanks.

unraid-syslog.txt

May 8, 2012

Author

I tried pre-clearing just the other disk with suspected faults (*704, which was a data disk). It has made it through the initial phase (pre-read?) and is now on step 2 of 10 (4%, 7.5 hours).

So I guess the old parity disk was the one that was causing disk lockups, and the data disk that had sectors pending re-allocation was actually OK.

Will see if the data disk makes it all the way through the pre-clear.
