Server dies completely after copying data for several hours -SMART logs attached

mark_anderson_us · February 6, 2013

I have an HP N36L with 5GB RAM, 2TB + 500GB data disks, and a 2TB parity drive.

All disks pre-cleared without errors. Twice now, I've set 2TB of movies copying to a user share, and after several hours it totally crashes my server. I have to pull the power to get it to start up again. (The server has been running great for the last year with WHS.)

I'm running 5-rc10

Unfortunately, the syslog contains nothing except data since the restart (see below)

Has anyone got any ideas, or is there anything else I can provide to help troubleshoot? I've been trying to get this up and running for two weeks solid now, with pre-clearing, copying massive amounts of data, etc. When this happens I then have to wait 8 hours for the parity check to complete. I can't keep setting copies going that die after a day and leave me back at square one. Every time it happens, I've lost 2 days. I'm sure if I copy smaller batches it might work, but I'm not prepared to risk all my data with something that only works if I copy small amounts of data.

Disk Info

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 ata-Hitachi_HDS5C3020ALA632_ML0220F30WGKWD -> ../../sdc

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 ata-Hitachi_HDS5C3020ALA632_ML0220F30WGKWD-part1 -> ../../sdc1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 ata-Hitachi_HDS5C3020ALA632_ML0220F30X4ZVD -> ../../sdb

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 ata-Hitachi_HDS5C3020ALA632_ML0220F30X4ZVD-part1 -> ../../sdb1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 ata-ST3500320AS_9QM05PF4 -> ../../sda

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 ata-ST3500320AS_9QM05PF4-part1 -> ../../sda1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 scsi-SATA_Hitachi_HDS5C30_ML0220F30WGKWD -> ../../sdc

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 scsi-SATA_Hitachi_HDS5C30_ML0220F30WGKWD-part1 -> ../../sdc1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 scsi-SATA_Hitachi_HDS5C30_ML0220F30X4ZVD -> ../../sdb

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 scsi-SATA_Hitachi_HDS5C30_ML0220F30X4ZVD-part1 -> ../../sdb1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 scsi-SATA_ST3500320AS_9QM05PF4 -> ../../sda

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 scsi-SATA_ST3500320AS_9QM05PF4-part1 -> ../../sda1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 usb-Lexar_JD_FireFly_AAA7ANPDVSR2AQRA-0:0 -> ../../sdd

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 usb-Lexar_JD_FireFly_AAA7ANPDVSR2AQRA-0:0-part1 -> ../../sdd1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 wwn-0x5000c500027e34fc -> ../../sda

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 wwn-0x5000c500027e34fc-part1 -> ../../sda1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 wwn-0x5000cca369cc7cbd -> ../../sdc

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 wwn-0x5000cca369cc7cbd-part1 -> ../../sdc1

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 wwn-0x5000cca369cccd24 -> ../../sdb

lrwxrwxrwx 1 root root 10 2013-02-05 18:05 wwn-0x5000cca369cccd24-part1 -> ../../sdb1

Syslog attached

Syslog.txt

danioj · February 6, 2013

If there is nothing in your NAS logs, perhaps the cause of the restart is unrelated.

hardware failure, heat tolerance levels exceeded etc ...

I had a server I'd been running for ages start dying on me just after I had installed something new, and I thought it was related. It turns out I had neglected cleaning the thing for so long that it was dusty, and by moving it to install something new I must have disturbed things, and suddenly it started overheating ...

Might be way off, but something to consider!

mark_anderson_us · February 6, 2013

Thanks

Maybe I'll get some compressed air at the weekend.

tw332 · February 6, 2013

One of the pros may well respond, but if this is a recurring issue, I think it may be worth trying to capture the log leading up to the server crashes - the log attached doesn't cover much.

Are you able to mount the disk(s) with the existing data in the unRAID server? It would probably transfer faster and may prevent the problem.
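
For capturing the log leading up to a crash, here is a minimal Python sketch. It periodically appends whatever has been added to the live syslog to a second file and forces it to disk, so the last lines before a hard lock-up survive a power pull. It assumes a Python interpreter is available on the server (it may not be on a stock install), and the function names `copy_new_lines` and `follow` are made up for illustration.

```python
import os
import time


def copy_new_lines(src_path, dst_path, offset):
    """Append bytes added to src_path since `offset` to dst_path.

    Returns the new offset to pass in on the next call.
    """
    size = os.path.getsize(src_path)
    if size < offset:
        # The source shrank (truncated or recreated, e.g. after a reboot),
        # so start reading from the beginning again.
        offset = 0
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            # Ask the OS to push the data to the device now, not later.
            os.fsync(dst.fileno())
    return offset + len(chunk)


def follow(src_path, dst_path, interval=5.0):
    """Keep copying new syslog content every `interval` seconds (runs forever)."""
    offset = 0
    while True:
        offset = copy_new_lines(src_path, dst_path, offset)
        time.sleep(interval)
```

It would be started with something like `follow("/var/log/syslog", "/boot/syslog-copy.txt")`. Both paths are assumptions: the first is the usual live syslog location, the second is a hypothetical file on the flash drive, which keeps its contents across the reboot.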

mark_anderson_us · February 6, 2013

So I started a PuTTY session and then another large copy. It failed after a few hours, but I accidentally closed the window: it was similar to what you see below. No panics or anything like that. I don't see anything in the log (see below). I've set it copying again. I'm sure it will fail at some point today.

I ran memtest all night, no errors

I could mount the disks locally, but that's just masking the problem. There's no way I'm trusting my data to something that dies like that (whether it eventually turns out to be HW or unRAID).

This is syslog after it came back up (starting another copy now)

Feb 6 07:19:13 NAS kernel: md: recovery thread checking parity...

Feb 6 07:19:13 NAS kernel: md: using 1536k window, over a total of 1953514552 blocks.

Feb 6 07:19:14 NAS emhttp: shcmd (39): :>/etc/samba/smb-shares.conf

Feb 6 07:19:14 NAS emhttp: get_config_idx: fopen /boot/config/shares/New folder.cfg: No such file or directory - assigning defaults

Feb 6 07:19:14 NAS emhttp: Restart SMB...

Feb 6 07:19:14 NAS emhttp: shcmd (40): killall -HUP smbd

Feb 6 07:19:14 NAS emhttp: shcmd (41): ps axc | grep -q rpc.mountd

Feb 6 07:19:14 NAS emhttp: _shcmd: shcmd (41): exit status: 1

Feb 6 07:19:14 NAS emhttp: shcmd (42): /usr/local/sbin/emhttp_event svcs_restarted

Feb 6 07:19:14 NAS emhttp_event: svcs_restarted

mark_anderson_us · February 6, 2013

Hi Guys

Attached are my SMART logs. Drive A looks worrying. These all have high numbers:

Raw_Read_Error_Rate

Seek_Error_Rate

Hardware_ECC_Recovered

Reallocated_Sector_Ct is 0 though

Drive only has 12K power on hours

B also has some errors

ATA Error Count: 19 (device log contains only the most recent five errors)

smarta.txt

smartb.txt

smartc.txt

tw332 · February 6, 2013

If it were my server, I too would want to get to the bottom of the issue and would rather find out what's wrong than leave it with the possibility of future problems.

In my opinion, with the errors above, it looks like the drive is having difficulty reading part of the data. However, without knowing how the values are changing (e.g. the rate they're increasing, if they are in fact increasing), it would be difficult to determine if that's a drive issue, an interface issue, or something else entirely.

Again, hopefully one of the more experienced members can comment.
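
To follow up on the point about not knowing how the values are changing, here is a rough Python sketch that compares two saved `smartctl -A` attribute tables and lists the raw values that moved between them. The two sample tables are invented numbers for illustration only (they are not taken from the attached logs), and the helper names are hypothetical.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():  # skip raw values that are not plain integers
                attrs[fields[1]] = int(raw)
    return attrs


def changed_attributes(before, after):
    """Return {name: (old, new)} for attributes whose raw value differs."""
    return {
        name: (before[name], after[name])
        for name in after
        if name in before and after[name] != before[name]
    }


# Hypothetical snapshots taken before and after a large copy.
SAMPLE_BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       1000
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       2000
"""

SAMPLE_AFTER = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       1500
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   080   060   030    Pre-fail  Always       -       2000
"""

delta = changed_attributes(
    parse_smart_attributes(SAMPLE_BEFORE),
    parse_smart_attributes(SAMPLE_AFTER),
)
print(delta)
```

Saving the attribute table for each drive before starting a copy and again after a failure, then comparing the two, would show which counters actually moved during the copy rather than just which ones look large.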

dgaschk · February 6, 2013

Attach the entire syslog. Collect the syslog after a failure.

mark_anderson_us · February 6, 2013

Will do. Waiting for it to fail again.
