mark_anderson_us Posted February 6, 2013

I have an HP N36L with 5GB RAM, 2TB + 500GB data disks and a 2TB parity drive. All disks pre-cleared without errors.

Twice now, I've set 2TB of movies copying to a user share, and after several hours it totally crashes my server. I have to pull the power to get it to start up again. (The server has been running great for the last year with WHS.) I'm running 5-rc10.

Unfortunately, the syslog contains nothing except data since the restart (see below).

Anyone got any ideas, or is there anything else I can provide to help troubleshoot? I've been trying to get this up and running for two weeks solid now, with pre-clearing, copying massive amounts of data, etc. When this happens I then have to wait 8 hours for the parity check to complete. I can't keep setting copies going that die after a day and being back to square one. Every time it happens, I've lost 2 days. I'm sure if I copy smaller batches it might work, but I'm not prepared to risk all my data with something that only works if I copy small amounts of data.
Disk Info

lrwxrwxrwx 1 root root 9 2013-02-05 18:05 ata-Hitachi_HDS5C3020ALA632_ML0220F30WGKWD -> ../../sdc
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 ata-Hitachi_HDS5C3020ALA632_ML0220F30WGKWD-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 ata-Hitachi_HDS5C3020ALA632_ML0220F30X4ZVD -> ../../sdb
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 ata-Hitachi_HDS5C3020ALA632_ML0220F30X4ZVD-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 ata-ST3500320AS_9QM05PF4 -> ../../sda
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 ata-ST3500320AS_9QM05PF4-part1 -> ../../sda1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 scsi-SATA_Hitachi_HDS5C30_ML0220F30WGKWD -> ../../sdc
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 scsi-SATA_Hitachi_HDS5C30_ML0220F30WGKWD-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 scsi-SATA_Hitachi_HDS5C30_ML0220F30X4ZVD -> ../../sdb
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 scsi-SATA_Hitachi_HDS5C30_ML0220F30X4ZVD-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 scsi-SATA_ST3500320AS_9QM05PF4 -> ../../sda
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 scsi-SATA_ST3500320AS_9QM05PF4-part1 -> ../../sda1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 usb-Lexar_JD_FireFly_AAA7ANPDVSR2AQRA-0:0 -> ../../sdd
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 usb-Lexar_JD_FireFly_AAA7ANPDVSR2AQRA-0:0-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 wwn-0x5000c500027e34fc -> ../../sda
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 wwn-0x5000c500027e34fc-part1 -> ../../sda1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 wwn-0x5000cca369cc7cbd -> ../../sdc
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 wwn-0x5000cca369cc7cbd-part1 -> ../../sdc1
lrwxrwxrwx 1 root root 9 2013-02-05 18:05 wwn-0x5000cca369cccd24 -> ../../sdb
lrwxrwxrwx 1 root root 10 2013-02-05 18:05 wwn-0x5000cca369cccd24-part1 -> ../../sdb1

Syslog attached: Syslog.txt
danioj Posted February 6, 2013

If there is nothing in your NAS logs, perhaps the cause of the restart is unrelated to unRAID: hardware failure, heat tolerance levels exceeded, etc.

I had a server I had been running for ages start dying on me right after I installed something new, and I thought the two were related. It turned out I had neglected cleaning the thing for so long that it was full of dust, and by moving it to install the new part I must have disturbed things, so it suddenly started overheating.

Might be way off, but something to consider!
mark_anderson_us (Author) Posted February 6, 2013

Thanks. Maybe I'll get some compressed air at the weekend.
tw332 Posted February 6, 2013

One of the pros may well respond, but if this is a recurring issue, I think it may be worth trying to capture the log leading up to the server crashes. The log attached doesn't cover much.

Are you able to mount the disk(s) with the existing data in the unRAID server? It would probably transfer faster and may prevent the problem.
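One way to capture the log leading up to a crash is to keep mirroring it to storage that survives a hard power-off. The sketch below is only an illustration under several assumptions that are not confirmed anywhere in this thread: that the live log is at /var/log/syslog, that /boot is the persistent flash drive, and that a Python interpreter is available on the server (a stock unRAID install may not include one). The function names and the capture file name are hypothetical.

```python
import os
import time


def copy_new_lines(src_path, dst_path, offset=0):
    """Append everything written to src_path after `offset` to dst_path.

    Returns the new offset. The destination is flushed and fsynced so the
    copied data should survive pulling the power.
    """
    # If the source shrank (for example after log rotation), start over.
    if os.path.getsize(src_path) < offset:
        offset = 0
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
        new_offset = src.tell()
    if chunk:
        with open(dst_path, "ab") as dst:
            dst.write(chunk)
            dst.flush()
            os.fsync(dst.fileno())
    return new_offset


def follow(src_path, dst_path, interval=5.0):
    """Poll src_path forever, mirroring new data to dst_path."""
    offset = 0
    while True:
        offset = copy_new_lines(src_path, dst_path, offset)
        time.sleep(interval)
```

Calling follow("/var/log/syslog", "/boot/syslog-capture.txt") before starting the big copy would leave the last few seconds of log on the flash drive after a crash. Both paths are assumptions; adjust them to wherever the log and the persistent storage actually live.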
mark_anderson_us (Author) Posted February 6, 2013

So I started a PuTTY session and then another large copy. It failed after a few hours, but I accidentally closed the window. It was similar to what you see below: no panics or anything like that. I don't see anything in the log (see below). I've set it copying again. I'm sure it will fail at some point today.

I ran memtest all night, no errors.

I could mount the disks locally, but that's just masking the problem. There's no way I'm trusting my data to something that dies like that (whether it eventually turns out to be HW or unRAID).

This is the syslog after it came back up (starting another copy now):

Feb 6 07:19:13 NAS kernel: md: recovery thread checking parity...
Feb 6 07:19:13 NAS kernel: md: using 1536k window, over a total of 1953514552 blocks.
Feb 6 07:19:14 NAS emhttp: shcmd (39): :>/etc/samba/smb-shares.conf
Feb 6 07:19:14 NAS emhttp: get_config_idx: fopen /boot/config/shares/New folder.cfg: No such file or directory - assigning defaults
Feb 6 07:19:14 NAS emhttp: Restart SMB...
Feb 6 07:19:14 NAS emhttp: shcmd (40): killall -HUP smbd
Feb 6 07:19:14 NAS emhttp: shcmd (41): ps axc | grep -q rpc.mountd
Feb 6 07:19:14 NAS emhttp: _shcmd: shcmd (41): exit status: 1
Feb 6 07:19:14 NAS emhttp: shcmd (42): /usr/local/sbin/emhttp_event svcs_restarted
Feb 6 07:19:14 NAS emhttp_event: svcs_restarted
mark_anderson_us (Author) Posted February 6, 2013

Hi Guys

Attached are my SMART logs. Drive A looks worrying. These all have high numbers:

Raw_Read_Error_Rate
Seek_Error_Rate
Hardware_ECC_Recovered

Reallocated_Sector_Ct is 0 though. The drive only has 12K power-on hours.

B also has some errors:

ATA Error Count: 19 (device log contains only the most recent five errors)

Attachments: smarta.txt smartb.txt smartc.txt
tw332 Posted February 6, 2013

If it were my server, I would also want to get to the bottom of the issue, and would rather find out what's wrong than leave it with the possibility of future problems.

In my opinion, with the errors you listed, it looks like the drive is having difficulty reading part of the data. However, without knowing how the values are changing (e.g. the rate at which they're increasing, if they are in fact increasing), it would be difficult to determine whether that's a drive issue, an interface issue or something else entirely.

Again, hopefully one of the more experienced members can comment.
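To see whether the SMART values are actually changing, two snapshots of the attribute table taken some hours apart can be compared. The sketch below assumes the text comes from `smartctl -A`, whose attribute rows have ten columns ending in RAW_VALUE. The function names are hypothetical, and raw values that are not plain integers are simply skipped. On some drives the raw error-rate counters are large even in normal operation, so the change between snapshots tends to say more than the absolute number.

```python
def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style output."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                continue  # raw value is not a plain integer; skip it
    return attrs


def changed_attributes(before, after):
    """Return {name: (old, new)} for attributes whose raw value changed."""
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] != after[name]
    }
```

Feeding it an earlier and a later dump of the same drive, for example `changed_attributes(parse_smart_attributes(old_text), parse_smart_attributes(new_text))`, lists only the counters that moved between the two snapshots.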
dgaschk Posted February 6, 2013

Attach the entire syslog. Collect the syslog after a failure.
mark_anderson_us (Author) Posted February 6, 2013

Will do. Waiting for it to fail again.
This topic is now archived and is closed to further replies.