January 8, 2008 Hi, sometimes I lose the web interface, but unRAID keeps working fine: I can still read and write files on it.
January 9, 2008 Next time this happens, please try to capture the system log via a console and send it to me ([email protected]).
February 23, 2008 Author I still have the same problem. Any ideas? No news from limetech yet, even though I sent my log file.
February 25, 2008 Author Yes, of course. My log file is too big to attach, so you can download it here: http://ecaillet95.free.fr/log.txt I think I have some other problems, not just the web interface. Can you explain to me what they are? Thanks a lot.
February 25, 2008 There are two problems, completely different, but both need fixing. One is the networking, as you have seen, and the other is the read errors on your parity drive. You may need to consider replacing that drive, or at least run some surface testing on it.

The networking problem acts as if the connection is loose; it could be loose wires in the cable or a bad connector at either end. You often have no Ethernet connection for long hours, then it comes up and stays up for many hours, then it is gone again for quite a while. I would replace the cable, and if that doesn't help, add a network card to your unRAID server. Most network ports have little lights or indicators near them that show an active Ethernet connection, and possibly the speed. Playing with the cable and connectors while watching the lights might help in diagnosing why it goes up and down.

Very large syslogs are usually large because of a lot of redundant errors, so they compress way down. A zipped copy can usually be attached. (I'm very sleepy, hope I didn't make any mistakes.)
February 25, 2008 Same problem here. It happens when I go to the shares page on the web interface. My syslog: http://rapidshare.com/files/94928555/Syslog.zip.html Can someone take a look at it? Thanks
February 25, 2008 Author RobJ, thanks a lot for your answer. How can I do surface testing on my HDD under unRAID? For the second problem I will replace the cable, it's a good idea. My current Ethernet card does have a little light; it's a D-Link DGE-528T, recommended by a lot of unRAID users. Normally it works fine. I'll report the results in a future post...
February 26, 2008 "How can I do surface testing on my HDD under unRAID?"
I don't know of a way within unRAID; perhaps someone else does. Do you have SpinRite or HDD Regenerator, or a similar tool, perhaps on a LiveCD? Suggestions from another user (WeeboTech) that I haven't tried yet but that sound interesting: http://lime-technology.com/forum/index.php?topic=1487.msg10166#msg10166.
To Trek_B: It is better 'Netiquette' to start a new topic for your problem. I did take a look at your syslog, and it does not appear to be similar to tafkap's problems: you have no drive errors and no apparent network problems. You do have error handlers running in the process list. When you start a new topic, provide details about your hardware and what problems you are seeing.
February 27, 2008 "To Trek_B: It is better 'Netiquette' to start a new topic for your problem. I did take a look at your syslog, and it does not appear to be similar to tafkap's problems: you have no drive errors and no apparent network problems. You do have error handlers running in the process list. When you start a new topic, provide details about your hardware and what problems you are seeing."
Thanks for taking a look at my syslog. Regarding the lack of 'Netiquette', I think you did not understand correctly. As I said, I have the same problem: sometimes unRAID loses the web interface while network access is OK, exactly like Tafkap (read the first post in this topic). Since the problem is the same, I believe it is better to use this topic, to avoid splitting the information across several topics. I think this must be some kind of bug. It started occurring randomly, but now, if I go to the shares page, unRAID loses the web interface. Regards
February 27, 2008 "How can I do surface testing on my HDD under unRAID?"
"I don't know of a way within unRAID; perhaps someone else does. Do you have SpinRite or HDD Regenerator, or a similar tool, perhaps on a LiveCD? Suggestions from another user (WeeboTech) that I haven't tried yet but that sound interesting: http://lime-technology.com/forum/index.php?topic=1487.msg10166#msg10166."

I'd recommend looking at this thread (http://lime-technology.com/forum/index.php?topic=1521.msg10384#msg10384). There is an attachment called smartctl.zip on one of the posts. Follow the instructions and run it on your parity drive. Review the results and see if there is anything alarming. Feel free to post the results as well, and folks on this forum will give you their take.

In general, unRAID is aware of, and makes good choices about, the sector remapping (SMART) features of hard drives. If you get a read error on your parity drive, unRAID is (should be) doing the right things to get the drive to remap the bad sector, and using its redundancy features to correctly rewrite that sector. In other words, once it has happened, unRAID will fix it and it should not happen again. If you are getting the same error multiple times, there are a few possibilities:
1 - You do not have valid parity. unRAID can only do its magic if parity is intact.
2 - The drive has used up its supply of remappable sectors (not a good thing!). smartctl should show this. If this is the case, RMA the drive ASAP.
3 - Your errors ARE being corrected, but new read errors are cropping up on the drive (a very bad thing!). Running smartctl before and after a parity check and comparing the results should tell you. RMA the drive ASAP if bad sectors are developing like this.
4 - unRAID has a bug and somehow is not doing its thing. If it isn't 1, 2, or 3, and your smartctl report looks OK, I'd report the issue to Tom and follow his suggestions to help diagnose.
-Brian
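For possibility 3, the before/after comparison can be done by eye, but here is a rough sketch in Python of how it might be automated. It assumes you have saved the text of two `smartctl -a` runs; the function names and the list of watched attributes are my own choices for illustration, and the "after" row below is made up to show the idea, not a real reading.

```python
# Attribute names as they appear in smartctl -a output; which ones to watch
# is an assumption made for this sketch.
WATCHED = ("Reallocated_Sector_Ct", "Reallocated_Event_Count",
           "Current_Pending_Sector", "Offline_Uncorrectable")


def raw_values(report: str) -> dict:
    """Map attribute name -> RAW_VALUE for every attribute row in a report."""
    values = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x") and fields[9].isdigit()):
            values[fields[1]] = int(fields[9])
    return values


def grew(before: str, after: str) -> dict:
    """Return {name: (old, new)} for watched attributes whose raw value rose."""
    b, a = raw_values(before), raw_values(after)
    return {name: (b[name], a[name]) for name in WATCHED
            if name in b and name in a and a[name] > b[name]}


# One real-looking row for "before"; the "after" row is HYPOTHETICAL.
BEFORE = "  5 Reallocated_Sector_Ct 0x0033 086 086 010 Pre-fail Always - 138"
AFTER = "  5 Reallocated_Sector_Ct 0x0033 085 085 010 Pre-fail Always - 150"

print(grew(BEFORE, AFTER))
```

If the result is non-empty after a parity check, new sectors were remapped during the check, which is the situation described in point 3.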
February 29, 2008 Author But it's very strange: before, my unRAID server worked very well, and suddenly it doesn't. I'm going to test this, thanks. Where is limetech for help?
February 29, 2008 "But it's very strange: before, my unRAID server worked very well, and suddenly it doesn't. I'm going to test this, thanks. Where is limetech for help?"
This user community is an integral part of the lime-tech support model, so you are actually getting help. Much like open-source software, many smaller (and some not so small) high-tech companies have adopted the distributed support model for solid reasons: it offers better response time at a lower cost. As a kicker, it tends to build loyalty.
Bill
February 29, 2008 "Same problem here. It happens when I go to the shares page on the web interface. My syslog: http://rapidshare.com/files/94928555/Syslog.zip.html Can someone take a look at it? Thanks"
I first tried to look at this and got a page that looked like I needed to sign up for something. Upon closer inspection, I did download your system log. It looks OK, except that at the very end there is a line which is not right:
Jan 25 00:10:17 Vouga emhttp[1358]: get_var: rdevStatus.16 not found
Via a telnet session, please type the following command:
mdcmd status >/boot/status.txt
You should now see a file called 'status.txt' in your Flash share. Please send it to [email protected]
March 3, 2008 Author Hi, this is the result of the command:
smartctl -a -d ata -F samsung /dev/sda

smartctl version 5.36 [i486-slackware-linux-gnu] Copyright © 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HD501LJ
Serial Number:    S0MUJDSP617144
Firmware Version: CR100-10
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Not recognized. Minor revision code: 0x52
Local Time is:    Mon Mar 3 20:56:15 2008 GMT+8

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (8741) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 149) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   099   051    Pre-fail  Always       -       148
  3 Spin_Up_Time            0x0007   100   100   015    Pre-fail  Always       -       7232
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       465
  5 Reallocated_Sector_Ct   0x0033   086   086   010    Pre-fail  Always       -       138
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       558
 10 Spin_Retry_Count        0x0033   253   253   051    Pre-fail  Always       -       0
 11 Calibration_Retry_Count 0x0012   253   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       182
187 Unknown_Attribute       0x0032   001   001   000    Old_age   Always       -       291197
188 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
190 Unknown_Attribute       0x0022   068   048   000    Old_age   Always       -       32
194 Temperature_Celsius     0x0022   142   082   000    Old_age   Always       -       32
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       99698562
196 Reallocated_Event_Count 0x0032   086   086   000    Old_age   Always       -       138
197 Current_Pending_Sector  0x0012   253   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   253   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
201 Soft_Read_Error_Rate    0x000a   098   096   000    Old_age   Always       -       141
202 TA_Increase_Count       0x0032   001   001   000    Old_age   Always       -       32767

SMART Error Log Version: 1
ATA Error Count: 9474 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 9474 occurred at disk power-on lifetime: 11778 hours (490 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 27 c2 25 ea  Error: UNC at LBA = 0x0a25c227 = 170246695

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 27 c2 25 ea 00   3d+21:27:53.088  READ DMA
  ec 00 00 00 00 00 a0 00  21d+18:07:29.280  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  21d+18:07:29.280  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00  21d+18:07:29.280  IDENTIFY DEVICE

Error 9473 occurred at disk power-on lifetime: 11778 hours (490 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 27 c2 25 ea  Error: UNC at LBA = 0x0a25c227 = 170246695

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 27 c2 25 ea 00  47d+04:36:39.040  READ DMA
  ec 00 00 00 00 00 a0 00  15d+12:54:10.688  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  15d+12:54:10.688  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00  15d+12:54:10.688  IDENTIFY DEVICE

Error 9472 occurred at disk power-on lifetime: 11778 hours (490 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 27 c2 25 ea  Error: UNC at LBA = 0x0a25c227 = 170246695

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 27 c2 25 ea 00  40d+23:23:20.448  READ DMA
  ec 00 00 00 00 00 a0 00   9d+07:40:51.840  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   9d+07:40:51.840  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00   9d+07:40:51.840  IDENTIFY DEVICE

Error 9471 occurred at disk power-on lifetime: 11778 hours (490 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 27 c2 25 ea  Error: UNC at LBA = 0x0a25c227 = 170246695

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 27 c2 25 ea 00  34d+18:10:01.600  READ DMA
  ec 00 00 00 00 00 a0 00   3d+02:27:33.248  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00   3d+02:27:33.248  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00   3d+02:27:33.248  IDENTIFY DEVICE

Error 9470 occurred at disk power-on lifetime: 11778 hours (490 days + 18 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 27 c2 25 ea  Error: UNC at LBA = 0x0a25c227 = 170246695

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 27 c2 25 ea 00  28d+12:56:43.008  READ DMA
  ec 00 00 00 00 00 a0 00   8d+22:10:42.048  IDENTIFY DEVICE
  ef 03 46 00 00 00 a0 00  46d+09:36:19.200  SET FEATURES [set transfer mode]
  ec 00 00 00 00 00 a0 00  34d+08:39:51.808  IDENTIFY DEVICE

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective Self-Test Log Data Structure Revision Number (0) should be 1
SMART Selective self-test log data structure revision number 0
Warning: ATA Specification requires selective self-test log data structure revision number = 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So is my parity HDD good? I suppose not, but when I use the command smartctl -H -d ata -F samsung /dev/sda, I get:
SMART overall-health self-assessment test result: PASSED
Thanks for the help
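A side note on reading the error log above: all five logged errors point at the same sector. The "LBA = 0x0a25c227" figure that smartctl prints can be rebuilt from the register dump (SN, CL, CH, DH) using the usual 28-bit LBA layout, where the low four bits of DH hold the top of the address. A small Python sketch, with `lba28` being a name made up for this example:

```python
def lba28(sn: int, cl: int, ch: int, dh: int) -> int:
    """Combine ATA registers into a 28-bit LBA.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23,
    and the low nibble of DH holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# Register values copied from the error entries above: SN=27 CL=c2 CH=25 DH=ea
lba = lba28(0x27, 0xC2, 0x25, 0xEA)
print(hex(lba), lba)
```

This reproduces the 170246695 shown in each of the five entries, so the READ DMA commands are all failing on one and the same sector.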
March 3, 2008 From what I can see in the SMART log, it looks as though there are interface issues. You might want to reseat the cables or just make sure they are all snug. I would suggest you trigger an internal drive test for good measure:
smartctl -t short -d ata -F samsung /dev/sda
or
smartctl -t long -d ata -F samsung /dev/sda
There are 138 reallocated sectors; luckily there are no pending sectors. If an error is detected, the short and long tests will reveal it.
March 3, 2008 He's right, 138 remapped sectors is pretty bad, especially when you only have 558 hours on it. You should not have even one reallocated sector yet. You got a healthy assessment because there are more sectors in reserve that can still be remapped. I would pull this drive from the array and use it elsewhere, with non-essential data, perhaps in an enclosure as an extra backup device, a third backup. Or look into RMA'ing it?
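To illustrate why the report can say PASSED with 138 remapped sectors: as far as I understand it, the health verdict is driven by the normalized VALUE column against THRESH, not by RAW_VALUE. Reallocated_Sector_Ct is at 086 with a threshold of 010, so it has not tripped. Here is a hedged Python sketch of that check; it is an approximation of the idea, not the drive's actual firmware logic, and `failing_prefail` is a name invented for this example.

```python
def failing_prefail(report: str) -> list:
    """Names of Pre-fail attributes whose normalized VALUE is at or below THRESH."""
    failing = []
    for line in report.splitlines():
        f = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(f) >= 10 and f[0].isdigit() and f[6] == "Pre-fail":
            value, thresh = int(f[3]), int(f[5])
            if value <= thresh:
                failing.append(f[1])
    return failing


# Two rows copied from the report posted earlier in this thread.
ROWS = """\
  1 Raw_Read_Error_Rate     0x000f   100   099   051    Pre-fail  Always       -       148
  5 Reallocated_Sector_Ct   0x0033   086   086   010    Pre-fail  Always       -       138
"""

print(failing_prefail(ROWS))
```

With the posted values nothing is flagged, which matches the PASSED verdict even though the raw count is already worrying for a drive this young.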
March 4, 2008 Do the short and long tests. If the drive does not pass, you can RMA it without question.
March 4, 2008 Author OK, thanks to all. Bad news for me, but fortunately I have a new identical disk, so I'm going to swap it in and rebuild my parity.
March 7, 2008 Author Hi all, some news... Today I bought a new Samsung 500 GB disk to replace my parity disk, which seems to be the problem. The system is running and the parity sync is in progress. I also bought a new RJ45 cable. I hope that everything will work fine.
March 10, 2008 Author This morning I lost my web interface again... and my Ethernet cable is new... My log:
Mar 9 21:44:28 Tower kernel: [181300.338672] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338680] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338683] FAT: Directory bread(block 3856) failed
Mar 9 21:44:28 Tower kernel: [181300.338685] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338687] FAT: Directory bread(block 3857) failed
Mar 9 21:44:28 Tower kernel: [181300.338690] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338692] FAT: Directory bread(block 3858) failed
Mar 9 21:44:28 Tower kernel: [181300.338694] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338696] FAT: Directory bread(block 3859) failed
Mar 9 21:44:28 Tower kernel: [181300.338698] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338700] FAT: Directory bread(block 3860) failed
Mar 9 21:44:28 Tower kernel: [181300.338703] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338705] FAT: Directory bread(block 3861) failed
Mar 9 21:44:28 Tower kernel: [181300.338707] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338709] FAT: Directory bread(block 3862) failed
Mar 9 21:44:28 Tower kernel: [181300.338711] scsi 0:0:0:0: rejecting I/O to dead device
Mar 9 21:44:28 Tower kernel: [181300.338713] FAT: Directory bread(block 3863) failed
Mar 9 21:44:28 Tower logger: /etc/rc.d/rc.inet1: /sbin/ifconfig eth0 down
Mar 9 21:44:29 Tower ifplugd(eth0)[1076]: Program executed successfully.
Mar 9 21:44:29 Tower kernel: [181300.397412] r8169: eth0: link down
Mar 10 05:15:26 Tower kernel: [208305.370744] r8169: eth0: link up
Mar 10 05:15:26 Tower ifplugd(eth0)[1076]: Link beat detected.
Mar 10 05:15:27 Tower kernel: [208306.217729] r8169: eth0: link down
Mar 10 05:15:27 Tower ifplugd(eth0)[1076]: Link beat lost.
Mar 10 05:15:29 Tower kernel: [208309.075137] r8169: eth0: link up
Mar 10 05:15:30 Tower ifplugd(eth0)[1076]: Link beat detected.
Mar 10 05:15:31 Tower ifplugd(eth0)[1076]: Executing '/etc/ifplugd/ifplugd.action eth0 up'.
Mar 10 05:15:31 Tower ifplugd(eth0)[1076]: client: /etc/rc.d/rc.inet1.conf: line 18: /boot/config/network.cfg: No such file or directory
Mar 10 05:15:31 Tower logger: /etc/rc.d/rc.inet1: /sbin/ifconfig eth0 192.168.0.2 broadcast 192.168.0.255 netmask 255.255.255.0
Mar 10 05:15:31 Tower ifplugd(eth0)[1076]: Program executed successfully.
Mar 10 05:16:25 Tower kernel: [208364.419099] r8169: eth0: link down
Mar 10 05:16:25 Tower ifplugd(eth0)[1076]: Link beat lost.
Mar 10 05:16:27 Tower kernel: [208366.378689] r8169: eth0: link up
Mar 10 05:16:27 Tower ifplugd(eth0)[1076]: Link beat detected.
Mar 10 05:42:19 Tower in.telnetd[3227]: connect from 192.168.0.5 (192.168.0.5)
Mar 10 05:42:21 Tower login[3228]: ROOT LOGIN on `pts/0' from `192.168.0.5'
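For anyone wading through a long syslog like this, the link drops can be pulled out mechanically. Below is a rough Python sketch that pairs each "link down" with the next "link up" and measures the gap using the kernel timestamps in square brackets (seconds since boot). The regular expression is tied to the r8169/eth0 wording in this particular log, and the function name is made up for this example, so treat it as a starting point only.

```python
import re

# Matches e.g. "[181300.397412] r8169: eth0: link down"
LINK = re.compile(r"\[\s*(\d+\.\d+)\]\s+r8169: eth0: link (up|down)")


def outages(syslog: str) -> list:
    """Length in seconds of each link-down period, from kernel timestamps."""
    result, down_at = [], None
    for line in syslog.splitlines():
        m = LINK.search(line)
        if not m:
            continue
        stamp, state = float(m.group(1)), m.group(2)
        if state == "down" and down_at is None:
            down_at = stamp                  # start of an outage
        elif state == "up" and down_at is not None:
            result.append(stamp - down_at)   # outage ended
            down_at = None
    return result


# The six link events copied from the log above.
LOG = """\
Mar 9 21:44:29 Tower kernel: [181300.397412] r8169: eth0: link down
Mar 10 05:15:26 Tower kernel: [208305.370744] r8169: eth0: link up
Mar 10 05:15:27 Tower kernel: [208306.217729] r8169: eth0: link down
Mar 10 05:15:29 Tower kernel: [208309.075137] r8169: eth0: link up
Mar 10 05:16:25 Tower kernel: [208364.419099] r8169: eth0: link down
Mar 10 05:16:27 Tower kernel: [208366.378689] r8169: eth0: link up
"""

for seconds in outages(LOG):
    print(f"link was down for {seconds / 3600:.2f} h")
```

On this excerpt it finds three outages: one of roughly seven and a half hours overnight, followed by two short flaps of a few seconds each.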
March 10, 2008 If I am reading this right, the temperature on this drive at the time this reading was taken was extremely high! You might want to consider extra cooling measures in your case.