January 18, 2010

I have a bit of a weird issue that cropped up last night while I was sleeping. I will attach the relevant part of the syslog and try to describe what happened.

While I was asleep, it appears that one of my disks (sdj), a Seagate, decided not to play nice and tried to reset itself (which it did), but in the process it caused some weirdness. I have u-notify installed and it is reporting that the Seagate drive is not "there." But when I check through unMenu MyMain, unMenu Main, and the unRAID WebGUI, everything looks fine. Everything has a green ball beside it and looks to be happy. The drive letter does look like it got reset to sdl, but I am not sure whether that caused a problem.

unMenu does appear to be still looking for sdj to get a temp for the drive. Perhaps some smarts need to be added so that if a drive does get rearranged for whatever reason, it does not keep looking in the same spot?

Here is what u-notify sends me in the email:

Disk Temperature Status
-----------------------------------------------------------------
Parity Disk [sdg]: 32°C (DiskId: ata-Hitachi_HDS722020ALA330_JK1131YAG93KBV)
Disk 1 [sdd]: 35°C (DiskId: ata-ST3750640AS_5QD5ELLF)
Disk 2 [sde]: Spun-Down (DiskId: ata-WDC_WD5000AAKS-00TMA0_WD-WCAPW2595673)
Disk 3 [sdh]: 30°C (DiskId: ata-WDC_WD5000AAKS-00TMA0_WD-WCAPW2132942)
Disk 4 [sdc]: Spun-Down (DiskId: ata-SAMSUNG_HD753LJ_S13UJ1MQ330294)
Disk 5 [sdj]: Not-Reported (DiskId: ata-ST31000333AS_9TE088TH)
Disk 6 [sdk]: Spun-Down (DiskId: ata-ST31500341AS_6VS04GWN)
Disk 7 [sdi]: Spun-Down (DiskId: ata-Hitachi_HDS722020ALA330_JK1121YAGABMMS)

Disk SMART Health Status
-----------------------------------------------------------------
Parity Disk PASSED (DiskId: ata-Hitachi_HDS722020ALA330_JK1131YAG93KBV)
Disk 1 PASSED (DiskId: ata-ST3750640AS_5QD5ELLF)
Disk 2 Spun-Down (DiskId: ata-WDC_WD5000AAKS-00TMA0_WD-WCAPW2595673)
Disk 3 PASSED (DiskId: ata-WDC_WD5000AAKS-00TMA0_WD-WCAPW2132942)
Disk 4 Spun-Down (DiskId: ata-SAMSUNG_HD753LJ_S13UJ1MQ330294)
Disk 5 Not-Reported (DiskId: ata-ST31000333AS_9TE088TH)
Disk 6 Spun-Down (DiskId: ata-ST31500341AS_6VS04GWN)
Disk 7 Spun-Down (DiskId: ata-Hitachi_HDS722020ALA330_JK1121YAGABMMS)

Below are the pics and syslog.

syslog.txt
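For anyone who wants to pick the problem disk out of a report like the one above automatically, here is a rough Python sketch. It only assumes the line layout shown in the "Disk Temperature Status" section of the email; the function names (`parse_temperature_report`, `not_reported`) are made up for illustration and are not part of u-notify or unMenu.

```python
import re

# Matches lines such as:
#   Disk 5 [sdj]: Not-Reported (DiskId: ata-ST31000333AS_9TE088TH)
# The layout is assumed from the email quoted in this thread.
LINE_RE = re.compile(
    r"^(?P<slot>Parity Disk|Disk \d+) \[(?P<dev>sd[a-z]+)\]: "
    r"(?P<status>\S+) \(DiskId: (?P<disk_id>[^)]+)\)$"
)


def parse_temperature_report(text):
    """Return one dict per disk line found in the temperature section."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def not_reported(entries):
    """Keep only the disks whose status is 'Not-Reported'."""
    return [e for e in entries if e["status"] == "Not-Reported"]


# Sample copied from the email in the post above.
SAMPLE = """\
Parity Disk [sdg]: 32°C (DiskId: ata-Hitachi_HDS722020ALA330_JK1131YAG93KBV)
Disk 1 [sdd]: 35°C (DiskId: ata-ST3750640AS_5QD5ELLF)
Disk 2 [sde]: Spun-Down (DiskId: ata-WDC_WD5000AAKS-00TMA0_WD-WCAPW2595673)
Disk 3 [sdh]: 30°C (DiskId: ata-WDC_WD5000AAKS-00TMA0_WD-WCAPW2132942)
Disk 4 [sdc]: Spun-Down (DiskId: ata-SAMSUNG_HD753LJ_S13UJ1MQ330294)
Disk 5 [sdj]: Not-Reported (DiskId: ata-ST31000333AS_9TE088TH)
Disk 6 [sdk]: Spun-Down (DiskId: ata-ST31500341AS_6VS04GWN)
Disk 7 [sdi]: Spun-Down (DiskId: ata-Hitachi_HDS722020ALA330_JK1121YAGABMMS)
"""

if __name__ == "__main__":
    for entry in not_reported(parse_temperature_report(SAMPLE)):
        print(entry["slot"], entry["dev"], entry["disk_id"])
```

Keying on the DiskId rather than the sdX name is deliberate: the DiskId stays the same when the kernel hands the drive a new letter, so it is the safer thing to alert on.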
January 19, 2010 Author

OK, so a little update on my "problem." I stopped the server, unassigned the drive, and restarted the server. Everything seemed fine when it came back up, so I reassigned the drive and started the parity rebuild. Well... it got about 3 minutes in and the drive started throwing errors all over the place. I decided to try one more time, but when the server restarted this time the drive was not even recognized. In the syslog it looked like it was trying to be set up, but it just would not stay connected. I changed the SATA cable and the power adapter on it while I was at it (even used a different port on the motherboard) and it still did not come back up.

This is the SECOND replacement drive I have gotten from Seagate that has failed in about a month's time. They will be getting a call in the morning and I will be giving them an earful. This is just getting annoying and out of control!! They have sent me "refurb" drives each time, but I am not going to accept another refurb for the next one.
February 25, 2010

Howdy, I've been out of touch here awhile, so my comments this late will probably be useless ...

I've seen this several times, and it is always hard to diagnose definitively. The disabling of a drive by the low-level kernel module always seems to be fatal, and what makes it worse is that unRAID is not directly informed of the loss of the drive.

At 04:08:10 of this syslog piece, the first exception occurs, and only 14 seconds later the drive is disabled. Hard resets were tried, and at first the SATA link was working at full speed, but full communications were not successful. Then even the SATA link goes down, and the kernel gives up on the drive and marks it disabled. The drive is still there and sporadically responds partially, but it takes over 35 minutes before the kernel is even able to identify the drive. Probably because sdj is still in use (by unRAID associating it with Disk 5), the kernel assigns a new device ID, sdl, but this is useless from unRAID's point of view, because unRAID is completely unaware of this assignment. Even if the drive could have been completely mounted, unRAID would not know about it and would continue trying to contact sdj. Any attempt at that point to use Disk 5 would have resulted in apparent drive errors for it and a red ball.

The most likely cause to me is something wrong with the interface to the drive, such as the cables, the power, or the port, but you very knowledgeably changed all 3 of those, so I think you were right that something is wrong with the drive. Perhaps corrupted firmware? Perhaps a bent pin or lead somewhere? The way it keeps connecting and then disappearing sounds like a bad backplane or cable connection, with serious vibration causing disconnections and reconnections. Perhaps an electrical issue with the connectors or circuit board of the drive itself...
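To illustrate the sdj-to-sdl point, here is a minimal Python sketch of looking a drive up by its persistent ID instead of remembering its sdX letter. It assumes a typical Linux system where udev keeps symlinks under /dev/disk/by-id, and that the DiskId strings in the u-notify email (ata-...) correspond to those link names. Neither assumption is confirmed in this thread, and nothing here is how unRAID or unMenu actually works. The names `resolve_device` and `device_moved` are hypothetical.

```python
import os


def resolve_device(disk_id, by_id_dir="/dev/disk/by-id"):
    """Return the current kernel name (e.g. 'sdl') for a disk ID, or None.

    Assumes by_id_dir holds symlinks named after the disk ID that point
    at the device node, as udev normally provides on Linux.
    """
    link = os.path.join(by_id_dir, disk_id)
    if not os.path.islink(link):
        return None  # no such ID: the drive is not currently visible
    target = os.path.realpath(link)
    if not os.path.exists(target):
        return None  # dangling link: the device node has gone away
    return os.path.basename(target)


def device_moved(disk_id, cached_name, by_id_dir="/dev/disk/by-id"):
    """True if the disk is present but under a different name than cached."""
    current = resolve_device(disk_id, by_id_dir)
    return current is not None and current != cached_name


if __name__ == "__main__":
    disk_id = "ata-ST31000333AS_9TE088TH"  # Disk 5 from the email above
    print(resolve_device(disk_id))
    print(device_moved(disk_id, "sdj"))
```

A monitoring script that re-resolved the ID before each temperature or SMART query would follow the drive to its new letter. It would not help the array itself, since, as described above, unRAID still has Disk 5 tied to sdj.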
February 25, 2010 Author

Thanks for the input RobJ, your opinion is always welcome.
I don't have any backplanes in the system, and all of my drives are secure in these and can't move at all. I went as far as trying the drive on a completely new motherboard with all new parts, as the only disk in the system. It mounted fine when it first started up but then proceeded to do the exact same thing on this completely new system.

I did end up getting a replacement drive for this one that was NEW after I bitched with them for nearly 30 minutes. They so badly wanted to send me another refurb drive AND charge me for the replacement. I have run quite a few preclears on this new drive and so far nothing. I think I will run another 5 cycles or so just to try and break it.

The sad part about all of this... I had one of my Seagate 1.5TB drives do the exact same thing. It comes online on first boot, stays around for a little while, and then continually tries to reset and recover. When I unplug the drive from the server, everything works fine. I am going to pull it soon and see if I can get it to be recognized via a USB connection. I doubt it will work, and I will probably have to send it back in.