anthropoidape Posted February 12, 2012

Hi,

I started having weird stability problems a little while ago, but haven't really had time to do much about it. Basically, writes of large files fail, sometimes but not always. I didn't see any signs of a problem in my syslog, so I assumed it was a fault at my desktop's end of things.

Messing around a bit, I noticed that it happened when accessing the unRAID server via user shares, but not via disk shares. In other words, if I pasted a file to //tower/disk5/videos/ ... no problem. But if I pasted to a user share, craaash.

As of yesterday some syslog data has started appearing... lots of it. Basically the following, repeated over and over and over:

Feb 12 13:37:39 Lemur kernel: ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
Feb 12 13:37:39 Lemur kernel: ata7.00: BMDMA2 stat 0x80d1009
Feb 12 13:37:39 Lemur kernel: ata7.00: failed command: READ DMA EXT
Feb 12 13:37:39 Lemur kernel: ata7.00: cmd 25/00:08:58:12:c2/00:00:c8:00:00/e0 tag 0 dma 4096 in
Feb 12 13:37:39 Lemur kernel: res 51/40:08:58:12:c2/00:00:c8:00:00/f0 Emask 0x9 (media error)
Feb 12 13:37:39 Lemur kernel: ata7.00: status: { DRDY ERR }
Feb 12 13:37:39 Lemur kernel: ata7.00: error: { UNC }
Feb 12 13:37:39 Lemur kernel: ata7.00: configured for UDMA/100
Feb 12 13:37:39 Lemur kernel: ata7: EH complete

SMART test results:

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed: read failure 90% 4680 25953144
# 2 Short offline Aborted by host 90% 4680 -
# 3 Extended offline Completed: read failure 90% 4678 54840184
# 4 Short offline Completed without error 00% 4430 -

Running unRAID Pro v 4.7

Clearly (I think!) this is a problem with disk7 on my server. Can I make any reliable assumptions about what the problem is? Specifically, should I assume it's the HDD itself (it's actually the newest disk in the system, from about August last year) and that it needs replacing, or should I troubleshoot more, such as by replacing cables etc.?

Basically I don't want to fork out for a new drive if I don't need to, but equally I don't want to mess around looking for phantom problems if it is definitely or almost definitely the drive itself.

Any suggestions?

Thanks,
Jason
Joe L. Posted February 12, 2012

ata7 is not necessarily disk7. Since you did not post a complete syslog, we really cannot tell.

Media errors are un-readable sectors on a physical disk. They have nothing to do with user-share or direct disk access.

A "smartctl -a /dev/sdX" on each of your disks will probably show one (or more) with sectors pending re-allocation, or already re-allocated. You can then evaluate the true health of your drives.

Joe L.
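The per-disk check Joe describes can be scripted. What follows is a minimal Python sketch and not something posted in the thread: the function names parse_smart_attributes and read_smart_report are hypothetical, and the parser assumes the usual smartctl attribute table layout, with the raw value in the tenth whitespace-separated column. It only looks at attributes 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector), the two that correspond to "already re-allocated" and "pending re-allocation".

```python
import subprocess

# SMART attribute IDs of interest and the names smartctl prints for them.
WATCHED = {"5": "Reallocated_Sector_Ct", "197": "Current_Pending_Sector"}


def parse_smart_attributes(report):
    """Return {attribute_name: raw_value} for watched attributes in a smartctl -a report.

    Assumes the attribute table layout:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    found = {}
    for line in report.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue
        if fields[0] in WATCHED and fields[1] == WATCHED[fields[0]]:
            try:
                found[fields[1]] = int(fields[9])
            except ValueError:
                # Raw value was not a plain integer; skip rather than guess.
                continue
    return found


def read_smart_report(device):
    """Run smartctl -a on one device (for example /dev/sdg) and return its text output.

    smartctl uses non-zero exit codes as status flags, so the return code is
    deliberately not treated as an error here.
    """
    result = subprocess.run(
        ["smartctl", "-a", device], capture_output=True, text=True
    )
    return result.stdout


if __name__ == "__main__":
    # The device list is an assumption; adjust it to the disks in your server.
    for dev in ["/dev/sda", "/dev/sdb"]:
        print(dev, parse_smart_attributes(read_smart_report(dev)))
```

Any disk reporting a non-zero raw value for either attribute deserves a closer look at its full SMART report.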
anthropoidape Posted February 12, 2012

Thanks Joe. Sorry it took me a while to find the info on copying a syslog. I think I've attached one now.

With regard to pending sectors, the unRAID SMART report facility says:

Disk 0/SDA:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 1
Disk 1/SDB:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
Disk 2/SDD:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
Disk 3/SDC:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
Disk 4/SDE:
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
Disk 5/SDF:
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
Disk 6/SDH:
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
Disk 7/SDG:
197 Current_Pending_Sector 0x0032 198 198 000 Old_age Always - 680

I now see that disk 7 is a bit of an outlier in this respect. Am I right to think that the 680 is a bad number?

BTW, the reason I thought the user share vs disk share thing was relevant is just that it is what prompted me to think that I had an issue with a specific disk (the one automatically chosen by unRAID), rather than some other kind of problem causing my file transfers to fail. Then I found with some experimenting that disk 7 seemed to be the culprit.

The drive in question is "eligible for replacement" according to WD's warranty website.

Advice welcome; this is all outside my expertise. If I have provided the wrong info or provided it the wrong way, it's unintentional.

Thanks,
Jason

Attached: syslog.txt
Joe L. Posted February 12, 2012

As you suspect, that disk is starting to fail with many un-readable sectors.

The syslog confirms that ata7 is /dev/sdg

Feb 12 18:21:23 Lemur kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 12 18:21:23 Lemur kernel: ata7.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, max UDMA/133
Feb 12 18:21:23 Lemur kernel: ata7.00: 3907029168 sectors, multi 16: LBA48 NCQ (depth 0/32)
Feb 12 18:21:23 Lemur kernel: ata7.00: configured for UDMA/100
Feb 12 18:21:23 Lemur kernel: scsi 6:0:0:0: Direct-Access ATA WDC WD20EARS-00M 51.0 PQ: 0 ANSI: 5
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Write Protect is off
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Mode Sense: 00 3a 00 00
Feb 12 18:21:23 Lemur kernel: sd 6:0:0:0: [sdg] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

and that /dev/sdg is assigned as disk7

Feb 12 18:21:23 Lemur kernel: md: import disk7: [8,96] (sdg) WDC WD20EARS-00M WD-WCAZA4789784 size: 1953514552

At this point you should replace that disk as soon as possible. You should NOT perform a parity sync, as it would overwrite the existing parity with the zeros sent from the disk wherever un-readable sectors exist.

It is almost safer to un-assign that disk (disk7) and let unRAID simulate it from parity and the other disks until you can install a replacement. That would guarantee that a parity sync will not occur.

Joe L.
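The lookup Joe did by reading the syslog can be approximated with a short script. This is only a sketch under stated assumptions, not a tool from the thread: the function names md_slots and ata_models are hypothetical, and the regular expressions are written against the exact line shapes quoted above ("md: import diskN: [major,minor] (sdX)" and "ataN.00: ATA-8: MODEL, ..."). Other kernel or unRAID versions may log these lines differently. It does not link an ata port to an sdX device by itself; it just pulls out the two pieces so you can match them by model, as done above.

```python
import re

# "md: import disk7: [8,96] (sdg) ..." -> slot "disk7", device "sdg"
MD_IMPORT = re.compile(r"md: import (disk\d+): \[\d+,\d+\] \((sd[a-z]+)\)")

# "ata7.00: ATA-8: WDC WD20EARS-00MVWB0, 51.0AB51, ..." -> port "ata7", model string
ATA_MODEL = re.compile(r"(ata\d+)\.\d+: ATA-\d+: ([^,]+),")


def md_slots(syslog_text):
    """Map each sdX device name to the unRAID slot (diskN) it was imported as."""
    slots = {}
    for line in syslog_text.splitlines():
        match = MD_IMPORT.search(line)
        if match:
            slots[match.group(2)] = match.group(1)
    return slots


def ata_models(syslog_text):
    """Map each ataN port to the drive model the kernel reported on it."""
    models = {}
    for line in syslog_text.splitlines():
        match = ATA_MODEL.search(line)
        if match:
            models[match.group(1)] = match.group(2).strip()
    return models


if __name__ == "__main__":
    # The file name is an assumption; point it at your saved syslog.
    with open("syslog.txt") as handle:
        text = handle.read()
    print(md_slots(text))
    print(ata_models(text))
```

With both mappings printed, the ata port named in the error messages can be tied to a model, and the model to a device and array slot, before deciding which physical drive to pull.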
anthropoidape Posted February 12, 2012

Thanks Joe, I have just disabled the drive in question. It seemed like unRAID struggled to even unmount it; it took a long time. Now the whole system is running more smoothly with it "not installed".

I should be able to get it replaced by WD as I only bought it last August, but I will get another drive as well while I wait for the replacement.

Thank you for the help.

Jason