January 2, 2007

So I've been copying data to my newly constructed tower, with the following drives:

1) 500G WD SATA (parity)
2) 500G WD SATA (data)
3) 500G Seagate PATA (data)
4) 500G Seagate PATA (data)
5) 500G WD SATA (data)

Drives 1-4 are on the motherboard (ASUS P5PE-VM). Drive 5 is on a PROMISE SATA300 TX4 PCI SATA controller card. All of the data drives are now full, and I never had any errors reported in the final column of the "Tower Main page".

I just added drive #6, which is also a 500G WD SATA (on the PROMISE card). It was recognized and formatted successfully. I was copying about 70G to this new drive and noticed some "Errors" creeping up on the Tower main page. I looked in /var/log, which indicated a couple of READ errors. After I had about 60G copied over, the unRaid main page was showing 5 errors, so I cancelled the copy. I also deleted the data from the target drive (so now my 6th drive is back to empty).

So my questions are as follows: Is this normal? And what should I do? I'm running a parity check now (not sure what this actually does).

Potential problem areas:

1) Bad drive? I can go to the store and exchange it for a new one, but I am unsure of the process for replacing a bad drive in unRaid, especially one with no data on it. I was just going to stop the array, put the replacement drive in, then go through the process as if this were a brand-new ADDITIONAL drive. Would this work? I don't want to corrupt the data on drives 1-4.
2) Bad power? The SPARKLE 350W supply I am using is powering 5 drives, the Coolermaster Stacker fan, and 2 additional fans that are on two 4-for-3 modules. I had to use a Y-adapter to connect the 6th drive, and hope this is not a problem.
3) Bad Promise card?

Any advice would be appreciated, as I am going to hold off copying any more data until I get this resolved. Thanks!
January 3, 2007

I'm assuming you're at a state where all 6 disks are installed and parity is valid (though the new disk is empty of any data because you deleted it).

Go to the Settings page and click on the Clear Statistics button in the Disk Settings section. Then go back to the Main page and start a parity-check. This will read the entire surface of all the data drives, generate parity, and compare it to the parity disk. Any errors that occur in the Errors column will represent Read errors on those disks. You should not get any Sync errors.

If the error counter increments during this process, you should cancel it. Power down and check cabling (power and data), power up and check again. If the errors persist, let me know and we can try a few other things to isolate the problem. As you correctly point out, it could be the disk, the power, or the controller card (or cable).
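As a rough illustration of what the parity check described above does: assuming single parity computed as a byte-wise XOR across the data disks (consistent with the "measuring xor speed" lines the md driver logs later in this thread, although the real driver works on disk sectors inside the kernel), a sync error is any block where the recomputed parity differs from what the parity disk holds. The function names and the tiny in-memory "disks" below are hypothetical.

```python
from functools import reduce


def compute_parity(blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def count_sync_errors(data_disks, parity_disk, block_size=4):
    """Recompute parity block by block and count mismatches with the parity disk."""
    sync_errors = 0
    for offset in range(0, len(parity_disk), block_size):
        blocks = [disk[offset:offset + block_size] for disk in data_disks]
        if compute_parity(blocks) != parity_disk[offset:offset + block_size]:
            sync_errors += 1
    return sync_errors


def rebuild_disk(surviving_data_disks, parity_disk):
    """With XOR parity, one missing disk is the XOR of parity and the survivors."""
    return compute_parity(list(surviving_data_disks) + [parity_disk])


if __name__ == "__main__":
    # Hypothetical 8-byte "disks" standing in for real drives.
    disks = [bytes(range(8)), bytes(range(8, 16)), bytes([0xFF] * 8)]
    parity = compute_parity(disks)
    print("sync errors:", count_sync_errors(disks, parity))
    print("disk 2 rebuilt correctly:", rebuild_disk([disks[0], disks[2]], parity) == disks[1])
```

The same XOR relationship is why a single failed disk can be rebuilt from the remaining disks plus parity, and why a parity disk that is out of sync makes such a rebuild untrustworthy.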
January 4, 2007 (Author)

As suggested, I cleared the stats and reran the parity check. All drives were covered with no errors and no sync errors. So I checked the SATA cables and replaced the one to the drive in question. I also swapped the power connector to the new drive.

Feeling lucky, I attempted to copy a block of about 33G to the "problematic" new disk... no errors. About an hour later I copied another 30G to the new disk... no errors. A few more hours later, I tried to copy another 20G... then a read error popped up. It is still copying, but the end of my syslog looks like this:

Jan 4 03:40:34 Tower ifplugd(eth0)[945]: Executing '/etc/ifplugd/ifplugd.action eth0 up'.
Jan 4 03:40:34 Tower kernel: md: reading superblock from /boot/config/super.dat
Jan 4 03:40:34 Tower kernel: md: superblock events: 33
Jan 4 03:40:34 Tower kernel: md: import hdb ST3500641A 3PM103R2 offset: 63 size: 488386552
Jan 4 03:40:34 Tower kernel: md0: wrong
Jan 4 03:40:34 Tower kernel: md: import hda ST3500641A 3PM17RN6 offset: 63 size: 488386552
Jan 4 03:40:34 Tower kernel: md1: wrong
Jan 4 03:40:34 Tower kernel: md: import sdb WDC WD5000KS-00M WD-WMANU1659660 offset: 63 size: 488386552
Jan 4 03:40:34 Tower kernel: md2: wrong
Jan 4 03:40:34 Tower kernel: md: import sdc WDC WD5000KS-00M WD-WMANU1700723 offset: 63 size: 488386552
Jan 4 03:40:34 Tower kernel: md3: wrong
Jan 4 03:40:34 Tower kernel: md: import sdd WDC WD5000KS-00M WD-WCANU1475608 offset: 63 size: 488386552
Jan 4 03:40:34 Tower kernel: md4: wrong
Jan 4 03:40:34 Tower emhttp[937]: Device inventory:
Jan 4 03:40:34 Tower kernel: md: import sde WDC WD5000KS-00M WD-WMANU1719789 offset: 63 size: 488386552
Jan 4 03:40:34 Tower kernel: md5: wrong
Jan 4 03:40:34 Tower emhttp[937]: /dev/hdb (hdb1) PATA:ST3500641A/3PM103R2
Jan 4 03:40:34 Tower emhttp[937]: /dev/hda (hda1) PATA:ST3500641A/3PM17RN6
Jan 4 03:40:34 Tower emhttp[937]: /dev/scsi/sdh2-2c0i0l0 (sdb1) SATA:WDC WD5000KS-00M/WD-WMANU1659660
Jan 4 03:40:34 Tower emhttp[937]: /dev/scsi/sdh4-4c0i0l0 (sdc1) SATA:WDC WD5000KS-00M/WD-WMANU1700723
Jan 4 03:40:34 Tower emhttp[937]: /dev/scsi/sdh5-5c0i0l0 (sdd1) SATA:WDC WD5000KS-00M/WD-WCANU1475608
Jan 4 03:40:34 Tower emhttp[937]: /dev/scsi/sdh6-6c0i0l0 (sde1) SATA:WDC WD5000KS-00M/WD-WMANU1719789
Jan 4 03:40:34 Tower emhttp[937]: shcmd (3): rmmod md-mod >>/var/log/go 2>&1
Jan 4 03:40:34 Tower kernel: md: unraid driver removed.
Jan 4 03:40:34 Tower emhttp[937]: shcmd (4): modprobe md-mod super=/boot/config/super.dat slots=8,48,8,64,3,0,3,64,8,32,8,16,0,0,0,0,0,0,0,0,0,0,0,0 >>/var/log/go 2>&1
Jan 4 03:40:35 Tower ifplugd(eth0)[945]: Program executed successfully.
Jan 4 03:40:35 Tower kernel: md: unraid driver 0.91.0
Jan 4 03:40:35 Tower kernel: md: sizeof(mdp_super_t) = 4096
Jan 4 03:40:35 Tower kernel: md: measuring xor speed
Jan 4 03:40:35 Tower kernel: 8regs : 3383.600 MB/sec
Jan 4 03:40:35 Tower kernel: 32regs : 2198.800 MB/sec
Jan 4 03:40:35 Tower kernel: pIII_sse : 5167.200 MB/sec
Jan 4 03:40:35 Tower kernel: pII_mmx : 3219.200 MB/sec
Jan 4 03:40:35 Tower kernel: p5_mmx : 3247.200 MB/sec
Jan 4 03:40:35 Tower kernel: md: using function: pIII_sse (5167.200 MB/sec)
Jan 4 03:40:35 Tower kernel: md: superblock -> /boot/config/super.dat
Jan 4 03:40:35 Tower kernel: md: slot 0 device -> sdd [8,48]
Jan 4 03:40:35 Tower kernel: md: slot 1 device -> sde [8,64]
Jan 4 03:40:35 Tower kernel: md: slot 2 device -> hda [3,0]
Jan 4 03:40:35 Tower kernel: md: slot 3 device -> hdb [3,64]
Jan 4 03:40:35 Tower kernel: md: slot 4 device -> sdc [8,32]
Jan 4 03:40:35 Tower kernel: md: slot 5 device -> sdb [8,16]
Jan 4 03:40:35 Tower kernel: md: slot 6 device -> ? [0,0]
Jan 4 03:40:35 Tower kernel: md: slot 7 device -> ? [0,0]
Jan 4 03:40:35 Tower kernel: md: slot 8 device -> ? [0,0]
Jan 4 03:40:35 Tower kernel: md: slot 9 device -> ? [0,0]
Jan 4 03:40:35 Tower kernel: md: slot 10 device -> ? [0,0]
Jan 4 03:40:35 Tower kernel: md: slot 11 device -> ? [0,0]
Jan 4 03:40:35 Tower kernel: md: reading superblock from /boot/config/super.dat
Jan 4 03:40:35 Tower kernel: md: superblock events: 33
Jan 4 03:40:35 Tower kernel: md: import sdd WDC WD5000KS-00M WD-WCANU1475608 offset: 63 size: 488386552
Jan 4 03:40:35 Tower kernel: md: import sde WDC WD5000KS-00M WD-WMANU1719789 offset: 63 size: 488386552
Jan 4 03:40:35 Tower kernel: md: import hda ST3500641A 3PM17RN6 offset: 63 size: 488386552
Jan 4 03:40:35 Tower kernel: md: import hdb ST3500641A 3PM103R2 offset: 63 size: 488386552
Jan 4 03:40:35 Tower kernel: md: import sdc WDC WD5000KS-00M WD-WMANU1700723 offset: 63 size: 488386552
Jan 4 03:40:35 Tower kernel: md: import sdb WDC WD5000KS-00M WD-WMANU1659660 offset: 63 size: 488386552
Jan 4 03:40:35 Tower emhttp[937]: shcmd (5): killall -w smbd nmbd
Jan 4 03:40:35 Tower nmbd[930]: [2007/01/04 03:40:35, 0] nmbd/nmbd.c:terminate(56)
Jan 4 03:40:35 Tower nmbd[930]: Got SIGTERM: going down...
Jan 4 03:40:37 Tower emhttp[937]: Scanning user shares...
Jan 4 03:40:37 Tower emhttp[937]: shcmd (6): rm -r /mnt/user/* 2>/dev/null
Jan 4 03:40:37 Tower emhttp[937]: shcmd (7): /usr/sbin/nmbd -D
Jan 4 03:40:37 Tower emhttp[937]: shcmd (8): /usr/sbin/smbd -D
Jan 4 03:40:37 Tower emhttp[995]: driver cmd: start STOPPED
Jan 4 03:40:37 Tower kernel: mdcmd (2): start
Jan 4 03:40:37 Tower kernel: md: reading superblock from /boot/config/super.dat
Jan 4 03:40:37 Tower kernel: md: superblock events: 33
Jan 4 03:40:37 Tower kernel: md: import sdd WDC WD5000KS-00M WD-WCANU1475608 offset: 63 size: 488386552
Jan 4 03:40:37 Tower kernel: md: import sde WDC WD5000KS-00M WD-WMANU1719789 offset: 63 size: 488386552
Jan 4 03:40:37 Tower kernel: md: import hda ST3500641A 3PM17RN6 offset: 63 size: 488386552
Jan 4 03:40:37 Tower kernel: md: import hdb ST3500641A 3PM103R2 offset: 63 size: 488386552
Jan 4 03:40:37 Tower kernel: md: import sdc WDC WD5000KS-00M WD-WMANU1700723 offset: 63 size: 488386552
Jan 4 03:40:37 Tower kernel: md: import sdb WDC WD5000KS-00M WD-WMANU1659660 offset: 63 size: 488386552
Jan 4 03:40:37 Tower kernel: unraid: allocated 6370kB for md0
Jan 4 03:40:37 Tower kernel: md1: running, size: 488386552 blocks
Jan 4 03:40:37 Tower kernel: md2: running, size: 488386552 blocks
Jan 4 03:40:37 Tower kernel: md3: running, size: 488386552 blocks
Jan 4 03:40:37 Tower kernel: md4: running, size: 488386552 blocks
Jan 4 03:40:37 Tower kernel: md5: running, size: 488386552 blocks
Jan 4 03:40:37 Tower emhttp[995]: driver cmd: check
Jan 4 03:40:37 Tower kernel: mdcmd (4): check
Jan 4 03:40:37 Tower kernel: md: recovery thread got woken up ...
Jan 4 03:40:37 Tower kernel: md: writing superblock to /boot/config/super.dat
Jan 4 03:40:37 Tower emhttp[1004]: shcmd (9): mount -t reiserfs -o noatime,nodiratime /dev/md1 /mnt/disk1
Jan 4 03:40:37 Tower emhttp[1005]: shcmd (9): mount -t reiserfs -o noatime,nodiratime /dev/md2 /mnt/disk2
Jan 4 03:40:37 Tower emhttp[1006]: shcmd (9): mount -t reiserfs -o noatime,nodiratime /dev/md3 /mnt/disk3
Jan 4 03:40:37 Tower kernel: reiserfs: found format "3.6" with standard journal
Jan 4 03:40:37 Tower kernel: md: recovery thread has nothing to resync
Jan 4 03:40:37 Tower emhttp[1007]: shcmd (9): mount -t reiserfs -o noatime,nodiratime /dev/md4 /mnt/disk4
Jan 4 03:40:37 Tower emhttp[1008]: shcmd (9): mount -t reiserfs -o noatime,nodiratime /dev/md5 /mnt/disk5
Jan 4 03:40:42 Tower kernel: reiserfs: found format "3.6" with standard journal
Jan 4 03:40:42 Tower last message repeated 3 times
Jan 4 03:41:26 Tower kernel: reiserfs: checking transaction log (device md(9,4)) ...
Jan 4 03:41:26 Tower kernel: for (md(9,4))
Jan 4 03:41:26 Tower kernel: reiserfs: checking transaction log (device md(9,5)) ...
Jan 4 03:41:26 Tower kernel: for (md(9,5))
Jan 4 03:41:26 Tower kernel: reiserfs: checking transaction log (device md(9,1)) ...
Jan 4 03:41:26 Tower kernel: for (md(9,1))
Jan 4 03:41:26 Tower kernel: md(9,5):Using r5 hash to sort names
Jan 4 03:41:26 Tower emhttp[1008]: remount: /dev/md5
Jan 4 03:41:26 Tower kernel: md(9,5):can't shrink filesystem on-line
Jan 4 03:41:26 Tower kernel: md(9,4):Using r5 hash to sort names
Jan 4 03:41:26 Tower kernel: md(9,1):Using r5 hash to sort names
Jan 4 03:41:26 Tower emhttp[1007]: remount: /dev/md4
Jan 4 03:41:26 Tower kernel: md(9,4):can't shrink filesystem on-line
Jan 4 03:41:26 Tower emhttp[1004]: remount: /dev/md1
Jan 4 03:41:26 Tower kernel: md(9,1):can't shrink filesystem on-line
Jan 4 03:41:29 Tower kernel: reiserfs: checking transaction log (device md(9,3)) ...
Jan 4 03:41:29 Tower kernel: for (md(9,3))
Jan 4 03:41:29 Tower kernel: reiserfs: checking transaction log (device md(9,2)) ...
Jan 4 03:41:29 Tower kernel: for (md(9,2))
Jan 4 03:41:29 Tower kernel: md(9,3):Using r5 hash to sort names
Jan 4 03:41:29 Tower emhttp[1006]: remount: /dev/md3
Jan 4 03:41:29 Tower kernel: md(9,3):can't shrink filesystem on-line
Jan 4 03:41:29 Tower kernel: md(9,2):Using r5 hash to sort names
Jan 4 03:41:29 Tower emhttp[1005]: remount: /dev/md2
Jan 4 03:41:29 Tower kernel: md(9,2):can't shrink filesystem on-line
Jan 4 03:41:29 Tower emhttp[995]: shcmd (9): killall -w smbd nmbd
Jan 4 03:41:29 Tower nmbd[991]: [2007/01/04 03:41:29, 0] nmbd/nmbd.c:terminate(56)
Jan 4 03:41:29 Tower nmbd[991]: Got SIGTERM: going down...
Jan 4 03:41:31 Tower emhttp[995]: Scanning user shares...
Jan 4 03:41:31 Tower emhttp[995]: shcmd (10): rm -r /mnt/user/* 2>/dev/null
Jan 4 03:41:32 Tower emhttp[995]: user share: Movies
Jan 4 03:41:32 Tower emhttp[995]: user share: Musaac
Jan 4 03:41:32 Tower emhttp[995]: shcmd (11): /usr/sbin/nmbd -D
Jan 4 03:41:32 Tower emhttp[995]: shcmd (12): /usr/sbin/smbd -D
Jan 4 04:42:38 Tower emhttp[1014]: shcmd (13): /usr/sbin/hdparm -y /dev/sde >/dev/null
Jan 4 04:42:38 Tower emhttp[1014]: shcmd (14): /usr/sbin/hdparm -y /dev/hda >/dev/null
Jan 4 04:42:38 Tower emhttp[1014]: shcmd (15): /usr/sbin/hdparm -y /dev/hdb >/dev/null
Jan 4 04:42:38 Tower emhttp[1014]: shcmd (16): /usr/sbin/hdparm -y /dev/sdc >/dev/null
Jan 4 06:03:39 Tower emhttp[1014]: shcmd (17): /usr/sbin/hdparm -y /dev/sdd >/dev/null
Jan 4 06:03:39 Tower emhttp[1014]: shcmd (18): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Jan 4 09:11:39 Tower emhttp[1014]: shcmd (19): /usr/sbin/hdparm -y /dev/sdd >/dev/null
Jan 4 09:11:39 Tower emhttp[1014]: shcmd (20): /usr/sbin/hdparm -y /dev/sdb >/dev/null
Jan 4 11:30:36 Tower kernel: ata2: status=0x50 { DriveReady SeekComplete }
Jan 4 11:30:36 Tower kernel: SCSI disk error : host 2 channel 0 id 0 lun 0 return code = 8000002
Jan 4 11:30:36 Tower kernel: Current sd08:11: sns = 70 0
Jan 4 11:30:36 Tower kernel: Raw sense data:0x70 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00
Jan 4 11:30:36 Tower kernel: I/O error: dev 08:11, sector 272767816
Jan 4 11:30:36 Tower kernel: md5: read error!
Jan 4 11:30:36 Tower kernel: end_read_request 272767816/5, count: 2, uptodate 0.

Any pointers? (Also disregard the time stamps; I don't have the time set correctly in unRaid... this happened today, not tomorrow.) Thanks!
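For anyone reading a syslog like the one in the post above, a small script can pull out just the read-error entries. This is a minimal sketch, not an unRaid tool: the two regular expressions are modeled only on the "I/O error: dev 08:11, sector ..." and "md5: read error!" lines quoted in the post, and the decoding assumes the `dev` field is major:minor in hexadecimal, with SCSI/SATA disks on major 8 and 16 minors per disk. Under that assumption 08:11 is sdb1, which agrees with the log's own "slot 5 device -> sdb [8,16]" line. The function names are hypothetical.

```python
import re

# Patterns modeled on the kernel messages quoted in the post above.
IO_ERROR = re.compile(
    r"I/O error: dev (?P<major>[0-9a-f]{2}):(?P<minor>[0-9a-f]{2}), sector (?P<sector>\d+)"
)
MD_ERROR = re.compile(r"(?P<md>md\d+): read error!")


def decode_sd_device(major_hex, minor_hex):
    """Map a hex major:minor pair to an sdXN name (assumes major 8, 16 minors per disk)."""
    major = int(major_hex, 16)
    minor = int(minor_hex, 16)
    if major != 8:
        return None  # not a SCSI/SATA disk under this assumption
    letter = chr(ord("a") + minor // 16)
    partition = minor % 16
    return f"sd{letter}{partition}" if partition else f"sd{letter}"


def summarize_read_errors(lines):
    """Return (list of (device, sector) I/O errors, list of md devices with read errors)."""
    io_errors, md_errors = [], []
    for line in lines:
        match = IO_ERROR.search(line)
        if match:
            device = decode_sd_device(match["major"], match["minor"])
            io_errors.append((device, int(match["sector"])))
        match = MD_ERROR.search(line)
        if match:
            md_errors.append(match["md"])
    return io_errors, md_errors


if __name__ == "__main__":
    sample = [
        "Jan 4 11:30:36 Tower kernel: I/O error: dev 08:11, sector 272767816",
        "Jan 4 11:30:36 Tower kernel: md5: read error!",
    ]
    print(summarize_read_errors(sample))
```

Running the same function over the whole of /var/log/syslog after each test copy gives a per-device list of failing sectors, which helps show whether the errors stay on one disk or move around.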
January 4, 2007

Looks like that disk is indeed flaky. One thing you could try is to swap it with the other disk on the Promise controller, or with one of the m/b disks (just swap the data cables). When you go to Start the array, the system will see that two disks are swapped around, and if you go ahead and Start, it will record the new positions and bring up the array. Now do your testing again to see whether the errors follow the hard disk or stay on that controller port, and you'll have your answer.
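The swap test above amounts to asking which component is present in every failing run and in no clean run. A minimal sketch of that bookkeeping follows; the disk and port labels are hypothetical placeholders, not values taken from this thread, and the function name is made up for illustration.

```python
from collections import defaultdict


def find_suspects(observations):
    """Given (disk, port, had_error) test runs, list components that failed in every run.

    A disk or port is a suspect only if all runs involving it produced errors.
    """
    suspects = {}
    for index, kind in ((0, "disk"), (1, "port")):
        outcomes = defaultdict(set)
        for run in observations:
            outcomes[run[index]].add(run[2])
        suspects[kind] = sorted(
            name for name, results in outcomes.items() if results == {True}
        )
    return suspects


if __name__ == "__main__":
    # Hypothetical test runs: each disk tried on a controller port and a motherboard port.
    runs = [
        ("disk-A", "controller-port-1", True),
        ("disk-A", "motherboard-port-1", False),
        ("disk-B", "controller-port-1", True),
        ("disk-B", "motherboard-port-1", False),
    ]
    print(find_suspects(runs))
```

If the errors follow one disk across ports, that disk shows up as the only suspect; if they stay on one port regardless of disk, the port (and by extension its cable or controller) does. Shared factors such as the power supply are not captured by this simple tally.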
January 4, 2007 (Author)

So I swapped data disks 4 and 5 (these are still connected to the Promise controller; I just swapped the cables). I did a block copy of about 80G to the flaky disk in question (now disk 4). About 40G in, I got a read error. So I cancelled it, then tried to copy the same 80G to disk 5. I received a read error there as well.

Thus this is not isolated to the single new disk. The error now appeared on a disk to which I previously wrote 300G with no errors, when it was attached to the Promise controller card all by itself. It seems that the errors appear only with the additional load of the new drive. The only common components between the two disks are the Promise controller and the power supply. Since I already successfully wrote 300G to the *other* disk that now produced a read error, I can only assume that the power supply is causing the problem?

I am using a SPARKLE FSP350-60PN ATX 350W: http://www.newegg.com/Product/Product.asp?Item=N82E16817103486

I did read another post (dsroot) that had virtually identical read errors listed in the syslog. It looks like his problems went away with an upgraded power supply. Any recommendation? Thanks!
January 5, 2007 (Author)

... and the saga continues....

I swapped the "suspect drive" for a SATA drive on the MOBO. I moved 90G of data to it with NO ERRORS. Now the commonality between the errors is the power supply or the Promise controller. The fact that the suspect drive showed no errors on the MOBO now makes me think it is the Promise card. So my questions are as follows:

1) Is this PS "marginal" for this application, or is it a shoo-in? Connected to it are 4 SATA drives, 2 PATA drives, 2 Coolermaster 3-to-4 fans, the Coolermaster Stacker top and rear fans, the MOBO, and the CPU fan. I don't know how much juice the fans draw.
2) I just noticed I have the Promise SATA controller card in the 3rd PCI slot (furthest from the CPU)... oops... Would this make a difference? Also, is there a way to "update drivers" for this card? I'm just using the card as it came out of the box last week from Newegg.

Thanks for the help!
January 5, 2007 (Author)

The saga is slowly turning into a nightmare. I thought the problem was isolated to the Promise controller, as I managed to get a read error on BOTH attached drives (1 error per drive, per about 90G written). As a final test, I swapped drive #5 with drive #2 on the MOBO. The array failed to start, due to a read error on the drive I had just put where drive #2 was. So now I've received read errors on 2 drives when attached to the Promise card, and also when one of the same drives was attached to the MOBO.

I was forced to telnet in and do a manual shutdown. When I started the box back up, it went into a parity sync check. And here's where it gets interesting... as the check was going on, I moved a few of the power cables in the box... apparently this was not well received by the drives attached to them. All of a sudden I got about 25 sync errors, and errors on drive 2 (PATA)!!! I stopped the sync, shut down the machine, AGAIN checked the power cables, then restarted. Great... now Disk 2 is marked red.

So now I am waiting for Disk 2 to be rebuilt, and who knows if the parity information is even good, since my last parity check got stalled in the middle. My enthusiasm is slowly turning to frustration.
January 7, 2007 (Author)

Just bought a new 500G drive to replace the one giving me read errors. This one does the same thing. Anyone?
April 18, 2007

I think I've run into the same issue. When I switched the new 500GB drive to the motherboard SATA instead of the Promise SATA card, the errors stopped. So I'm thinking either the Promise card has issues with newer/bigger drives or the Promise Linux drivers are flawed somehow. The motherboard's SATA chipset is made by VIA. I'm on unRaid 3.0; maybe 4.0 would fix it. I think I'd have to get a custom build of it, like I did for 3.0, to test it. If I get around to trying it I'll post back.
April 20, 2007

isaacg: As a reference (also for others reading this thread), I noted WannaTheater's post in another thread, indicating that V4.0beta resolved the 'errors' issue discussed in this thread, i.e.:

"UPDATE: Preliminary tests are indicating that unRaid 4.0beta2 seems to have taken care of this.... 10s of Gigs written to drives on TX4 and no read errors. Way to go Tom!"
April 20, 2007

Tried 4.0; dunno if it's fixed my error problems yet since I can't get it to boot. See message here: http://lime-technology.com/forum/index.php?topic=646.0