June 29, 2007
I seem to be experiencing a problem when the disks have spun down. Initially I thought it might be caused by some aspect of XBMC, but today I found that the server responded the same way when I accessed the shares via "Network Places".
I haven't spent much time with my server yet (watched about 3 movies from it so far), as I have been extremely busy setting up a new PC, among other things. Also, I don't want to rely on it completely until things are working consistently. It is the basic 3 HDD setup (500GB SATA). While both data drives have content, only one drive is currently being accessed (about 135GB).
Everything seems to work perfectly while the HDDs are spinning (like after boot-up), but once the drives have spun down, problems occur. It's as if the disks won't spin up when they are accessed. If I "Stop" the array, I get red indicators across the board, saying things like "wrong disks" and in some cases text that looks like a foreign language. If I reboot, everything is fine again until the disks spin down once more. I don't remember this being the case when I first set it up (which wasn't that long ago).
Does anyone have any suggestions? I plan a log excursion tomorrow.
June 30, 2007
I have a drive that has behaved in a similar way since I installed it about a month ago. It is an older Maxtor 6L300R0 300GB IDE drive that spins down like all of the others, but never comes back up. In fact, after rebooting, not even the BIOS/CMOS can see it. Even powering off and waiting 2 or 3 seconds is not long enough. If unRAID spins it down, I have to shut the computer off and leave it off for about 10 seconds (I'm not sure of the exact minimum required), and then it comes up as if nothing is wrong, except that unRAID demands a rebuild.
For now I have had to disable spin down, but I'm eagerly awaiting the individual control of spindown that's on Tom's Feature Request list, so I can spin the rest of the drives down. Except for this misbehaving drive, my drives stay off more than 95% of the time, a nice saving on the electric bill.
unRAID sends an 'hdparm -y' command to put a drive in standby, which spins it down but leaves it alive, maintaining its state on only a watt or 2 of power. It's possible that this Maxtor thinks it received an 'hdparm -Y' command, which tells it to sleep; a sleeping drive is essentially turned off completely and requires a hard reset or power cycle to restore. Bad firmware is my guess.
My problem may not be the same issue as yours, but it does sound similar. I would be interested in seeing your syslog. Are all 3 drives the same model, and do they all turn red after spindown and attempted access?
I don't want to hijack your thread, but I would love any ideas for keeping one drive awake by accessing it every 55 minutes. I could set spindown to an hour again, and just keep this one drive up. It can't be a timed write, because that would wake the parity drive. It can't be a small read, because caching would catch it. A periodic read of a gigabyte file seems excessive (e.g. cp OneGigabyteSizedFile.dat /dev/null). Some kind of cron+hdparm macro seems about right, but I have very little experience with Linux, and trial-and-error experimentation may result in more disk rebuilding. Please excuse the intrusion of my request in your thread.
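One possible shape for the keep-awake read described above, offered only as an untested sketch: open the raw device with O_DIRECT so the Linux page cache is bypassed, and read a different block each time so the drive's own buffer is less likely to answer. The function names, the 4096-byte block size, and the idea of passing a device path such as /dev/hdX are all assumptions for illustration, not anything unRAID provides. Whether a single-block read reliably resets a given drive's spindown timer is also an assumption that would need testing on the actual hardware.

```c
#define _GNU_SOURCE              /* exposes O_DIRECT on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0               /* fallback elsewhere: plain read, cache NOT bypassed */
#endif

#define BLOCK 4096               /* assumed alignment that satisfies O_DIRECT */

/* Offset for the Nth keep-alive read: advance one block per call and wrap at
 * the end of the device, so consecutive reads never ask for the same block. */
uint64_t keepalive_offset(uint64_t tick, uint64_t dev_bytes)
{
    uint64_t blocks = dev_bytes / BLOCK;
    assert(blocks > 0);
    return (tick % blocks) * BLOCK;
}

/* Read one block directly from the device (hypothetical path, e.g. the raw
 * disk node). A read touches only that disk, so parity is not involved.
 * Returns 0 on success, -1 on any failure. */
int keepalive_read(const char *dev, uint64_t offset)
{
    void *buf = NULL;
    int rc = -1;
    int fd = open(dev, O_RDONLY | O_DIRECT);

    if (fd < 0)
        return -1;
    if (posix_memalign(&buf, BLOCK, BLOCK) == 0) {
        if (pread(fd, buf, BLOCK, (off_t)offset) == BLOCK)
            rc = 0;
        free(buf);
    }
    close(fd);
    return rc;
}
```

A small wrapper program could call keepalive_read() with an increasing tick from a scheduled job at an interval shorter than the spindown timeout. Because it only reads, a mistake here should not by itself trigger a rebuild, but that is worth confirming on a test disk first.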
June 30, 2007 Author
Here is some further info.
[Image: web page after trying to write to disk1]
[Image: web page after stopping the array]
And the syslog:
device
Jul 1 01:15:59 Tower kernel: [ 5501.305199] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.305203] end_request: I/O error, dev sda, sector 975700031
Jul 1 01:15:59 Tower kernel: [ 5501.305208] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.305211] handle_stripe read error: 975699968/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.305213] md2: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.305216] handle_stripe read error: 975699968/2, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.305341] ReiserFS: md1: warning: sh-2029: reiserfs_read_bitmap_block: bitmap block (#121962496) reading failed
Jul 1 01:15:59 Tower kernel: [ 5501.315164] sd 2:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.315168] end_request: I/O error, dev sdb, sector 975962175
Jul 1 01:15:59 Tower kernel: [ 5501.315174] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.315176] handle_stripe read error: 975962112/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.325162] scsi 3:0:0:0: rejecting I/O to dead device
Jul 1 01:15:59 Tower kernel: [ 5501.325208] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.325212] end_request: I/O error, dev sda, sector 975962175
Jul 1 01:15:59 Tower kernel: [ 5501.325217] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.325220] handle_stripe read error: 975962112/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.325223] md2: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.325225] handle_stripe read error: 975962112/2, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.325356] ReiserFS: md1: warning: sh-2029: reiserfs_read_bitmap_block: bitmap block (#121995264) reading failed
Jul 1 01:15:59 Tower kernel: [ 5501.335162] sd 2:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.335166] end_request: I/O error, dev sdb, sector 976224319
Jul 1 01:15:59 Tower kernel: [ 5501.335171] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.335174] handle_stripe read error: 976224256/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.345159] scsi 3:0:0:0: rejecting I/O to dead device
Jul 1 01:15:59 Tower kernel: [ 5501.345195] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.345199] end_request: I/O error, dev sda, sector 976224319
Jul 1 01:15:59 Tower kernel: [ 5501.345205] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.345207] handle_stripe read error: 976224256/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.345210] md2: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.345213] handle_stripe read error: 976224256/2, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.345338] ReiserFS: md1: warning: sh-2029: reiserfs_read_bitmap_block: bitmap block (#122028032) reading failed
Jul 1 01:15:59 Tower kernel: [ 5501.355160] sd 2:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.355164] end_request: I/O error, dev sdb, sector 976486463
Jul 1 01:15:59 Tower kernel: [ 5501.355169] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.355172] handle_stripe read error: 976486400/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.365157] scsi 3:0:0:0: rejecting I/O to dead device
Jul 1 01:15:59 Tower kernel: [ 5501.365193] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.365197] end_request: I/O error, dev sda, sector 976486463
Jul 1 01:15:59 Tower kernel: [ 5501.365202] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.365205] handle_stripe read error: 976486400/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.365207] md2: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.365210] handle_stripe read error: 976486400/2, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.365335] ReiserFS: md1: warning: sh-2029: reiserfs_read_bitmap_block: bitmap block (#122060800) reading failed
Jul 1 01:15:59 Tower kernel: [ 5501.375159] sd 2:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.375162] end_request: I/O error, dev sdb, sector 976748607
Jul 1 01:15:59 Tower kernel: [ 5501.375168] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.375171] handle_stripe read error: 976748544/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.385170] scsi 3:0:0:0: rejecting I/O to dead device
Jul 1 01:15:59 Tower kernel: [ 5501.385208] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.385212] end_request: I/O error, dev sda, sector 976748607
Jul 1 01:15:59 Tower kernel: [ 5501.385217] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.385220] handle_stripe read error: 976748544/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.385222] md2: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.385225] handle_stripe read error: 976748544/2, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.385350] ReiserFS: md1: warning: sh-2029: reiserfs_read_bitmap_block: bitmap block (#122093568) reading failed
Jul 1 01:15:59 Tower kernel: [ 5501.395157] sd 2:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.395161] end_request: I/O error, dev sdb, sector 199
Jul 1 01:15:59 Tower kernel: [ 5501.395166] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.395168] handle_stripe read error: 136/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.405154] scsi 3:0:0:0: rejecting I/O to dead device
Jul 1 01:15:59 Tower kernel: [ 5501.405190] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.405194] end_request: I/O error, dev sda, sector 199
Jul 1 01:15:59 Tower kernel: [ 5501.405199] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.405201] handle_stripe read error: 136/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.405204] md2: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.405206] handle_stripe read error: 136/2, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.405333] ReiserFS: md1: warning: sh-2029: reiserfs_read_bitmap_block: bitmap block (#17) reading failed
Jul 1 01:15:59 Tower kernel: [ 5501.415155] sd 2:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.415159] end_request: I/O error, dev sdb, sector 262207
Jul 1 01:15:59 Tower kernel: [ 5501.415164] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.415167] handle_stripe read error: 262144/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.425156] scsi 3:0:0:0: rejecting I/O to dead device
Jul 1 01:15:59 Tower kernel: [ 5501.425192] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.425196] end_request: I/O error, dev sda, sector 262207
Jul 1 01:15:59 Tower kernel: [ 5501.425201] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.425204] handle_stripe read error: 262144/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.425206] md2: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.425209] handle_stripe read error: 262144/2, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.425334] ReiserFS: md1: warning: sh-2029: reiserfs_read_bitmap_block: bitmap block (#32768) reading failed
Jul 1 01:15:59 Tower kernel: [ 5501.425343] ReiserFS: md1: warning: sh-2029: reiserfs_read_bitmap_block: bitmap block (#17) reading failed
Jul 1 01:15:59 Tower kernel: [ 5501.435158] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.435161] end_request: I/O error, dev sda, sector 9583
Jul 1 01:15:59 Tower kernel: [ 5501.435168] sd 2:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:15:59 Tower kernel: [ 5501.435171] end_request: I/O error, dev sdb, sector 9583
Jul 1 01:15:59 Tower kernel: [ 5501.435176] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435179] handle_stripe read error: 9520/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435182] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435184] handle_stripe read error: 9520/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435188] Buffer I/O error on device md1, logical block 1190
Jul 1 01:15:59 Tower kernel: [ 5501.435191] lost page write due to I/O error on md1
Jul 1 01:15:59 Tower kernel: [ 5501.435194] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435197] handle_stripe read error: 9528/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435199] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435202] handle_stripe read error: 9528/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435205] Buffer I/O error on device md1, logical block 1191
Jul 1 01:15:59 Tower kernel: [ 5501.435208] lost page write due to I/O error on md1
Jul 1 01:15:59 Tower kernel: [ 5501.435211] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435213] handle_stripe read error: 9536/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435216] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435218] handle_stripe read error: 9536/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435222] Buffer I/O error on device md1, logical block 1192
Jul 1 01:15:59 Tower kernel: [ 5501.435224] lost page write due to I/O error on md1
Jul 1 01:15:59 Tower kernel: [ 5501.435227] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435230] handle_stripe read error: 9544/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435232] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435235] handle_stripe read error: 9544/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435238] Buffer I/O error on device md1, logical block 1193
Jul 1 01:15:59 Tower kernel: [ 5501.435241] lost page write due to I/O error on md1
Jul 1 01:15:59 Tower kernel: [ 5501.435243] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435246] handle_stripe read error: 9552/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435248] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435251] handle_stripe read error: 9552/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435254] Buffer I/O error on device md1, logical block 1194
Jul 1 01:15:59 Tower kernel: [ 5501.435256] lost page write due to I/O error on md1
Jul 1 01:15:59 Tower kernel: [ 5501.435259] md0: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435261] handle_stripe read error: 9560/0, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435264] md1: read error!
Jul 1 01:15:59 Tower kernel: [ 5501.435266] handle_stripe read error: 9560/1, count: 1
Jul 1 01:15:59 Tower kernel: [ 5501.435269] Buffer I/O error on device md1, logical block 1195
Jul 1 01:15:59 Tower kernel: [ 5501.435272] lost page write due to I/O error on md1
Jul 1 01:15:59 Tower kernel: [ 5501.435896] REISERFS: abort (device md1): Journal write error in flush_commit_list
Jul 1 01:15:59 Tower kernel: [ 5501.435905] REISERFS: Aborting journal for filesystem on md1
Jul 1 01:16:10 Tower emhttp[1213]: get_temperature: open: No such file or directory
Jul 1 01:16:10 Tower emhttp[1211]: smart ioctl: Input/output error
Jul 1 01:16:10 Tower emhttp[1212]: smart ioctl: Input/output error
Jul 1 01:17:38 Tower emhttp[1216]: get_temperature: open: No such file or directory
Jul 1 01:17:38 Tower emhttp[1214]: smart ioctl: Input/output error
Jul 1 01:17:38 Tower emhttp[1215]: smart ioctl: Input/output error
Jul 1 01:20:08 Tower in.telnetd[1217]: connect from (IP) (IP)
Jul 1 01:20:10 Tower login[1218]: ROOT LOGIN on `pts/0' from `(IP)'
root@Tower:~#
June 30, 2007 Author
More log:
Jul 1 01:27:28 Tower in.telnetd[1233]: connect from `(IP) `(IP)
Jul 1 01:27:30 Tower login[1234]: ROOT LOGIN on `pts/0' from ``(IP)'
Jul 1 01:27:43 Tower emhttp[1249]: get_temperature: open: No such file or directory
Jul 1 01:27:43 Tower emhttp[1247]: smart ioctl: Input/output error
Jul 1 01:27:43 Tower emhttp[1248]: smart ioctl: Input/output error
Jul 1 01:27:46 Tower emhttp[1024]: shcmd (15): killall -w smbd nmbd
Jul 1 01:27:48 Tower emhttp[1024]: shcmd (16): sync
Jul 1 01:27:48 Tower kernel: [ 6210.111549] scsi 3:0:0:0: rejecting I/O to dead device
Jul 1 01:27:48 Tower kernel: [ 6210.111622] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:27:48 Tower kernel: [ 6210.111626] end_request: I/O error, dev sda, sector 3999
Jul 1 01:27:48 Tower kernel: [ 6210.111633] md0: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.111636] handle_stripe read error: 3936/0, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.111639] md2: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.111642] handle_stripe read error: 3936/2, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.111646] Buffer I/O error on device md2, logical block 492
Jul 1 01:27:48 Tower kernel: [ 6210.111648] lost page write due to I/O error on md2
Jul 1 01:27:48 Tower kernel: [ 6210.111652] md0: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.111654] handle_stripe read error: 3944/0, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.111657] md2: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.111659] handle_stripe read error: 3944/2, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.111663] Buffer I/O error on device md2, logical block 493
Jul 1 01:27:48 Tower kernel: [ 6210.111665] lost page write due to I/O error on md2
Jul 1 01:27:48 Tower kernel: [ 6210.111929] REISERFS: abort (device md2): Journal write error in flush_commit_list
Jul 1 01:27:48 Tower kernel: [ 6210.111937] REISERFS: Aborting journal for filesystem on md2
Jul 1 01:27:48 Tower emhttp[1252]: shcmd (17): umount /mnt/disk1
Jul 1 01:27:48 Tower emhttp[1255]: shcmd (17): umount /mnt/disk2
Jul 1 01:27:48 Tower kernel: [ 6210.121550] scsi 3:0:0:0: rejecting I/O to dead device
Jul 1 01:27:48 Tower kernel: [ 6210.121621] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:27:48 Tower kernel: [ 6210.121625] end_request: I/O error, dev sda, sector 9639
Jul 1 01:27:48 Tower kernel: [ 6210.121632] sd 1:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:27:48 Tower kernel: [ 6210.121635] end_request: I/O error, dev sda, sector 4023
Jul 1 01:27:48 Tower kernel: [ 6210.121640] sd 2:0:0:0: SCSI error: return code = 0x00040000
Jul 1 01:27:48 Tower kernel: [ 6210.121643] end_request: I/O error, dev sdb, sector 9639
Jul 1 01:27:48 Tower kernel: [ 6210.121649] md0: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.121652] handle_stripe read error: 3960/0, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.121655] md2: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.121657] handle_stripe read error: 3960/2, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.121662] Buffer I/O error on device md2, logical block 495
Jul 1 01:27:48 Tower kernel: [ 6210.121665] lost page write due to I/O error on md2
Jul 1 01:27:48 Tower kernel: [ 6210.121669] md0: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.121671] handle_stripe read error: 3968/0, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.121674] md2: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.121676] handle_stripe read error: 3968/2, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.121679] Buffer I/O error on device md2, logical block 496
Jul 1 01:27:48 Tower kernel: [ 6210.121682] lost page write due to I/O error on md2
Jul 1 01:27:48 Tower kernel: [ 6210.121685] md0: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.121687] handle_stripe read error: 9576/0, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.121690] md1: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.121692] handle_stripe read error: 9576/1, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.121696] Buffer I/O error on device md1, logical block 1197
Jul 1 01:27:48 Tower kernel: [ 6210.121698] lost page write due to I/O error on md1
Jul 1 01:27:48 Tower kernel: [ 6210.121702] md0: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.121704] handle_stripe read error: 9584/0, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.121707] md1: read error!
Jul 1 01:27:48 Tower kernel: [ 6210.121709] handle_stripe read error: 9584/1, count: 1
Jul 1 01:27:48 Tower kernel: [ 6210.121712] Buffer I/O error on device md1, logical block 1198
Jul 1 01:27:48 Tower kernel: [ 6210.121715] lost page write due to I/O error on md1
Jul 1 01:27:48 Tower emhttp[1024]: driver cmd: stop
Jul 1 01:27:48 Tower kernel: [ 6210.128536] mdcmd (123): stop
Jul 1 01:27:48 Tower kernel: [ 6210.128543] md0: stopping
Jul 1 01:27:48 Tower kernel: [ 6210.128544] md1: stopping
Jul 1 01:27:48 Tower kernel: [ 6210.128566] md2: stopping
Jul 1 01:27:48 Tower kernel: [ 6210.129978] md: writing superblock to /boot/config/super.dat
Jul 1 01:27:48 Tower kernel: [ 6210.194300] md0: import [8,0] (sda) Result=ok
Jul 1 01:27:48 Tower kernel: [ 6210.194302] per=stop
Jul 1 01:27:48 Tower kernel: [ 6210.194303] cmdResult=o offset: 63 size: 488386552
Jul 1 01:27:48 Tower kernel: [ 6210.194309] md0: wrong
Jul 1 01:27:48 Tower kernel: [ 6210.194455] md1: import [8,16] (sdb) Result=ok
Jul 1 01:27:48 Tower kernel: [ 6210.194456] per=stop
Jul 1 01:27:48 Tower kernel: [ 6210.194457] cmdResult=o offset: 63 size: 488386552
Jul 1 01:27:48 Tower kernel: [ 6210.194460] md1: wrong
Jul 1 01:27:48 Tower kernel: [ 6210.194465] md2: import: lock_rdev error: -6
Jul 1 01:27:48 Tower kernel: [ 6210.194466] md2: missing
Jul 1 01:27:48 Tower emhttp[1260]: smart ioctl: Input/output error
Jul 1 01:27:48 Tower emhttp[1261]: smart ioctl: Input/output error
Jul 1 01:27:48 Tower emhttp[1024]: Scanning user shares...
Jul 1 01:27:48 Tower emhttp[1024]: shcmd (17): rm -r /mnt/user/* 2>/dev/null
Jul 1 01:27:48 Tower emhttp[1024]: shcmd (18): /usr/sbin/nmbd -D
Jul 1 01:27:48 Tower emhttp[1024]: shcmd (19): /usr/sbin/smbd -D
Jul 1 01:27:48 Tower kernel: [ 6210.239533] md0: import [8,0] (sda) ^Aà^Gì\201 D\237.4åW^A\211Ãø offset: 63 size: 488386552
Jul 1 01:27:48 Tower kernel: [ 6210.239540] md0: wrong
Jul 1 01:27:48 Tower kernel: [ 6210.239661] md1: import [8,16] (sdb) ^Aà^Gì\201 D\237.4åW^A\211Ãø offset: 63 size: 488386552
Jul 1 01:27:48 Tower kernel: [ 6210.239665] md1: wrong
Jul 1 01:27:48 Tower kernel: [ 6210.239669] md2: import: lock_rdev error: -6
Jul 1 01:27:48 Tower kernel: [ 6210.239671] md2: missing
Jul 1 01:27:48 Tower emhttp[1269]: smart ioctl: Input/output error
Jul 1 01:27:48 Tower emhttp[1270]: smart ioctl: Input/output error
Jul 1 01:28:18 Tower kernel: [ 6240.098898] md0: import [8,0] (sda) /browse.dat /cache/samba/browse. offset: 63 size: 488386552
Jul 1 01:28:18 Tower kernel: [ 6240.098905] md0: wrong
Jul 1 01:28:18 Tower kernel: [ 6240.099012] md1: import [8,16] (sdb) /browse.dat /cache/samba/browse. offset: 63 size: 488386552
Jul 1 01:28:18 Tower kernel: [ 6240.099017] md1: wrong
Jul 1 01:28:18 Tower kernel: [ 6240.099022] md2: import: lock_rdev error: -6
Jul 1 01:28:18 Tower kernel: [ 6240.099025] md2: missing
Jul 1 01:29:18 Tower kernel: [ 6300.103505] md0: import [8,0] (sda) offset: 63 size: 488386552
Jul 1 01:29:18 Tower kernel: [ 6300.103510] md0: wrong
Jul 1 01:29:18 Tower kernel: [ 6300.103606] md1: import [8,16] (sdb) offset: 63 size: 488386552
Jul 1 01:29:18 Tower kernel: [ 6300.103610] md1: wrong
Jul 1 01:29:18 Tower kernel: [ 6300.103615] md2: import: lock_rdev error: -6
Jul 1 01:29:18 Tower kernel: [ 6300.103618] md2: missing
Jul 1 01:30:18 Tower kernel: [ 6360.108117] md0: import [8,0] (sda) offset: 63 size: 488386552
Jul 1 01:30:18 Tower kernel: [ 6360.108121] md0: wrong
Jul 1 01:30:18 Tower kernel: [ 6360.108192] md1: import [8,16] (sdb) offset: 63 size: 488386552
Jul 1 01:30:18 Tower kernel: [ 6360.108195] md1: wrong
Jul 1 01:30:18 Tower kernel: [ 6360.108199] md2: import: lock_rdev error: -6
Jul 1 01:30:18 Tower kernel: [ 6360.108201] md2: missing
Jul 1 01:31:18 Tower kernel: [ 6420.122725] md0: import [8,0] (sda)
Jul 1 01:31:18 Tower kernel: [ 6420.122726] diskSerial.2=5 500630AS
Jul 1 01:31:18 Tower kernel: [ 6420.122728] diskSe offset: 63 size: 488386552
Jul 1 01:31:18 Tower kernel: [ 6420.122734] md0: wrong
Jul 1 01:31:18 Tower kernel: [ 6420.122877] md1: import [8,16] (sdb)
Jul 1 01:31:18 Tower kernel: [ 6420.122879] diskSerial.2=5 500630AS
Jul 1 01:31:18 Tower kernel: [ 6420.122880] diskSe offset: 63 size: 488386552
Jul 1 01:31:18 Tower kernel: [ 6420.122884] md1: wrong
Jul 1 01:31:18 Tower kernel: [ 6420.122889] md2: import: lock_rdev error: -6
Jul 1 01:31:18 Tower kernel: [ 6420.122892] md2: missing
Jul 1 01:32:18 Tower kernel: [ 6480.127357] md0: import [8,0] (sda) offset: 63 size: 488386552
Jul 1 01:32:18 Tower kernel: [ 6480.127363] md0: wrong
Jul 1 01:32:18 Tower kernel: [ 6480.127474] md1: import [8,16] (sdb) offset: 63 size: 488386552
Jul 1 01:32:18 Tower kernel: [ 6480.127479] md1: wrong
Jul 1 01:32:18 Tower kernel: [ 6480.127484] md2: import: lock_rdev error: -6
Jul 1 01:32:18 Tower kernel: [ 6480.127486] md2: missing
root@Tower:~#
June 30, 2007
Wow! That looks worse than mine! http://lime-technology.com/forum/index.php?topic=796.0 (check Disk6 and Disk2, copied from Disk1)
This is the third unRAID system within a week reporting corruption in the drive list. I have a few comments, but I'm not an expert, so take them as you wish.
The missing drive temperatures seem somewhat significant to me, as those are quite popular Seagate 500GB drives that normally report their SMART data, including the temps. When you get the system back up, you might try parsec's cat commands, found here: http://lime-technology.com/forum/index.php?topic=754.0.
The fact that SMART does not seem to be working makes me wonder whether the system is configured optimally. I've read of others here, with barely working systems, who made suggested changes in CMOS that resulted in large improvements. For example, SATA drives originally configured as IDE drives were changed to Enhanced or AHCI in the CMOS. Perhaps if you listed your equipment (at least the motherboard, any drive controllers, and how much memory), someone with the same or similar hardware might be able to suggest settings that could improve performance and get SMART working properly.
The syslog parts above start a little too late, well after the failure. They are the cascading errors resulting from the earlier failures. We need to see the log from its start up to and including the errors where everything goes down. The start of the log documents the unRAID hardware detection and setup, which might explain some of the problems by itself. After that, the very first errors associated with the failure are the most important error messages of all. Any errors after them may be helpful, but are more likely a result of the problem than its cause. Most likely, Tom will want to see your entire syslog.
If you don't get better advice, I suggest running reiserfsck on disk1 and disk2: 'reiserfsck /dev/sdb1' and 'reiserfsck /dev/sdc1', I believe. I think Disk2 is sdc, but it's not mentioned in the syslog pieces above. Disk2 (md2) is already lost/missing before the first syslog piece. If reiserfsck gives further instructions like --fix-fixable or --rebuild-tree, then follow them. This should clear any corruption on the drives.
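As a rough illustration of looking for the first failure rather than the cascade that follows it, a small helper could report the first syslog line that mentions an error. This is only a sketch: the function name is made up, it assumes the whole log is already in memory as one string, and matching the lowercase substring "error" is a simplification that happens to fit the messages posted above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the 1-based number of the first line in `log` that contains the
 * substring "error", or 0 if no line does. `log` is a NUL-terminated buffer
 * holding the syslog text, with '\n' between lines. */
size_t first_error_line(const char *log)
{
    assert(log != NULL);

    size_t line_no = 1;
    const char *line = log;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);

        /* search only inside the current line */
        for (size_t i = 0; i + 5 <= len; i++) {
            if (memcmp(line + i, "error", 5) == 0)
                return line_no;
        }
        if (end == NULL)
            break;
        line = end + 1;
        line_no++;
    }
    return 0;
}
```

Everything from that line back to the start of the log (hardware detection and setup) is the part worth posting in full.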
June 30, 2007 Author
Thx for your thoughts. I am at a loss as to why this suddenly started to happen. Everything was working perfectly. This only happens after the disks have spun down!
As for the missing temps in the above pics, the first is a pic taken after the drives have spun down, so the temps aren't showing. The 2nd pic doesn't show them because of the fault. Usually the temps show and are quite low (mid 20s C); the drives sit in a cage with a 120mm fan all of their own.
As for the log, that's all the command gave me, except for this last part, which I didn't think was important. Like I said, I just don't know. The log posted is from the machine before the fault and then directly after. I have to wait an hour for the drives to spin down before I can retest. A reboot brings the system back to normal (everything looks fine), then the drives spin down and doh!!!
Is this "reiserfsck" thing a piece of software or part of the unRAID system? Will it hurt the content on the disks?
I'm using a P5B-E MB, set up as per Koolkiwi's post about it. Like I said, it has been working perfectly. Very strange. I'll try another log capture.
June 30, 2007
longshot: Have you changed anything in your BIOS in regard to power-saving features? Could it be that the drives are in so deep a sleep that the OS unRAID is based on cannot wake them up, or that they are not allowed to wake up on OS requests?
/Rene
Computing is 50% strict logic, and 50% voodoo. So pour the blood of a goat over the server and reboot.
July 1, 2007 Author
"...So pour the blood of a goat over the server and reboot."
Why hurt a goat? I've been considering getting the water blaster into it and cleaning out the cobwebs.
I haven't changed anything on the machine since its inception. It's been working great, although in all honesty I haven't had a lot of time to utilise it because of other projects. That said, every time I used it, it seemed to work flawlessly, so I really don't understand what is happening.
I only noticed this a couple of weeks ago. We started watching a movie from it (via XBMC) and all was well. We finished dinner, then later came back to finish the movie and it spun out (I realised later the drives had spun down). The same has been happening ever since. Perhaps I need an Exorcist?
The worst thing is that the fault causes XBMC to spin (it seems to think there is suddenly no such movie) and it deletes the movie from the database. A real pain, as I have to go and set it up again.
I'm in the process of getting another set of logs. Hope there is some light at the end of the tunnel. I certainly hope this isn't a reflection of its reliability.
July 1, 2007 Author
The new log. Here is the sequence of events:
1. Turn on the unRAID server.
2. Wait until the discs spin down (you can probably command this, but I don't know how).
3. Refresh (to verify disc spindown).
4. Try to copy a file to disk1 via "Network Places" (a write/access to initiate the fault).
5. Accessing disk1 via NP causes the fault (first time this has happened); only 1 movie is visible, where there should be 31.
6. System faults.
7. Stop the array.
8. The web page shows red indicators across the board.
9. Reboot the system.
10. The system shows (via the web page) that everything is fine (parity valid).
July 1, 2007 Author
Question: disk corruption. Is this corruption in the data (content) on the disk, or in the underlying structure of the disk? I guess if the data can still be read, then it must be in the underlying structure. If so, and you copy the data to a new location, does the corruption follow?
This probably seems like a stupid question, but I know next to nothing about this sort of thing.
July 2, 2007
RobJ suggested listing your equipment. That would be a good start. You may also want to flash the board's BIOS.
It sounds like there is a miscommunication about what sleep or standby means somewhere between the SW, the OS, the controller, your board, and the drives. (Note: I see this was already surmised above.)
On many boards there is a setting defining how deep the sleep is; perhaps that is worth checking out.
Bill
July 3, 200719 yr Wow! That looks worse than mine! http://lime-technology.com/forum/index.php?topic=796.0 (check Disk6 and Disk2, copied from Disk1) This is the third unRAID system within a week reporting corruption in the drive list. I have a few comments, but I'm not an expert, so take them as you wish. The missing drive temperatures seem somewhat significant to me, as those are quite popular Seagate 500GB drives that normally show their SMART data, the temps. When you get the system back up, you might try parsec's cat commands, found here http://lime-technology.com/forum/index.php?topic=754.0. The fact that SMART functionality does not seem to be working makes me wonder if the system is configured optimally? I've read others here, with barely working systems, who made suggested changes in CMOS that resulted in large improvements. For example, SATA drives originally configured as IDE drives, were changed to Enhanced or AHCI in the CMOS. Perhaps if you listed your equipment, at least the motherboard and any drive controllers and how much memory etc, someone with the same or similar might be able to suggest appropriate settings that could improve performance and enable SMART to function right. The syslog parts above start a little too late, well after the failure. They are the cascading errors resulting from the previous failures. We need to see the start of the log and up to and including the errors where everything goes down. The start of the log documents the unRAID hardware detection and setup, which might explain some of the problems by itself. After that, the very first errors associated with failures are the most important of all error messages. Any errors after them may be helpful, but are more likely a result of the problems, not the cause. Most likely, Tom will want to see your entire syslog. If you don't get better advice, I suggest running reiserfsck on disk1 and disk2, 'reiserfsck /dev/sdb1' and 'reiserfsck /dev/sdc1' I believe. 
I think Disk2 is sdc, but it's not mentioned in the syslog pieces above. Disk2 (md2) is already lost/missing before the first syslog piece. If reiserfsck gives further instructions like --fix-fixable or --rebuild-tree, then do them. This should clear any corruption on the drives.
Most interesting is that the screen-print of the drives shows NO serial numbers. This is another symptom of something going very wrong. Without serial numbers to differentiate between the drives, it is impossible for unRAID to know when they change. The garbage names shown when the disks fail all point to corruption of the superblock tracking the drives. It might be that the three instances of recent failures are related to the same bug, one that is not properly reading the serial numbers of the drives... Or, perhaps the serial numbers are longer than expected. Looking at the source code, it appears as if it can handle model numbers up to 40 characters in length and serial numbers up to 20 characters. I don't think the error checking in the following block of code used for SATA drives will catch everything it needs to... There's no check if "count" goes negative. There's no test in the while loop to ensure it does not go past the end of a serial number that returned a string of blanks. There is a test for overflow, but not for "underflow". Tom, got a few minutes to check this trend out? It might be a bug needing your attention. Joe L.
    data = buf+4;
    count = buf[3];
    /* skip leading white space */
    while (*data == ' ') { data++; count--; }
    /* don't overflow our serial_no field */
    if (count > sizeof(rdev->serial_no))
        count = sizeof(rdev->serial_no);
    memcpy( rdev->serial_no, data, count);
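For anyone following along, here is a rough sketch of how the two concerns raised above (a loop that can walk past a field of blanks, and a count that can drop below zero) might be guarded. This is not the unRAID driver code. The helper name copy_serial, its parameters, and the assumption that the length byte sits at buf[3] with the text starting at buf+4 are taken from the snippet or made up for illustration, and the real types in the driver may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not from the unRAID source): copy a serial number
 * out of a response buffer laid out like the snippet above, with the
 * length byte at buf[3] and the text starting at buf + 4.
 * Returns the number of characters stored in dst (excluding the NUL).
 */
static size_t copy_serial(char *dst, size_t dst_size,
                          const unsigned char *buf, size_t buf_len)
{
    assert(dst != NULL && buf != NULL);

    if (dst_size == 0)
        return 0;
    dst[0] = '\0';

    /* Need at least the 4 header bytes before reading buf[3]. */
    if (buf_len < 4)
        return 0;

    const unsigned char *data = buf + 4;
    size_t count = buf[3];

    /* Do not trust the length byte beyond what the buffer really holds. */
    if (count > buf_len - 4)
        count = buf_len - 4;

    /* Skip leading blanks, but stop once count is used up ("underflow"). */
    while (count > 0 && *data == ' ') {
        data++;
        count--;
    }

    /* Drop trailing blanks as well. */
    while (count > 0 && data[count - 1] == ' ')
        count--;

    /* Do not overflow the destination; leave room for the terminator. */
    if (count > dst_size - 1)
        count = dst_size - 1;

    memcpy(dst, data, count);
    dst[count] = '\0';
    return count;
}
```

With an unsigned count that is checked before every decrement, a serial field made up entirely of blanks simply comes back as an empty string instead of sending the loop past the end of the data. Whether this matches what the driver actually needs is for Tom to say.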
July 3, 200719 yr Author Hey....thx guys for taking the time to reply. All info is appreciated as I don't know anything about what the log means. Bill - I've added the specs to my sig. I can't currently check the BIOS settings because there is no video card installed. I'll have to buy one. However, the BIOS has not been changed/modified in any way since the machine was set up (unRAID has been working correctly), but checking the BIOS will verify the settings haven't changed. Joe - The drives are all correctly identified with the correct serial numbers on startup - verified in the webpage. The system happily works until the drives have spun down. When I try to access the drives for content the fault occurs - then the system seems to lose its way and the info. This is then resolved with a reboot and everything returns to normal - very strange. Please note: I did edit out the HDD serial numbers in the log - replaced with (Drive ID). I forgot to mention this. Seems like I'm going to have to get a video card and check out the BIOS. I'm feeling I might rebuild everything (re-format the HDDs etc.) and start from scratch again, as I don't have that much content on the drives. Currently, I've lost a lot of confidence in the ability of the system to maintain and protect my content.
July 3, 200719 yr Author Hey...it's just occurred to me: I changed out the PSU for a larger one when I upgraded my computer (the TT 430W that came with the case was swapped out for a TT 600W Toughpower). Could this be the cause? If so, what would be the difference between the two power supplies? I can't believe I forgot this. It would be about the right time frame as well!
July 4, 200719 yr "Bigger" power supply is defined as the total power made available, but storage boxes like unraid really are only looking for the 12V portion. In addition, many companies flat out lie about their PS power ratings. As a result, your 430 could have been pumping out more usable juice than the 600. Bill
July 4, 200719 yr Author "Bigger" power supply is defined as the total power made available, but storage boxes like unraid really are only looking for the 12V portion. In addition, many companies flat out lie about their PS power ratings. As a result, your 430 could have been pumping out more usable juice than the 600. Bill
Thx Bill for the help. Not sure what to do currently. I went and bought another video card today to get the unRAID BIOS viewable. Seems the BIOS settings haven't changed (as I suspected). As for the PSU - any suggestions what I should do? The larger (600W) one has 6x SATA connectors (3 on each of two leads) - I had all three drives tied into one lead. I have changed this and put one on another lead. Only as a last resort will I swap the original PSU back in, because it now resides in my new desktop and I don't really want to pull it all out. Both PSUs are made by the same company so I don't know what could be different. The "Toughpower" 600W one is supposedly a better spec, but who knows? Currently waiting for the drives to spin down (I have a question in "Software" about a command to do this - a separate issue I think). NOT having to wait an hour would be nice. Back to normal would be nice too, but I don't think so. I haven't been that lucky with tech lately - even bigger :(
July 4, 200719 yr Author OH...another question I meant to ask was about the ACPI aspect of unraid. Does unraid control this once settings are made in the bios? I can only assume unraid controls the drive spin down. I had some comments about them going into a really deep sleep state, but is this the same as drives just spinning down? Don't the drives only stop spinning, not go to sleep? I know my box doesn't switch into standby mode (S3) because all the fans keep running, so what is the critical factor in this equation? Just trying to make sense of it all??
July 4, 200719 yr ... Just trying to make sense of it all?? Reading through thread and looking at syslog, symptoms point to bad power, sorry, or possibly bad m/b.
July 5, 200719 yr Author Update. Well...I have some further developments. A drive went bang! It's dead. The circuit area just inside the SATA connector fried. Very nasty. So there is a possibility it was this drive all along causing the problems - but I'm not completely convinced. As the drive is only 8 weeks old, I returned it, only to be told various stories...blah, blah, blah. Long story short, they say it's physical damage and not covered under warranty (so typical). They are also adamant it's NOT the drive but the power supply that caused it. (Someone or something else is always to blame!!!!!) Anyway, the retailer is going to return it and see if Seagate will come to the party (a couple of days, I was told). I might have a fight on my hands to get a replacement. As I want to know what caused this, I phoned the local Thermaltake distributor and they are going to test my PSU next week. So hopefully, if the PSU tests fine I can point the finger at the drive. (It is currently working.) As a consequence of all this, I removed the damaged drive and rebuilt the array with only two drives. After the parity sync, everything is now working fine - even after the drives spin down. Using the "time" adjustment advice worked well. Thx Orb. I'm really hoping it was a dud drive so that I know the system is back to full potential. Getting the PSU tested will also help alleviate some of my concerns. Thx for all the help. Can't do much more until next week. :'(
July 5, 200719 yr They won't cover "physical damage" - LOL! As if there is some other kind of damage. Hopefully they weren't trying to insinuate you dropped it. Don't let them off the hook - if you can't get satisfaction, keep pushing. Send an email to maximumpc as they help resolve issues like this. Consider challenging it through your credit card. BTW, the majority of drive failures are due to infant mortality - lots-o-failures up front, then very few failures until 1-2 years out. Bill
July 5, 200718 yr Author They won't cover "physical damage" - LOL! As if there is some other kind of damage. Hopefully they weren't trying to insinuate you dropped it. Bill
Well, actually there is serious flash residue all over the exposed circuit part of the drive. Something went bang (not just a figure of speech)! I can't push too hard yet as I can't vouch for the PSU. Once that is tested and out of the way I can push back. As far as I'm concerned, the Consumer Guarantee Act here in NZ covers me, but not being a lawyer I guess it's open to interpretation. A page on the Ministry of Commerce web site clearly sets out the consumer's rights for faulty products - but when the cause is questionable, who knows. No doubt the retailer feels they will get lumped with the expense. At this point, the discussions are still amicable. I'm taking a "wait and see" approach. Not so for my MB return 2 weeks ago. The retailer was shocked I never returned the little plastic bags the MB accessories came in. He implied I should have set up the MB without taking the cables / accessories out of their plastic bags. Then, he tried to charge me a 20% re-stocking fee for returning a faulty MB. Naturally I was dumb-struck. As the argument went on, it seemed like he didn't think the MB problem was serious and was going to try and sell the board on as brand new. Go figure. Finally, after a heated debate, he gave me my money back. I'm beginning to believe the computer industry here doesn't think they have to meet their responsibilities like other business branches.
July 5, 200718 yr Otherwise this kind of stuff might be covered by your home insurance, provided it's worth it with the deductible and all.
July 11, 200718 yr Author An Update....... Had the PSU tested today by the local Thermaltake distributor. Everything proved as I suspected - the PSU is fine. I got a new drive on the way back that I installed once home. Currently all is working as it should - including spinning up after it's spun down. As for the old drive (it seems more and more likely this was the cause of all the dramas), I rang the retailer today. Things are looking up. He seemed to think the local Seagate distributor was going to replace it. However, they have shipped it off to Singapore for further testing because there was more damage to the drive than they expected - so have to wait another week to hear.
July 11, 200718 yr Very interesting thread that we have here. I too have a second unRAID server that I built and I am having trouble getting the drives to spin back up. I am going to tail the log here and see if it might be a bad drive. Very interesting about your drive how it went 'bang'... that is crazy!!