
MacDaddy

Members
  • Posts

    50
Posts posted by MacDaddy

  1. Thanks for the suggestion.  I will add attributes 1 and 200 to be monitored as you suggest. 

     

    The parity check completed with no errors.  The main screen shows 56 errors on the parity disk alone.  All other disks show 0.  I'm assuming these are all the correctable read errors.

     

    I believe I will use this opportunity to replace the parity drive with an 8TB drive.  This will give me the latitude to begin upgrading the data drives to 8TB as the storage is consumed.  Unless I'm just being hyper-paranoid, I won't add the former parity drive back to the disk pool.  If the risk is low to the point of non-existent, then I'd be willing to add it back in.
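For keeping an eye on attributes 1 and 200 from the command line, a minimal sketch follows. It assumes the usual `smartctl -A` table layout from smartmontools (attribute ID in column 1, name in column 2, raw value starting at column 10); the function name `smart_raw_values` is made up for illustration, and some drives print raw values with extra trailing text that this sketch ignores.

```shell
#!/bin/bash
# Sketch only: print ID, name, and raw value for SMART attributes 1 and 200.
# Reads `smartctl -A` output on stdin. Non-table lines are skipped because
# their first field is not the number 1 or 200.
smart_raw_values() {
    awk '$1 == 1 || $1 == 200 { print $1, $2, $10 }'
}
```

Usage would look like `smartctl -A /dev/sdX | smart_raw_values`, where `/dev/sdX` is a placeholder for the parity drive's device name.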

  2. I have had an MCE occurring since mid-December.  I've ordered some replacement memory that has been delivered and plan to install it tomorrow.  While I was awaiting the replacement memory, the parity drive alerted for read errors.  Before I reboot following the new memory installation, I wanted to post the current diagnostic information.  It appears that the HDD read errors were corrected, but I wanted to ask for help in determining whether the HDD read errors might be a false positive influenced by the memory errors.  If the drive is truly failing, I have no problem replacing it.  In this case, would adding the old parity drive back to the data pool be an unreasonable risk?  Thanks in advance for your advice.

    tower-diagnostics-20220110-1653.zip tower-smart-20220110-1648.zip tower-syslog-20220110-2300.zip

  3. Fix Common Problems alerted me to a previous MCE.  I'm seeing repeating entries for about 5 secs of:

    Oct 18 22:34:33 Tower kernel: mce: [Hardware Error]: Machine check events logged
    Oct 18 22:34:33 Tower kernel: EDAC sbridge MC1: HANDLING MCE MEMORY ERROR
    Oct 18 22:34:33 Tower kernel: EDAC MC1: 1 CE memory read error on CPU_SrcID#1_Ha#0_Chan#3_DIMM#0 (channel:3 slot:0 page:0x5ff304 offset:0xec0 grain:32 syndrome:0x0 -  area:DRAM err_code:0001:0091 socket:1 ha:0 channel_mask:8 rank:0)

    It corrected and I have not seen a repeat event.  Is this a one-time event that bears attention if it should happen again, or do I need to start looking for some new memory?  Thanks in advance for any advice you can offer.

    tower-diagnostics-20211024-1411.zip
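To judge whether this is a one-off or a trend, it can help to tally corrected errors per DIMM label over time. The sketch below assumes the EDAC message format shown in the log lines above (`EDAC MCn: <count> CE ... error on <label>`); the function name `count_ce_by_dimm` is hypothetical, and it counts log lines rather than summing the per-line CE count.

```shell
#!/bin/bash
# Sketch only: count EDAC corrected-error (CE) log lines per DIMM label.
# Reads syslog text on stdin and prints "<label> <line count>" per DIMM.
count_ce_by_dimm() {
    awk '/EDAC MC[0-9]+: [0-9]+ CE / {
        # The DIMM label is the field that follows the word "on".
        for (i = 1; i < NF; i++)
            if ($i == "on") { count[$(i + 1)]++; break }
    }
    END { for (d in count) print d, count[d] }'
}
```

Usage would be along the lines of `count_ce_by_dimm < /var/log/syslog`; a count that keeps growing on the same label points at that DIMM or its slot.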

  4. Any possibility to add sshpass?

     

    In conjunction with user.scripts I'm hoping to implement something like:

    #!/bin/bash
    #argumentDescription=Enter password and box name (mypass pihole)
    # $1 = SSH password, $2 = box name (used for the host, backup folder, and file name)
    sshpass -p "$1" ssh "pi@$2.rmac" "sudo dd bs=4M if=/dev/mmcblk0 status=progress | gzip -1 -" | dd of="/mnt/user/Backups/$2/$(date +%Y%m%d_%H%M%S)_$2.gz"

    I run 4 different Raspberry Pi boxes.  They run for a good long time, but I've just had the third SD card fail.  I would like to keep an image where I can recover quickly with minimal pain. 
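A possible variation on the idea above, as a sketch rather than a tested setup: `sshpass -e` reads the password from the `SSHPASS` environment variable, which keeps it out of the process list. The function names `backup_path` and `backup_pi` are made up for illustration; the host suffix `.rmac`, the device `/dev/mmcblk0`, and the `/mnt/user/Backups` root are taken from the script above.

```shell
#!/bin/bash
# Sketch only: helper names are illustrative, not part of any plugin.

# Build the destination path: <root>/<box>/<timestamp>_<box>.gz
backup_path() {
    local box="$1" stamp="$2" root="${3:-/mnt/user/Backups}"
    printf '%s/%s/%s_%s.gz\n' "$root" "$box" "$stamp" "$box"
}

# Stream a gzip-compressed image of the Pi's SD card to the array.
# Expects the SSH password in the SSHPASS environment variable.
backup_pi() {
    local box="$1" dest
    dest=$(backup_path "$box" "$(date +%Y%m%d_%H%M%S)")
    mkdir -p "$(dirname "$dest")"
    sshpass -e ssh "pi@$box.rmac" \
        "sudo dd bs=4M if=/dev/mmcblk0 | gzip -1 -" > "$dest"
}
```

It would be called as `SSHPASS=mypass backup_pi pihole`. Imaging a mounted, running SD card can capture files mid-write, so a restored image may need a filesystem check on first boot.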

  5. Thanks for the info. The 5400 RPM drives should in theory be quieter than their 7200 RPM counterparts. Good points on the airflow. Noise and airflow will go hand in hand. I’ll look into active cooling on the CPUs.


    Sent from my iPhone using Tapatalk

  6. I have a server based on a Supermicro X9DRi-LN4+/X9DR3-LN4+ with dual Xeon® E5-2630L v2 CPUs for my unRaid build.  It is a surplus server in a Supermicro CSE-835TQ-R920B case.  In my prior residence, I had the luxury of converting one of the closets to house all my equipment.  It was designed for power/ventilation/noise.  I'm now in a place where I can't modify any rooms, and the only location to house the equipment is a closet in the master bedroom.  Needless to say, the server sounds like a Hoover vacuum with asthma on steroids.  It has served me well, and I am thinking of transferring the M/B and 5xWD40EFRX to a silenced case.  I'm thinking something like the be quiet! Dark Base 900 https://www.bequiet.com/en/case/697 might work.

     

    Each CPU shows a 60W TDP.  Currently they have passive coolers with the custom air shroud from Supermicro.  I intend to keep them configured that way.  I would have to change from the redundant power supplies currently in the server chassis.  I would appreciate any suggestions in that area.

     

    What is your advice on the noise footprint after the conversion?  Any experience with this case or any other that might accommodate the M/B?  Any thoughts on potential roadblocks I might find?

     

    Thanks in advance for any input.

  7. Thanks for your response. I had a feeling it would go that way. This is my first encounter with corruption.

    When I complete the XFS repair it will prune data (according to the dry run output). Is that data lost for good or will unRaid recognize it and let parity reconstruct?



  8. I'm currently using a MakeMKV docker container to write cloned DVD structures into an MKV container.  I've noticed that a share I'm using for the output keeps dropping.  I can reboot and the array will start with all drives green and the share is restored.  A snippet from the log is attached.  I can start in maintenance mode and do a dry run of xfs_repair on all the hard drives.  All are clean except md2.

     

    Is it better to xfs_repair the md2 drive or replace it with a new drive and let it rebuild?  Note: while parity shows valid, it has been more than 700 days since the last check.

     

    Oct 13 18:36:15 Tower kernel: XFS (md2): Metadata CRC error detected at xfs_inobt_read_verify+0xd/0x3a [xfs], xfs_inobt block 0x19f754db8 
    Oct 13 18:36:15 Tower kernel: XFS (md2): Unmount and run xfs_repair
    Oct 13 18:36:15 Tower kernel: XFS (md2): First 128 bytes of corrupted metadata buffer:
    Oct 13 18:36:15 Tower kernel: 0000000095cfb836: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    Oct 13 18:36:15 Tower kernel: 000000001de8c0f3: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    Oct 13 18:36:15 Tower kernel: 00000000c8d99f19: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    Oct 13 18:36:15 Tower kernel: 00000000a9a413e7: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    Oct 13 18:36:15 Tower kernel: 000000003c326670: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    Oct 13 18:36:15 Tower kernel: 000000005abd08ab: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    Oct 13 18:36:15 Tower kernel: 000000003867ab1f: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    Oct 13 18:36:15 Tower kernel: 0000000085cdd1ba: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    Oct 13 18:36:15 Tower kernel: XFS (md2): metadata I/O error in "xfs_trans_read_buf_map" at daddr 0x19f754db8 len 8 error 74
    Oct 13 18:36:15 Tower kernel: XFS (md2): xfs_do_force_shutdown(0x1) called from line 300 of file fs/xfs/xfs_trans_buf.c.  Return address = 000000007c1ff77b
    Oct 13 18:36:15 Tower kernel: XFS (md2): I/O Error Detected. Shutting down filesystem
    Oct 13 18:36:15 Tower kernel: XFS (md2): Please umount the filesystem and rectify the problem(s)

     

    tower-diagnostics-20201013-2150.zip
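The dry run across all drives described above can be sketched as a small loop. This assumes the array is started in maintenance mode and that the data disks appear as `/dev/mdN` (consistent with the `XFS (md2)` log lines); the function name `dry_run_xfs` is hypothetical. It relies on `xfs_repair -n` making no changes and returning a non-zero status when it detects corruption.

```shell
#!/bin/bash
# Sketch only: run xfs_repair in no-modify mode (-n) on each given device
# and report whether it flagged corruption. Nothing is written to disk.
dry_run_xfs() {
    local dev
    for dev in "$@"; do
        if xfs_repair -n "$dev" >/dev/null 2>&1; then
            echo "$dev: clean"
        else
            echo "$dev: corruption reported"
        fi
    done
}
```

An invocation might look like `dry_run_xfs /dev/md1 /dev/md2 /dev/md3`, with the device list adjusted to the actual array.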

  9. I’m resurrecting my unRaid box. It shows the latest update v6.5.3 is available. When I initiate the upgrade it throws an invalid URL/ server error message. Sorry for the pic, I’m on direct terminal.

     

    Is it possible that Amazon is down? Or maybe I need an intermediate upgrade step?

    d43a37cae7514a3e191117153254f968.jpg

     

     


  10. Heads up. I just updated the app on iPhone. Major features work as expected and alignment of tab scrolling is much improved.

    However it seems that the logs no longer display. I did a quick check for a plugin update on the server side. It appears I'm on the latest.

    Is there any diagnostic info I can send that would be helpful?



  11. Thanks so much for all the work on this.  I wanted to send some beer money.  On the Apps tab, there are links for Statistics and Credits in the upper right.  Selecting them brings up the expected response with the familiar PayPal donate button.  It took four attempts to get the donation sent - it terminated many times with a cryptic "fatal error" message.  I think donations are an essential function:-).  Thanks for keeping the unRaid apps world safe.

  12. I've used this script for reclaiming two of my older 1TB disks from other equipment into my unRaid server.  It was great to still be able to use the existing array with all of my normal shares still in a protected state while the new drives were clearing.  I am now in the process of preclearing a 2TB WD EARS drive and got some interesting syslog entries.  I'm a Linux newb, so I would appreciate any insight on this.  By the way, this is on one of the Limetech MD1510 machines and this particular disk is the first (and only) on the second SAS.

     

    Right after the login at 23:21:36 I kicked off the preclear and went to bed.  The hard resetting of the link worries me. 

     

    Jun  8 23:19:47 Tower unmenu-status: Starting unmenu web-server
    Jun  8 23:21:36 Tower login[3702]: ROOT LOGIN  on `tty1'
    Jun  8 23:22:24 Tower kernel:  sda: unknown partition table
    Jun  8 23:47:56 Tower sSMTP[7314]: Creating SSL connection to host
    Jun  8 23:47:57 Tower sSMTP[7314]: SSL connection using DHE-RSA-AES256-SHA
    Jun  8 23:47:58 Tower sSMTP[7314]: Sent mail for root@localhost (221 2.0.0 omta04.emeryville.ca.mail.comcast.net comcast closing connection)
    Jun  8 23:51:52 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
    Jun  8 23:51:52 Tower kernel: ata4.00: failed command: IDENTIFY DEVICE
    Jun  8 23:51:52 Tower kernel: ata4.00: cmd ec/00:00:00:00:00/00:00:00:00:00/00 tag 0 pio 512 in
    Jun  8 23:51:52 Tower kernel:          res 40/00:ff:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
    Jun  8 23:51:52 Tower kernel: ata4.00: status: { DRDY }
    Jun  8 23:51:52 Tower kernel: ata4: hard resetting link
    Jun  8 23:51:58 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
    Jun  8 23:52:02 Tower kernel: ata4: SRST failed (errno=-16)
    Jun  8 23:52:02 Tower kernel: ata4: hard resetting link
    Jun  8 23:52:06 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Jun  8 23:52:06 Tower kernel: ata4.00: configured for UDMA/133
    Jun  8 23:52:06 Tower kernel: ata4: EH complete
    Jun  9 00:00:55 Tower kernel: ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
    Jun  9 00:00:55 Tower kernel: ata4.00: failed command: READ FPDMA QUEUED
    Jun  9 00:00:55 Tower kernel: ata4.00: cmd 60/00:00:00:5b:10/02:00:00:00:00/40 tag 0 ncq 262144 in
    Jun  9 00:00:55 Tower kernel:          res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    Jun  9 00:00:55 Tower kernel: ata4.00: status: { DRDY }
    Jun  9 00:00:55 Tower kernel: ata4: hard resetting link
    Jun  9 00:01:01 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
    Jun  9 00:01:05 Tower kernel: ata4: SRST failed (errno=-16)
    Jun  9 00:01:05 Tower kernel: ata4: hard resetting link
    Jun  9 00:01:11 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
    Jun  9 00:01:15 Tower kernel: ata4: SRST failed (errno=-16)
    Jun  9 00:01:15 Tower kernel: ata4: hard resetting link
    Jun  9 00:01:21 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Jun  9 00:01:21 Tower kernel: ata4.00: configured for UDMA/133
    Jun  9 00:01:21 Tower kernel: ata4.00: device reported invalid CHS sector 0
    Jun  9 00:01:21 Tower kernel: ata4: EH complete
    Jun  9 00:19:20 Tower kernel: mdcmd (394): spindown 9
    Jun  9 00:45:02 Tower kernel: mdcmd (548): spindown 0
    Jun  9 00:45:03 Tower kernel: mdcmd (549): spindown 2
    Jun  9 00:45:03 Tower kernel: mdcmd (550): spindown 5
    Jun  9 00:45:03 Tower kernel: mdcmd (551): spindown 8
    Jun  9 00:45:13 Tower kernel: mdcmd (553): spindown 2
    Jun  9 00:47:40 Tower sSMTP[12676]: Creating SSL connection to host
    Jun  9 00:47:41 Tower sSMTP[12676]: SSL connection using DHE-RSA-AES256-SHA
    Jun  9 00:47:42 Tower sSMTP[12676]: Sent mail for root@localhost (221 2.0.0 omta17.westchester.pa.mail.comcast.net comcast closing connection)
    Jun  9 04:00:17 Tower kernel: mdcmd (1719): spindown 0
    Jun  9 04:00:17 Tower kernel: mdcmd (1720): spindown 8
    Jun  9 05:02:12 Tower kernel: mdcmd (2090): spindown 0
    Jun  9 07:40:07 Tower kernel: usb 4-1: USB disconnect, address 2
    Jun  9 07:40:07 Tower kernel: usb 4-1.1: USB disconnect, address 3
    Jun  9 07:40:07 Tower kernel: usb 4-1.3: USB disconnect, address 4
    Jun  9 07:41:47 Tower unmenu[3699]: Disk /dev/sda doesn't contain a valid partition table
    Jun  9 07:44:03 Tower kernel: mdcmd (3068): spindown 8
    Jun  9 07:44:13 Tower kernel: mdcmd (3070): spindown 8
    Jun  9 07:44:13 Tower kernel: mdcmd (3071): spindown 9
    Jun  9 07:44:24 Tower kernel: mdcmd (3073): spindown 0
    Jun  9 07:44:35 Tower kernel: mdcmd (3075): spindown 0
    Jun  9 07:44:35 Tower kernel: mdcmd (3076): spindown 2
    Jun  9 07:44:45 Tower kernel: mdcmd (3078): spindown 5

     

    I'll attach the full syslog in case it has any further information that might help explain this.  Thanks for your help.

    syslog-2011-06-09.txt
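To see how often each port is being reset across a long syslog, a tally like the following may help. It is a sketch that assumes the standard syslog layout shown above, where the sixth whitespace-separated field is the `ataN:` port; the function name `count_link_resets` is made up for illustration.

```shell
#!/bin/bash
# Sketch only: count "hard resetting link" kernel messages per ATA port.
# Reads syslog text on stdin and prints "<port> <count>" for each port.
count_link_resets() {
    awk '/kernel: ata[0-9.]+: hard resetting link/ {
        port = $6
        sub(/:$/, "", port)   # strip the trailing colon from "ata4:"
        n[port]++
    }
    END { for (p in n) print p, n[p] }'
}
```

Usage would be `count_link_resets < syslog-2011-06-09.txt`. Resets confined to a single port usually narrow the search to that drive, its cable, or its backplane slot rather than the controller as a whole.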
