Another disk issue on storage01 cropped up, and I haven’t even finished documenting the fault from February 21, 2025. This time it’s sdb1 acting up. Thankfully, I was able to resolve the issue without any data loss, but it was a close call.

[Image: angry Kratos]

Incident

  • I/O Error on /dev/sdb1 during scrub operation.

Severity

  • High

Impact

  • Data corruption on /dev/sdb1, with potential data loss
  • sdb1 is one branch of the mergerfs volume served by storage01; losing access to it effectively halves the usable capacity of the NFS share (a sketch of the pool layout follows this list)
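
For context, the pool is just a union of ordinary filesystems. The fstab entry below is a hypothetical sketch of a two-branch mergerfs pool like this one; the paths and options are assumptions, not storage01’s actual config:

    # Hypothetical two-branch mergerfs pool: /mnt/data01 (the XFS
    # filesystem on sdb1) and /mnt/data02 unioned into one mount
    /mnt/data01:/mnt/data02  /mnt/storage  fuse.mergerfs  defaults,allow_other,category.create=mfs  0 0

Since mergerfs stores each file whole on a single branch, losing sdb1 removes that branch’s files and capacity while files on the other branch stay readable.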

Affected Systems

  • storage01

Timeline

  • 2025-02-27: I/O errors detected during snapraid scrub operation.
  • 2025-02-27: XFS filesystem on /dev/sdb1 shuts down due to log I/O errors.
  • 2025-03-01: Troubleshooting steps initiated to identify the root cause.

Logs from dmesg

The drive repeatedly aborted its Synchronize Cache commands, so the write of the XFS journal failed with an I/O error (-5) and XFS shut the filesystem down to prevent further damage:

    [145729.843018] sd 1:0:0:10: [sdb] tag#84 Sense Key : Aborted Command [current]
    [145729.843019] sd 1:0:0:10: [sdb] tag#84 Add. Sense: I/O process terminated
    [145729.843021] sd 1:0:0:10: [sdb] tag#84 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
    [145729.843023] I/O error, dev sdb, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 2
    [145729.850513] sd 1:0:0:10: [sdb] tag#201 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
    [145729.850517] sd 1:0:0:10: [sdb] tag#201 Sense Key : Aborted Command [current]
    [145729.850518] sd 1:0:0:10: [sdb] tag#201 Add. Sense: I/O process terminated
    [145729.850519] sd 1:0:0:10: [sdb] tag#201 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
    [145729.850540] I/O error, dev sdb, sector 3909372392 op 0x1:(WRITE) flags 0x9800 phys_seg 1 prio class 2
    [145729.850549] XFS (sdb1): log I/O error -5
    [145729.850569] XFS (sdb1): Log I/O Error (0x2) detected at xlog_ioend_work+0x6e/0x70 [xfs] (fs/xfs/xfs_log.c:1378).  Shutting down filesystem.
    [145729.850749] XFS (sdb1): Please unmount the filesystem and rectify the problem(s)

Detection

Alert from Discord

[Image: Discord notification]
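
For illustration only, a watcher behind an alert like this can be as small as the sketch below; the webhook URL and message format are placeholders, not my actual monitoring pipeline:

    #!/bin/sh
    # Hypothetical watcher: count I/O errors in the kernel log and ping
    # a Discord webhook when any are found. WEBHOOK_URL is a placeholder.
    WEBHOOK_URL="https://discord.com/api/webhooks/<id>/<token>"

    errors=$(dmesg | grep -c "I/O error")
    if [ "$errors" -gt 0 ]; then
        curl -s -X POST -H "Content-Type: application/json" \
             -d "{\"content\": \"storage01: $errors I/O error line(s) in dmesg\"}" \
             "$WEBHOOK_URL"
    fi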

Disk Metrics during fault

[Image: disk I/O metrics]

Server Power Consumption during fault

[Image: server power consumption]

Troubleshooting Steps

I started by trying to unmount the affected filesystem, but both umount /dev/sdb1 and umount /mnt/data01 failed because the device was busy. I then tried to see which processes were holding the mount with fuser -a /mnt/data01 and lsof /mnt/data01, but those commands also failed due to I/O errors.
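
As an aside, when a mount refuses to unmount and even fuser/lsof error out, a lazy unmount is one escape hatch. I didn’t end up needing it here, so treat this as a sketch rather than part of the fix:

    # Detach the mount point immediately; the kernel completes the
    # cleanup once the last user of the filesystem goes away
    umount -l /mnt/data01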

Next, I tried some hardware intervention: I stopped the VM and the server, replaced the SATA cable, moved the disk from the LSI HBA to a motherboard SATA port, and restarted everything. Unfortunately, that didn’t clear the errors.

Unmount Attempt
  • umount /dev/sdb1 and umount /mnt/data01 failed due to the device being busy.
Resource Usage Check
  • fuser -a /mnt/data01 and lsof /mnt/data01 failed to provide information due to I/O errors.
Hardware Intervention
  • VM and server stopped.
  • SATA cable replaced: swapped the slim blue SATA data cable for a thicker red one.
  • Disk moved off the LSI HBA card and attached directly to a free SATA port on the motherboard (see the note after this list).
  • Server and VM restarted.
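
One side effect of moving the disk off the HBA is that the kernel re-enumerated it: the drive that failed as /dev/sdb shows up as /dev/sdc in the repair output below. Device letters are not stable across controller changes, so it is worth confirming the physical drive by serial number; a generic sketch:

    # /dev/sdX names are assigned at enumeration time and can change when
    # a disk moves between controllers; match the drive by serial instead
    lsblk -o NAME,SERIAL,MODEL,SIZE

    # Persistent symlinks that survive re-enumeration
    ls -l /dev/disk/by-id/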

[Image: the new red SATA cable]

XFS Filesystem Repair and Bad Block Check

I decided to try repairing the XFS filesystem and checking the disk for bad blocks. With the disk now enumerating as /dev/sdc, I ran xfs_repair -n /dev/sdc1 to assess the filesystem without making any modifications, then xfs_repair /dev/sdc1 to fix the detected issues. While that was running, I started a bad block check with badblocks -v -n /dev/sdc.

  • xfs_repair -n /dev/sdc1 run to assess the filesystem without modifications.
  • xfs_repair /dev/sdc1 run to fix detected issues.
  • badblocks -v -n /dev/sdc initiated to check for bad blocks on the disk.

xfs_repair completed successfully after 7 hours and 48 minutes on the 4TB disk and repaired the filesystem. The badblocks scan finished after 2 days, 4 hours, and 25 minutes without finding a single bad block. This suggests the I/O errors were caused by filesystem corruption rather than a physical problem with the disk.
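A note on badblocks modes, since the flag choice matters. Per the badblocks man page, the default test is read-only, -n is the non-destructive read-write mode used here, and -w is a destructive write test:

    # Read-only scan (default): does not modify the disk
    badblocks -v -s /dev/sdc

    # Non-destructive read-write (-n): reads each block, writes test
    # patterns, then restores the original contents -- what I ran
    badblocks -v -n /dev/sdc

    # Destructive write test (-w): overwrites the ENTIRE disk; never use
    # it on a disk holding data you care about
    badblocks -v -w /dev/sdc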

xfs_repair and badblocks output

    [root@storage01 ~]# xfs_repair /dev/sdc1
    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
            - zero log...
            - scan filesystem freespace and inode maps...
            - found root inode chunk
    Phase 3 - for each AG...
            - scan and clear agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
            - setting up duplicate extent list...
            - check for inodes claiming duplicate blocks...
            - agno = 1
            - agno = 2
            - agno = 0
            - agno = 3
    clearing reflink flag on inodes when possible
    Phase 5 - rebuild AG headers and trees...
            - reset superblock...
    Phase 6 - check inode connectivity...
            - resetting contents of realtime bitmap and summary inodes
            - traversing filesystem ...
            - traversal finished ...
            - moving disconnected inodes to lost+found ...
    Phase 7 - verify and correct link counts...
    done

    [root@storage01 ~]# badblocks -v -n /dev/sdc
    Checking for bad blocks in non-destructive read-write mode
    From block 0 to 3907018583
    Testing with random pattern:

xfs_repair Phases

xfs_repair, as shown above, operates in a series of distinct phases:

  • Phase 1 (find and verify superblock): Locates and checks the superblock, a critical structure holding filesystem metadata.
  • Phase 2 (using internal log): Zeroes the journal log, then scans the freespace and inode maps.
  • Phase 3 (process AGs): Scans and clears allocation group inode (AGI) unlinked lists and discovers and processes known inodes.
  • Phase 4 (check for duplicate blocks): Identifies blocks claimed by more than one inode, which usually indicates corruption.
  • Phase 5 (rebuild AG headers and trees): Reconstructs the allocation group headers and B+trees that organize the filesystem.
  • Phase 6 (check inode connectivity): Verifies that every file and directory is reachable, moving disconnected inodes to lost+found.
  • Phase 7 (verify and correct link counts): Checks and corrects link counts, which track how many directory entries reference each file.

It also offers a “no modify” mode, invoked as xfs_repair -n /dev/sdc1. This mode skips the repair phases and only reports corruption, giving a safe way to assess the damage without making changes.
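
The exit status makes the dry run scriptable: per the xfs_repair man page, -n returns 1 when corruption is detected and 0 when the filesystem is clean. A minimal sketch:

    # Dry run first; -n makes no changes and exits non-zero on corruption
    if xfs_repair -n /dev/sdc1; then
        echo "filesystem clean, nothing to repair"
    else
        echo "corruption detected, running full repair"
        xfs_repair /dev/sdc1
    fi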


After the successful repair, I remounted the disk and ran a snapraid scrub; everything is now back to normal. The scrub report follows the sketch below.
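
A minimal sketch of that final step, assuming /mnt/data01 has its usual /etc/fstab entry:

    # Remount the repaired filesystem (relies on the /etc/fstab entry)
    mount /mnt/data01

    # Re-verify array data against parity, then print the status report
    snapraid scrub
    snapraid status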

        SCRUB finished [Tue Mar  4 11:56:11 PM +04 2025]
        ----------------------------------------
        Self test...
        Loading state from /mnt/data01/SnapRAID.content...
        Using 846 MiB of memory for the file-system.
        SnapRAID status report:

        Files Fragmented Excess  Wasted  Used    Free  Use Name
                Files  Fragments  GB      GB      GB
        554868      98     981   608.0    1867     841  69% d1
        541458     152     433   258.6    1500    1211  56% d2
        --------------------------------------------------------------------------
        1096326     250    1414   866.6    3367    2052  63%


        18%|                                               o
           |                                               *
           |                                               *
           |                                               *
           |                                               *
           |                                               *
           |                                               *
         9%|   o   o  o   o          o                     *  o                  o
           |   *   *  *   *      o   *                     *  *                  *
           |   *   *  *   *  o   *   *                     *  *                  *
           |   *   *  *   *  *   *   *                     *  *                  *
           |   *   *  *   *  *   *   *                     *  *                  *
           |   *   *  *   *  *   *   *                     *  *                  *
           |   *   *  *   *  *   *   *                     *  *                  *
         0%|*__*___*__*___*__*___*___*_____________________*__*__o_______________*
          19                    days ago of the last scrub/sync                 0

        The oldest block was scrubbed 19 days ago, the median 13, the newest 0.

        No sync is in progress.
        2% of the array is not scrubbed.
        You have 119 files with a zero sub-second timestamp.
        Run 'snapraid touch' to set their sub-second timestamps to a non-zero value.
        No rehash is in progress or needed.
        No error detected.
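
The report’s one piece of housekeeping is the 119 files with zero sub-second timestamps. The fix it suggests is quick; the sync afterwards is my assumption of the usual follow-up, not something shown in the report:

    # Give the flagged files a non-zero sub-second timestamp
    snapraid touch

    # Record the updated timestamps in the content/parity state
    snapraid sync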
