Abstract: Current digital-image-based methods for concrete bridge defect detection typically identify only a single defect category and achieve low segmentation accuracy. To address these problems, a refined semantic segmentation model named HDNet, built on an encoder-decoder architecture, was proposed. In the encoder, a hierarchical window-based self-attention mechanism was implemented that combines sliding-window partitioning with cross-layer residual connections to improve gradient propagation. A kernelized attention module was incorporated to strengthen gradient responses to local defects such as erosion and cracks, while suppressing interference from the background texture of the bridge. The decoder adopts a pixel-deformation dual-path architecture. The pixel path uses pointwise feature mapping to capture the morphological details of cracks, and the deformation path uses deformable convolutions to adaptively match the irregular geometric contours of spalling regions. Experiments were conducted on a high-resolution bridge defect dataset captured by an unmanned aerial vehicle (UAV), covering four defect categories: cracks, erosion, exposed rebar, and spalling. HDNet was compared with mainstream models such as DeepLabV3+ and SegFormer, followed by ablation studies, heatmap analysis, and real-bridge validation. The results show that HDNet achieves a mean Intersection over Union (mIoU) of 71.91% on the validation set, exceeding the second-best model, SegFormer, by 7.86%. Ablation studies confirm the necessity of kernelized attention (which improves mean recall (mRecall) by 5.83%), hierarchical sliding-window attention (which raises mIoU by 5.92%), and the synergistic design with the Dice loss function. Heatmap analysis shows that HDNet accurately captures defect texture details and separates the semantic boundaries of co-occurring defects.
In real-bridge testing, HDNet keeps the relative error of defect size measurement within ±5%, confirming its practical applicability. By combining encoder-decoder co-optimization with cross-resolution hierarchical enhancement, HDNet substantially improves recognition accuracy and robustness for complex bridge defects, providing a high-precision technique for the intelligent detection of bridge surface deterioration.
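The abstract reports results as mIoU and mRecall and attributes part of the gain to the Dice loss. The NumPy sketch below illustrates the standard definitions of these quantities. It is not the authors' evaluation or training code. The extra background class, the (C, H, W) probability layout, the `eps` smoothing term, and all function names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical label set: the four defect classes from the abstract plus an
# assumed background class (index 0). HDNet's actual label scheme may differ.
CLASS_NAMES = ["background", "crack", "erosion", "exposed_rebar", "spalling"]


def confusion_matrix(pred, target, num_classes):
    """Pixel-level confusion matrix: rows = ground truth, columns = prediction."""
    idx = target.ravel().astype(np.int64) * num_classes + pred.ravel().astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def miou_and_mrecall(cm):
    """Mean IoU and mean recall over classes from a confusion matrix."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp  # ground-truth pixels of the class that were missed
    fp = cm.sum(axis=0) - tp  # pixels wrongly assigned to the class
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (tp + fp + fn)
        recall = tp / (tp + fn)
    # Classes absent from both prediction and ground truth yield NaN and are skipped.
    return float(np.nanmean(iou)), float(np.nanmean(recall))


def soft_dice_loss(probs, onehot, eps=1e-6):
    """Soft Dice loss: 1 minus the mean per-class Dice score.

    probs:  predicted class probabilities, shape (C, H, W)
    onehot: one-hot ground truth, shape (C, H, W)
    eps:    assumed smoothing term that avoids division by zero
    """
    axes = (1, 2)
    inter = (probs * onehot).sum(axis=axes)
    denom = probs.sum(axis=axes) + onehot.sum(axis=axes)
    dice = (2.0 * inter + eps) / (denom + eps)
    return float(1.0 - dice.mean())


if __name__ == "__main__":
    target = np.array([[0, 1], [2, 2]])
    pred = np.array([[0, 1], [2, 1]])
    cm = confusion_matrix(pred, target, num_classes=3)
    miou, mrecall = miou_and_mrecall(cm)
    print(f"mIoU={miou:.4f}, mRecall={mrecall:.4f}")
    onehot = np.eye(3)[target].transpose(2, 0, 1)  # (H, W, C) -> (C, H, W)
    print(f"Dice loss (perfect prediction) = {soft_dice_loss(onehot, onehot):.4f}")
```

Because Dice is computed per class and then averaged, a thin crack with few pixels contributes as much to the loss as a large spalling region. This is one common motivation for pairing Dice loss with class-imbalanced defect segmentation, although the abstract does not state the authors' exact formulation.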