A Global-Local Part-Shift Network for Gait Recognition
doi: 10.11916/j.issn.1005-9113.24064
Guizhi Li , Weiwei Fang
School of Computer Science, Beijing Information Science and Technology University, Beijing 100192 , China
Abstract
Gait recognition, a promising biometric technology, relies on analyzing individuals' walking patterns and offers a non-intrusive and convenient approach to identity verification. However, gait recognition accuracy is often compromised by external factors such as changes in viewpoint and attire, which present substantial challenges in practical applications. To enhance gait recognition performance under diverse viewpoints and complex conditions, a global-local part-shift network is proposed in this paper. This framework integrates two novel modules: the part-shift feature extractor and the dynamic feature aggregator. The part-shift feature extractor strategically shifts body parts to capture the intrinsic relationships between non-adjacent regions, enriching the recognition process with both global and local spatial features. The dynamic feature aggregator addresses long-range dependency issues by incorporating multi-range temporal modeling, effectively aggregating information across parts and time steps to achieve a more robust recognition outcome. Comprehensive experiments on the CASIA-B dataset demonstrate that the proposed global-local part-shift network delivers superior performance compared with state-of-the-art methods, highlighting its potential for practical deployment.
0 Introduction
Gait recognition, a technique for identifying a person from the distinctive way they walk, has gained increasing interest recently because of its potential uses across various domains such as security, forensics, and healthcare. An individual's gait makes it possible to identify them from a distance and without requiring any physical contact.
While gait recognition has many potential benefits, it also poses some challenges. Gait patterns can be affected by camera viewpoints and clothing, making it difficult to identify individuals accurately.
In order to solve these problems, many methods based on deep learning have been proposed. He et al.[1] proposed a multi-task generative adversarial learning network that learns view-specific feature representations and improves the representational ability of the features. GaitSet[2] regards the gait sequence as an unordered set, obtains more discriminative features through a dual-branch structure for global and local feature extraction, and finally uses pyramid pooling for feature mapping to obtain the final representation. To obtain more fine-grained spatial information, GaitPart[3] divides each gait image into different parts and extracts features from each part separately, models temporal features through a proposed micro-motion capture module, and uses an attention mechanism to obtain short-term temporal features while eliminating redundant long-range information.
These methods focus only on the local information within the receptive field of the convolution kernel on a single image, ignoring the internal feature relations between different parts. Therefore, to mine the internal relationships between different parts and improve global perception while maintaining the local perception ability of the features, a global-local part-shift network is proposed in this study to improve the discriminative ability of the features:
1) We propose a part-shift module that recombines features across different parts before feature extraction, improving the module's global perception ability.
2) For the global-local features obtained from different parts, we propose a dynamic temporal aggregator to model temporal features. This module captures not only short-range but also long-range features, which alleviates the long-range dependency problem and improves temporal modeling.
3) Extensive experiments on the CASIA-B dataset demonstrate the superiority of our method.
1 Related Work
Current gait recognition methods based on deep learning can broadly be categorized into model-based methods[4-6] and appearance-based methods[2-3, 7-8]. Below, we review and critically analyze representative methods from both categories, explicitly highlighting their limitations and clearly positioning our method in relation to them.
1.1 Model-Based Methods
Model-based approaches primarily rely on extracting structured features such as human pose or body joints to achieve gait recognition. For instance, PoseGait[5] leverages 3D human pose estimation to effectively mitigate the impact of viewpoint and clothing variations. It combines CNN and LSTM to model spatial and temporal information jointly and employs a multi-loss strategy for optimization. GaitGraph[6] utilizes 2D pose estimation and extracts spatial features via graph convolutional networks (GCNs), thus effectively capturing the structural dependencies among body joints. Furthermore, the approach proposed in Ref. [4] uses a multi-linear human model and fine-tunes a pre-trained human mesh recovery (HMR) network for gait recognition. Despite the effectiveness of these methods in handling pose variations, they are highly dependent on accurate pose estimation. Pose inaccuracies or occlusions can severely deteriorate their performance. Additionally, these methods usually focus primarily on structural information while neglecting richer appearance-based cues.
1.2 Appearance-Based Methods
Appearance-based methods directly extract features from silhouette images or gait sequences without explicitly modeling the human pose[9-16]. A representative work, Gait Energy Image (GEI)[17], compresses temporal gait information into a single template image, effectively representing individual walking patterns. However, the primary drawback of GEI-based methods, as pointed out in Ref. [18], is the substantial loss of temporal dynamics. GEnI[18] addresses robustness concerns by proposing gait entropy images, but temporal information loss persists. Extending GEI, Periodic Energy Image (PEI)[19] proposes a multi-channel gait template for capturing more discriminative features via adversarial training, yet this method remains view-dependent and still compromises temporal granularity[20-24]. To alleviate temporal information loss, recent methods utilize raw gait frame sequences as network inputs. The gait lateral network structure in Ref. [7] effectively captures discriminative compact features directly from gait contour sequences, significantly reducing feature redundancy without accuracy loss. Meanwhile, GaitPart[3] emphasizes local spatial-temporal characteristics by segmenting gait sequences into distinct body parts to extract fine-grained features. GaitSlice[8] further enhances subtle spatial interactions among adjacent gait parts through its SED module, incorporating a frame attention mechanism for temporal feature selection. Although these appearance-based methods have made considerable progress, a common limitation is their predominant reliance on local receptive fields via convolution operations, which restricts them to localized spatial-temporal interactions. Consequently, they fail to explicitly model the inherent global relationship among different gait parts.
Different from the aforementioned methods, our proposed global-local part-shift network explicitly addresses these limitations. Specifically, our part-shift module strategically recombines features across distinct gait parts, capturing intrinsic relationships between non-adjacent regions. This approach enhances global spatial perception while simultaneously preserving local feature integrity. To further mitigate limitations in temporal modeling, particularly concerning long-range dependencies, we introduce a dynamic feature aggregator. This module employs multi-range temporal modeling to effectively integrate both short-range and long-range temporal interactions within gait sequences, thus significantly improving the robustness and discriminative power of the recognition process.
2 Method
In this section, we first provide a brief introduction to our proposed approach, followed by a detailed description of the Part-Shift Feature Extractor (PSFE) module, and finally present the Dynamic Feature Aggregator (DFA) module.
2.1 Overview
As shown in Fig.1, our network receives a sequence of gait contours x_i as input, where i ∈ {1, 2, ···, T} and T denotes the number of frames. x_i is first fed into the PSFE, which enriches the features by mining the internal relations among different body parts and outputs the global part-aware vector P_g and the local part-aware vector P_l:
P_g^i, P_l^i = PSFE(x_i)    (1)
Afterwards, a Horizontal Pooling (HP) operation (with the max function as the pooling strategy in this study) maps the features to the decisive spatial feature vectors D_g and D_l:
D_g^i, D_l^i = HP(P_g^i, P_l^i)    (2)
The decisive spatial features are sent to the DFA. The DFA contains two sub-modules, Multi-Range Modeling (MRM) and Feature Aggregator (FA) . MRM maps features into short-range temporal features Ts and long-range temporal features Tl in the temporal dimension:
T_s^{m,i}, T_l^{m,i} = MRM(D_m^i)    (3)
where m = g, l; that is, both the global salient spatial features D_g^i and the local salient spatial features D_l^i are sent to the MRM for temporal modeling, but they do not share parameters.
Then, the global-local multi-range temporal information is sent to the FA for mapping to obtain the fusion feature Fi:
F_i = FA(T_m^{n,i})    (4)
where m = s, l and n = g, l; that is, the four types of global-local, long- and short-range features are all used as input.
Finally, F_i is mapped through a separate Fully Connected (FC) layer to obtain the final feature representation X_i:
X_i = FC(F_i)    (5)
Fig.1 Overview of the proposed method
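To make the pipeline of Eqs. (1)-(5) concrete, the following minimal NumPy sketch illustrates the horizontal pooling step of Eq. (2): a frame-level feature map is split along its height into horizontal strips, and each strip is max-pooled into one part-level vector. The channel count, strip count, and input size are illustrative assumptions, not values fixed by the paper.

```python
import numpy as np

def horizontal_pooling(feat, num_parts=4):
    # feat: (C, H, W) frame-level feature map. Split the height axis into
    # `num_parts` horizontal strips and max-pool each strip over its
    # spatial extent, yielding one C-dim vector per part (Eq. (2)).
    strips = np.split(feat, num_parts, axis=1)             # num_parts x (C, H/num_parts, W)
    return np.stack([s.max(axis=(1, 2)) for s in strips])  # (num_parts, C)

feat = np.random.rand(32, 16, 11)   # e.g. C=32 channels, H=16, W=11
parts = horizontal_pooling(feat)
print(parts.shape)                  # (4, 32)
```

In practice the same pooling is applied to both the global and the local part-aware feature maps before they enter the DFA.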
2.2 PSFE
The part-shift extractor aims to improve the perception of long-range spatial information while retaining the ability to extract fine-grained spatial features. It consists of two Part-Shift Convolution (PSConv) modules. The PSConv is introduced in detail below.
There are primarily two approaches to frame-level spatial feature extraction: the ordinary 2D convolution method and the part-based method. The part-based approach involves segmenting the input frame into distinct regions, followed by the application of 2D convolution within each region. However, both methods are limited to capturing local spatial features, lacking the capacity to model comprehensive spatial relationships across the entire frame. To address this limitation, we propose a part-shift convolution module grounded in a part-aware approach, which facilitates the exchange of positions across segmented regions. As illustrated in Fig.2, the part-shift operation makes previously non-adjacent body parts adjacent, allowing convolutional operations to explicitly capture interconnections among these parts and enhance global spatial modeling.
The specific operation is illustrated in Fig.3. The input feature map is divided into distinct segments, referred to as gait stripes. Initially, the sequentially arranged raw feature maps are processed by a convolutional neural network to model features within each segment. Subsequently, the part-shift operation reorders the feature maps across different segments.
Fig.2 Comparison of the perception range before and after the part-shift operation
Fig.3 The detailed structure of the PSConv module
After exchanging the feature map, we model the original non-adjacent feature points through 2D convolution to obtain the intrinsic connection between long-distance spatial feature points.
A part-shift convolution module can be expressed as follows:
F_i = Conv2d(PartShift(Conv2d(x_i)))    (6)
where i ∈ {1, 2, ···, T} and T denotes the number of frames in the input gait sequence.
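The PartShift step of Eq. (6) can be sketched with NumPy as a cyclic reordering of the horizontal stripes of a feature map, so that a subsequent convolution sees previously non-adjacent parts side by side. The stripe count and the cyclic shift of one stripe are illustrative assumptions; the paper does not pin down these hyperparameters here.

```python
import numpy as np

def part_shift(feat, num_parts=4, shift=1):
    # feat: (C, H, W). Split the height axis into `num_parts` gait stripes
    # and cyclically rotate their order so previously non-adjacent body
    # parts become neighbours before the second Conv2d of Eq. (6).
    stripes = np.split(feat, num_parts, axis=1)
    shifted = stripes[shift:] + stripes[:shift]   # cyclic rotation
    return np.concatenate(shifted, axis=1)        # same (C, H, W) shape

x = np.arange(2 * 8 * 3, dtype=float).reshape(2, 8, 3)
y = part_shift(x)                                 # stripe order becomes 1,2,3,0
print(np.array_equal(y[:, :2], x[:, 2:4]))        # True: stripe 0 is old stripe 1
```

Because the shift is a pure re-indexing, it adds no parameters; only the convolutions before and after it learn weights.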
2.3 Dynamic Temporal Feature Aggregator
The dynamic feature aggregator further performs temporal modeling of global-local, frame-level spatial features across various granularities to address challenges related to long-range dependencies. This module comprises two independent multi-range feature aggregators, which do not share parameters, as well as the final feature aggregator. A detailed introduction to these modules will be provided subsequently.
Unlike other biometric recognition techniques, gait recognition uniquely requires feature modeling across two domains: temporal and spatial. Thus, after extracting frame-level spatial features through the PSFE, additional temporal feature modeling is necessary to capture dynamic characteristics.
2.3.1 Multi-range temporal modeling
Feature modeling in the temporal dimension is often susceptible to long-range dependency issues. To enhance the expressiveness of these features, we employ a multi-scale approach to capture features at distinct scales. Specifically, convolutional kernels of varying sizes are utilized to convolve features within the temporal domain. As convolutional kernels of different sizes correspond to different receptive fields, they influence the convolution operation's perception range uniquely. Larger convolutional kernels yield a broader receptive field, which helps mitigate long-range dependency issues. Accordingly, we apply convolution operations across two scales to capture features at varying temporal resolutions, as shown below:
Temporal_longrange = Conv_l(x_i)    (7)
Temporal_shortrange = Conv_s(x_i)    (8)
where x_i refers to the feature map sequence, i ∈ {1, 2, ···, T}, T denotes the number of frames, and Conv_l and Conv_s denote temporal convolutions with large and small kernels, respectively.
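A minimal NumPy sketch of the multi-range idea in Eqs. (7)-(8): the same frame-level feature sequence is convolved along the time axis with a small and a large kernel, giving short-range and long-range temporal views. The uniform (averaging) kernels and the sizes 3 and 7 are assumptions for illustration; the actual module uses learned kernels.

```python
import numpy as np

def temporal_conv(seq, k):
    # seq: (T, C) frame-level feature vectors. Apply a depthwise 1D
    # convolution over the time axis with a uniform kernel of size k and
    # 'same' zero padding, so the output keeps T time steps.
    pad = k // 2
    padded = np.pad(seq, ((pad, pad), (0, 0)))   # pad the time axis only
    kernel = np.full((k, 1), 1.0 / k)            # uniform averaging kernel
    T = seq.shape[0]
    return np.stack([(padded[t:t + k] * kernel).sum(axis=0) for t in range(T)])

seq = np.random.rand(30, 64)          # T=30 frames, C=64 channels per frame
short_range = temporal_conv(seq, 3)   # small receptive field, Eq. (8)
long_range = temporal_conv(seq, 7)    # large receptive field, Eq. (7)
print(short_range.shape, long_range.shape)  # (30, 64) (30, 64)
```

The larger kernel aggregates evidence from more frames per output step, which is precisely how the module mitigates long-range dependency issues.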
2.3.2 Feature aggregator
This module aggregates the global-local feature vectors by weighting them. As shown in Fig.4, the feature aggregator consists of two parallel branches: the part-score branch and the weighted-function branch. The former further models each spatio-temporal feature vector to highlight the most discriminative features; specifically, for each feature vector, an importance score is computed through an attention mechanism, as follows:
PartScore = Sigmoid(Conv1d(Gelu(Conv1d(x_i))))    (9)
In the part-score branch, the original feature vector is then mapped to a scoring feature vector using a max operation. This transformation can be formally expressed as:
ScoringVector = Max(PartScore(x_i))    (10)
The weighted function is used to aggregate prominent temporal features and process the features to eliminate possible temporal domain aliasing. The specific operation of the weighted function is shown as follows:
WeightedFunction = AvgPool(·) + MaxPool(·)    (11)
Through the above formula, we obtain the weight vector. The decisive spatio-temporal feature vector is then obtained through the element-wise product of the weight vector and the scoring vector, as follows:
DecisiveVector = WeightedVector ⊙ PartVector    (12)
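The two-branch aggregation of Eqs. (9)-(12) can be sketched as follows. Plain linear maps with a ReLU stand in for the two Conv1d layers and GELU of Eq. (9), and the weights w1, w2 are random placeholders rather than learned parameters; the sketch only shows how the scores, pooling, and element-wise product compose.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_aggregator(feats, w1, w2):
    # feats: (N, C) - the N global/local x short/long-range vectors for a part.
    # Part-score branch: per-element importance scores (stand-in for Eq. (9)),
    # then a max reduction over the N vectors (Eq. (10)).
    hidden = np.maximum(feats @ w1, 0)              # linear + ReLU stand-in
    scores = sigmoid(hidden @ w2)                   # (N, C) importance scores
    scoring = (feats * scores).max(axis=0)          # Eq. (10)
    # Weighted-function branch: average plus max pooling over parts, Eq. (11).
    weight = feats.mean(axis=0) + feats.max(axis=0)
    return weight * scoring                         # Eq. (12): element-wise product

rng = np.random.default_rng(0)
feats = rng.random((4, 64))      # four global/local, short/long-range vectors
w1 = rng.random((64, 64)) * 0.05
w2 = rng.random((64, 64)) * 0.05
fused = feature_aggregator(feats, w1, w2)
print(fused.shape)               # (64,)
```

In the real module the two branches are learned jointly with the rest of the network; the sketch fixes only the dataflow.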
Fig.4 The structure of the dynamic feature aggregator
3 Experiments
Extensive experiments were conducted on the widely used CASIA-B dataset to evaluate the proposed algorithm in this section. The relevant details of the CASIA-B dataset[25] are introduced in Section 3.1. In Section 3.2, the experimental settings of the proposed algorithm are described, and in Section 3.3, a comparison of experimental results between the proposed method and current state-of-the-art methods on the CASIA-B dataset[25] is presented.
3.1 Datasets
As a popular benchmark dataset for evaluating gait recognition, CASIA-B[25] contains the gait sequences of 124 subjects. Each subject (numbered 001 to 124) has 110 sequences in total, captured from 11 viewpoints spaced 18° apart (0° to 180°). For every viewpoint, 10 sequences are collected under three scenarios: six sequences of normal walking (NM), two carrying a backpack (BG), and two wearing a coat (CL).
We follow dataset partitioning strategies from Refs.[3]and [2], dividing CASIA-B into Large-sample training (LT) , Medium-sample training (MT) , and Small-sample training (ST) setups, with specified training/testing splits, using NM (#01–04) sequences as gallery sets and NM (#05–06) , BG (#01–02) , CL (#01–02) sequences as probe subsets.
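The LT/MT/ST partitions follow the standard protocol of Refs. [2] and [3]: the first 74, 62, or 24 subjects (in ID order) form the training set and the remainder the test set. A small sketch of this split (the exact subject ordering is the conventional ID order assumed by that protocol):

```python
# Reconstruct the three CASIA-B training/testing partitions by subject ID.
subjects = [f"{i:03d}" for i in range(1, 125)]   # IDs 001-124
splits = {"ST": 24, "MT": 62, "LT": 74}          # number of training subjects
for name, n_train in splits.items():
    train, test = subjects[:n_train], subjects[n_train:]
    print(f"{name}: {len(train)} training / {len(test)} testing subjects")
```

This yields 24/100, 62/62, and 74/50 training/testing subjects for ST, MT, and LT, respectively.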
3.2 Implementation Details
Experiments were conducted on NVIDIA RTX 3090 GPUs. The silhouettes were aligned, and the network was trained for 100,000 iterations with the Adam optimizer (learning rate 1e-4, momentum 0.2), the batch-all separate triplet loss (margin 0.2), and a batch size of (8, 12).
3.3 Comparison to State-of-the-Art Methods
To assess the performance of our approach under cross-view conditions, we conducted a detailed comparison with several leading methods. As presented in Table 1, our method achieves higher recognition accuracy across most viewpoints compared with the seven reference methods. Specifically, for the NM/BG/CL conditions, the average recognition accuracy of our method surpasses that of GaitSlice[8] by 0.6%, 1.7%, and 2.3%, respectively. Although our proposed global-local part-shift network significantly improves upon previous single-modal gait recognition methods, it achieves slightly lower average recognition accuracy compared with recent multimodal approaches such as AttenGait[26] and GMSN[27], specifically 1.9% and 4% lower, respectively. This performance gap is understandable given that multimodal methods leverage complementary information from multiple modalities (e.g., appearance, pose, or depth data), naturally granting them higher discriminative capabilities.
However, the method proposed in this paper utilizes only single-modal gait silhouettes and still achieves competitive accuracy. Moreover, our approach introduces notable improvements in modeling internal relationships among non-adjacent body parts and effectively addresses long-range temporal dependencies within gait sequences. These enhancements offer substantial value, especially in scenarios where multimodal data collection may be constrained by privacy concerns, deployment costs, or hardware limitations. Future research could explore integrating our global-local part-shift network with complementary modalities to further bridge this performance gap.
Table 1 Rank-1 accuracy averaged across three experimental settings on CASIA-B, excluding same-view comparisons
Additionally, we evaluated the performance of our approach in data-limited scenarios, with experimental results illustrated in Fig.5. These results indicate that our method significantly outperforms both GaitSlice [8] and MT3D[30]. In particular, the average recognition accuracy under ST and MT settings is 1.8% and 1.4% higher than GaitSlice[8], and 0.8% and 1.1% higher than MT3D[30], respectively. The superior accuracy under MT further underscores the robustness and efficiency of our method compared with GaitSlice[8].
Fig.5 Benchmarking against leading approaches under ST/MT configurations
4 Conclusions
In this work, we introduce a distinct global-local representation learning framework that leverages a part-shift mechanism specifically designed to overcome constraints from insufficient visual information and external interferences in gait recognition. The proposed framework comprises two specialized modules: the PSFE and the DFA. Specifically, the PSFE strategically shifts and recombines posture segments to capture intrinsic connections between non-adjacent body parts, thereby significantly enriching both global and local spatial features. Concurrently, the DFA employs multi-range temporal modeling, effectively preserving short-range spatio-temporal patterns and addressing long-range temporal dependencies. Extensive experiments conducted on the widely used CASIA-B dataset confirm the effectiveness and competitive performance of our method, highlighting its practical applicability and robustness against challenging variations.
References
[1] He Y, Zhang J, Shan H, et al. Multi-task GANs for view-specific feature learning in gait recognition. IEEE Transactions on Information Forensics and Security, 2019, 14(1): 102-113. DOI: 10.1109/TIFS.2018.2844819.
[2] Chao H, He Y, Zhang J, et al. GaitSet: Regarding gait as a set for cross-view gait recognition. Proceedings of the 33rd AAAI Conference on Artificial Intelligence. Honolulu: AAAI Press, 2019: 8126-8133. DOI: 10.1609/aaai.v33i01.33018126.
[3] Fan C, Peng Y, Cao C, et al. GaitPart: Temporal part-based model for gait recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2020: 14225-14233. DOI: 10.1109/CVPR42600.2020.01423.
[4] Li X, Makihara Y, Xu C, et al. End-to-end model-based gait recognition. Proceedings of the Asian Conference on Computer Vision. Berlin: Springer, 2020, 12642: 3-20. DOI: 10.1007/978-3-030-69535-4_1.
[5] Liao R, Cao C, Garcia E B, et al. Pose-based temporal-spatial network (PTSN) for gait recognition with carrying and clothing variations. Biometric Recognition: 12th Chinese Conference, CCBR 2017. Berlin: Springer, 2017: 474-483. DOI: 10.1007/978-3-319-69923-3_51.
[6] Teepe T, Khan A, Gilg J, et al. GaitGraph: Graph convolutional network for skeleton-based gait recognition. Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP). Piscataway: IEEE, 2021: 2314-2318. DOI: 10.48550/arXiv.2101.11228.
[7] Hou S, Cao C, Liu X, et al. Gait lateral network: Learning discriminative and compact representations for gait recognition. Computer Vision - ECCV 2020. Lecture Notes in Computer Science. Cham: Springer, 2020, 12354: 382-398. DOI: 10.1007/978-3-030-58545-7_22.
[8] Li H, Qiu Y, Zhao H, et al. GaitSlice: A gait recognition model based on spatio-temporal slice features. Pattern Recognition, 2022, 124: 108453. DOI: 10.1016/j.patcog.2021.108453.
[9] Wan J, Zhao H, Li R, et al. Omni-domain feature extraction method for gait recognition. Mathematics, 2023, 11(12): 2612. DOI: 10.3390/math11122612.
[10] Aung S T Y, Kusakunniran W. A comprehensive review of gait analysis using deep learning approaches in criminal investigation. PeerJ Computer Science, 2024: e2456. DOI: 10.7717/peerj-cs.2456.
[11] Xiao J, Yang H, Xie K, et al. Learning discriminative representation with global and fine-grained features for cross-view gait recognition. CAAI Transactions on Intelligence Technology, 2022, 7(2): 187-199. DOI: 10.1049/cit2.12051.
[12] Wang Z, Hou S, Zhang M, et al. LandmarkGait: Intrinsic human parsing for gait recognition. Proceedings of the 31st ACM International Conference on Multimedia. New York: ACM, 2023: 2305-2314. DOI: 10.1145/3581783.3611840.
[13] Huo W, Wang K, Tang J, et al. GaitSCM: Causal representation learning for gait recognition. Computer Vision and Image Understanding, 2024, 243: 103995. DOI: 10.1016/j.cviu.2024.103995.
[14] Feng Y, Yuan J, Fan L. GaitFusion: Exploring the fusion of silhouettes and optical flow for gait recognition. International Conference on Artificial Neural Networks. Cham: Springer, 2023: 88-99.
[15] Wang L, Chen J, Liu Y. Frame-level refinement networks for skeleton-based gait recognition. Computer Vision and Image Understanding, 2022, 222: 103500. DOI: 10.1016/j.cviu.2022.103500.
[16] Zhang Z, Wei S, Xi L, et al. GaitMGL: Multi-scale temporal dimension and global-local feature fusion for gait recognition. Electronics, 2024, 13(2): 257. DOI: 10.3390/electronics13020257.
[17] Han J, Bhanu B. Individual recognition using gait energy image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006, 28(2): 316-322. DOI: 10.1109/TPAMI.2006.38.
[18] Bashir K, Xiang T, Gong S. Gait recognition using gait entropy image. 3rd International Conference on Imaging for Crime Detection and Prevention. Piscataway: IEEE, 2009: 1-6. DOI: 10.1049/ic.2009.0230.
[19] Wang K, Liu L, Lee Y, et al. Nonstandard periodic gait energy image for gait recognition and data augmentation. Pattern Recognition and Computer Vision: Second Chinese Conference, PRCV 2019. Cham: Springer, 2019: 197-208. DOI: 10.1007/978-3-030-31723-2_17.
[20] Xiong J, Zou S, Tang J. DFGait: Decomposition fusion representation learning for multimodal gait recognition. International Conference on Multimedia Modeling. Cham: Springer, 2024: 381-395. DOI: 10.1007/978-3-031-53311-2_28.
[21] Bai S, Chang H, Ma B. Incorporating texture and silhouette for video-based person re-identification. Pattern Recognition, 2024, 156: 110759. DOI: 10.1016/j.patcog.2024.110759.
[22] Chen B, Niu T, Yu W, et al. A-Net: An A-shape lightweight neural network for real-time surface defect segmentation. IEEE Transactions on Instrumentation and Measurement, 2023, 73: 1-14. DOI: 10.1109/TIM.2023.3341115.
[23] Cui C, Liu L, Qiao R. A cutting-edge video anomaly detection method using image quality assessment and attention mechanism-based deep learning. Alexandria Engineering Journal, 2024, 108: 476-485. DOI: 10.1016/j.aej.2024.07.103.
[24] Liu F, Wang Q, Xiao Y, et al. An efficient and effective pore matching method using ResCNN descriptor and local outliers. Pattern Recognition, 2025, 163: 111446. DOI: 10.1016/j.patcog.2025.111446.
[25] Yu S, Tan D, Tan T. A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. Proceedings of the 18th International Conference on Pattern Recognition (ICPR'06). Piscataway: IEEE, 2006: 441-444. DOI: 10.1109/ICPR.2006.67.
[26] Castro F M, Delgado-Escaño R, Hernández-García R, et al. AttenGait: Gait recognition with attention and rich modalities. Pattern Recognition, 2024, 148: 110171. DOI: 10.1016/j.patcog.2023.110171.
[27] Wei T, Liu M, Zhao H, et al. GMSN: An efficient multi-scale feature extraction network for gait recognition. Expert Systems with Applications, 2024, 252: 124250. DOI: 10.1016/j.eswa.2024.124250.
[28] Qin H, Chen Z, Guo Q, et al. RPNet: Gait recognition with relationships between each body-parts. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 32: 2990-3000. DOI: 10.1109/TCSVT.2021.3095290.
[29] Fu Y, Meng S, Hou S, et al. GPGait: Generalized pose-based gait recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision. Paris: CVF, 2023: 19595-19604.
[30] Lin B, Zhang S, Bao F. Gait recognition with multiple-temporal-scale 3D convolutional neural network. Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 3054-3062. DOI: 10.1145/3394171.3413861.
