Abstract
To enhance the recognition accuracy of current network models for apple leaf diseases, a lightweight model that leverages an enhanced MobileNetV3-Small architecture is introduced in this study. The improved model uses MobileNetV3-Small, a lightweight architecture with few parameters, as the primary network for feature extraction. It integrates a weighted bi-directional feature pyramid network that fuses multi-scale features, enhancing the model's capacity to detect disease characteristics across various scales. Additionally, an efficient multi-scale attention mechanism is integrated to mitigate the influence of complex background noise in natural environments, further improving disease recognition accuracy. The experiments use the AppleLeaf9 public dataset to classify healthy apple leaves and eight distinct disease types. The results indicate that, on the augmented dataset, the improved model achieves a recognition accuracy of 95.98%, with only 1.72 M parameters, 123.16 M FLOPs, and an inference time of just 14.10 ms. Compared with eight other lightweight neural network models, including MobileNetV2, ShuffleNet_v2_1.5×, ResNet50, MobileNetV3-Large, EfficientNet-B0, MobileNetV3-Small, MobileNetV4-Conv-Small, and MobileNetV4-Conv-Medium, the improved model demonstrates superior accuracy. In particular, the proposed model improves recognition accuracy by 0.93 percentage points over the baseline MobileNetV3-Small model. The optimized model introduced in this study effectively improves the accuracy of apple leaf disease identification while maintaining a low parameter count and fast inference speed, offering a practical approach for deploying disease recognition models on agricultural electronic devices.
Keywords
0 Introduction
Apple is one of the most important cash crops in China[1]. Diseases affecting apple leaves have significantly hindered the healthy development of the apple industry[2]. Conventional approaches to detecting apple leaf diseases depend primarily on farmers' expertise and practical experience, which cannot support real-time disease monitoring and control of fruit trees. As a result, a disease is often not discovered until it has severely impacted apple quality, and in some cases it causes a significant reduction in apple production[3]. In recent years, advances in artificial intelligence have led to the extensive application of deep neural networks in disease recognition, since these networks can automatically extract features directly from raw input data[4]. Zhang et al.[5] introduced a hybrid attention network for recognizing citrus diseases, which combines a frequency-domain attention mechanism and an SE attention module within a ResNet backbone; they also employed a large convolution kernel to enlarge the receptive field. Zhang et al.[6] used the maximum inter-class variance (Otsu) method based on the excess-green feature to segment images, and applied transfer learning and a residual network to construct a millet disease recognition model, achieving 98.2% recognition accuracy for four types of millet diseases. Liang et al.[7] improved the SqueezeNet network by introducing a spatial attention mechanism and a dense connection module, constructing two models, SqueezeNet1 and SqueezeNet2, which achieved recognition accuracies of 89.60% and 94.37%, respectively, on an apple leaf disease dataset. Chen et al.[8] adopted DenseNet as the backbone architecture and substituted traditional convolutional layers with depthwise separable convolutions to decrease model parameters; their enhanced Mobile-DANet framework, which integrates both spatial and channel attention mechanisms, achieved a recognition accuracy of 98.5% on the open Plant Village maize disease dataset and 95.86% on a local dataset with complex backgrounds. Hu et al.[9] developed LE-EfficientNet for identifying grape leaf diseases, enhancing the EfficientNet-B0 architecture with a large kernel attention module and an efficient channel attention mechanism to improve disease-related feature extraction; experimental results showed that this method raised recognition accuracy by 1.58 percentage points on the Plant Village grape leaf disease dataset. Guo et al.[10] integrated the ET attention module into the MobileNetV3 architecture, optimized the model's fully connected layer and operators, and combined these changes with a transfer learning strategy to construct an apple leaf disease recognition model, achieving a disease recognition accuracy of 95.62%.
The studies mentioned above demonstrate that deep neural networks can achieve remarkable results in recognizing plant leaf diseases, even under conditions involving complicated backgrounds. Nevertheless, deep neural networks typically contain numerous parameters, resulting in high memory consumption on agricultural electronic devices deployed in farming environments, which limits their application in agricultural production. In addition, apple leaf disease recognition in natural environments faces further challenges, such as variation in lesion size and shape, scattered lesion distribution, and complex image backgrounds. To address these issues, this study presents MobileNet-BiFPN-EMA, a lightweight model for apple leaf disease recognition. The proposed model enhances recognition accuracy while keeping the parameter count low, reducing the memory burden on agricultural electronic devices.
1 Experimental Dataset and Pre-processing
The AppleLeaf9 dataset, created by Yang et al.[11], is used for the experiments in this paper. To guarantee precise labeling of images in the AppleLeaf9 dataset, Yang et al. invited agricultural disease experts to screen the data and remove mislabeled images, yielding a total of 14582 images, 94% of which were taken in natural environments with complex backgrounds. Only 2.5% of the images come from the PlantVillage (PVD) dataset with plain, static backgrounds, making the dataset more representative of real-world application scenarios and enhancing the model's ability to generalize in real-world environments.
Fig.1 illustrates sample images for each disease included in the AppleLeaf9 dataset. The dataset comprises healthy leaf samples and eight distinct disease categories: alternaria leaf spot, brown spot, frogeye leaf spot, gray spot, mosaic, powdery mildew, rust, and scab. Table1 provides detailed information for each disease type. To assess the performance of the model, the dataset is divided into training, validation, and test sets following a 3:1:1 ratio, containing 8754 training images, 2916 validation images, and 2912 test images, respectively. To enhance data diversity, six online data augmentation methods are applied to the training set images, including random cropping, random brightness and contrast adjustment, random rotation, random Gaussian blur, horizontal flipping, and vertical flipping, using the transforms provided by torchvision. The validation and test set images are not augmented to ensure the model's practicality. Fig.2 illustrates a representative example of a pre-processed image.
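As a concrete illustration, the six augmentations can be composed with torchvision as in the minimal sketch below; the specific magnitudes (crop scale, jitter strength, rotation range, blur kernel) are not reported in the paper and are illustrative assumptions.

```python
import torchvision.transforms as T

# Six online augmentations for the training set; magnitudes are illustrative.
train_transform = T.Compose([
    T.RandomResizedCrop(224),                                  # random cropping
    T.ColorJitter(brightness=0.2, contrast=0.2),               # brightness/contrast
    T.RandomRotation(degrees=30),                              # random rotation
    T.RandomApply([T.GaussianBlur(kernel_size=3)], p=0.5),     # random Gaussian blur
    T.RandomHorizontalFlip(p=0.5),                             # horizontal flip
    T.RandomVerticalFlip(p=0.5),                               # vertical flip
    T.ToTensor(),
])

# Validation/test images are only resized, never augmented.
eval_transform = T.Compose([T.Resize((224, 224)), T.ToTensor()])
```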

Fig.1 Example images from the dataset
2 Construction of Apple Leaf Disease Recognition Model
2.1 Principle of MobileNetV3-Small
MobileNetV3-Small[12] is a lightweight neural network architecture whose lineage traces back to 2017, when Google researchers proposed MobileNetV1[13]. Building on MobileNetV2[14], MobileNetV3-Small was designed using neural network architecture search, making it more accurate and efficient than versions V1 and V2. MobileNetV3-Small incorporates depthwise separable convolution (DSConv), the inverted residual block, and the SE (Squeeze-and-Excitation) attention mechanism[15], and uses the h-swish activation function in place of ReLU. Depthwise separable convolution effectively decreases the parameter count by dividing conventional convolution into two distinct steps: depthwise convolution (Dwise) and pointwise convolution (Pwise).
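A minimal PyTorch sketch of DSConv, with the two steps named as above (layer names are illustrative):

```python
import torch.nn as nn

# Depthwise separable convolution: a depthwise 3x3 conv (one filter per
# channel, groups=in_ch) followed by a 1x1 pointwise conv that mixes channels.
class DSConv(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.dwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                               padding=1, groups=in_ch, bias=False)
        self.pwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish()

    def forward(self, x):
        return self.act(self.bn(self.pwise(self.dwise(x))))
```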
Table1 Distribution of the dataset


Fig.2 Example images of dataset pre-processing
MobileNetV3-Small introduces a redesigned activation function, substituting the original ReLU with Hardswish, an enhanced variant of the swish activation function. The swish function is smoother than ReLU and can better capture the nonlinear features in images, thereby effectively improving the model's accuracy[16-17]. However, the exponential calculation involved in the swish function leads to increased computational complexity in the model. The Hardswish function approximates the swish function while requiring fewer computing resources, which makes it appropriate for scenarios with limited resources.
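For reference, the swish function and its hard approximation used by MobileNetV3 are defined as:

$$\mathrm{swish}(x) = x \cdot \sigma(x), \qquad \text{h-swish}(x) = x \cdot \frac{\mathrm{ReLU6}(x+3)}{6}$$

Replacing the sigmoid $\sigma(x)$ with the piecewise-linear ReLU6 term removes the exponential computation, which is what makes Hardswish cheaper on resource-limited hardware.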
As shown in Fig.3, MobileNetV3-Small incorporates the SE attention mechanism into the inverted residual module of version V2, forming the new block that serves as the core module of MobileNetV3-Small. The inverted residual module, shown on the left side of Fig.3, uses depthwise separable convolution and incorporates residual connections inspired by ResNet[18]. Whereas the residual module of ResNet first reduces and then restores the feature map's channel dimension around the convolution, the inverted residual module first expands the channel dimension and then compresses it after convolution. The SE attention mechanism is a channel attention method: through the SE module, the model autonomously learns the significance of each channel in the feature map and assigns channel-wise weights, suppressing less informative channels while emphasizing channels that carry more information. The SE attention module consists of a global average pooling layer and two fully connected layers, using ReLU and Hsigmoid activation functions, as shown on the right side of Fig.3.
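A minimal sketch of this SE block; the reduction ratio r=4 follows the MobileNetV3 paper and is an assumption here:

```python
import torch.nn as nn

# SE block: global average pooling, two FC layers (ReLU then Hardsigmoid),
# then channel-wise reweighting of the input feature map.
class SEBlock(nn.Module):
    def __init__(self, channels, r=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Hardsigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # scale each channel by its learned importance
```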

Fig.3 MobileNetV3-Small core block
Fig.4 illustrates the architecture of the MobileNetV3-Small model, where the first Bneck is a depthwise separable module and the remaining Bneck blocks follow the structure shown in Fig.3. When a Bneck-based inverted residual module has the same number of input and output channels and the depthwise convolution uses a stride of 1, the module employs a residual connection. Table2 provides the detailed network parameters of the MobileNetV3-Small model. In Table2, NBN indicates that no batch normalization layer follows the convolutional layer, #out is the number of output channels of each module, SE specifies whether the SE attention mechanism is used, NL denotes the type of activation function (#H is the h-swish function, #R is the ReLU function), s is the stride of the depthwise convolution, and k denotes the number of classification output channels.

Fig.4 Structure diagram of MobileNetV3-Small
Table2 Parameters of MobileNetV3-Small

2.2 Multi-Scale Feature Fusion Network
In convolutional neural networks, shallow, large-scale features have small receptive fields, allowing them to capture rich detail and precise location information, which is useful for detecting small targets. Deep, small-scale features have large receptive fields, enabling them to capture the overall structure of image features and rich semantic information, which benefits the detection of large targets. Because some image detail is lost during the network's repeated downsampling operations, the feature information of small targets can disappear. The weighted bi-directional feature pyramid network (BiFPN)[19] effectively captures multi-scale feature information with only a minor increase in parameters, thereby preserving small-target details and boosting their detection performance. Fig.5 presents the corresponding network architecture. BiFPN uses 1×1 convolutions in lateral connections so that feature maps with different channel numbers are mapped to an identical channel count. Multi-scale features are then integrated via top-down and bottom-up fusion paths. In the top-down pathway, fusing deep-layer features with shallow-layer features preserves the positional information of shallow layers while propagating deep semantic information downward. The bottom-up path transfers the target location information of shallow features upward, so the feature maps derived from the network contain abundant location and semantic information. Since feature maps at different scales focus on objects of different sizes and contribute differently to fusion, BiFPN introduces bidirectional cross-scale connections and fast normalized fusion, learning a weight for each input feature map. The fusion is computed as formula (1):
$$O = \sum_{i} \frac{w_i}{\varepsilon + \sum_{j} w_j} \cdot I_i \tag{1}$$
where $O$ is the output feature map, $I_i$ is the $i$-th input feature map, $w_i$ is the learnable weight of the $i$-th input feature map, and $\sum_j w_j$ is the sum of all input feature-map weights. The ReLU activation function is applied to each $w_i$ to ensure that its value is greater than or equal to 0, and $\varepsilon$ is set to 0.0001 to avoid instability caused by the denominator approaching 0. After fast normalization, each fusion weight lies in the range (0, 1). Taking $P_3^{out}$ as an example, the output feature map is derived through Eqs. (2) and (3):
$$P_3^{td} = \mathrm{DSConv}\!\left(\frac{w_1 \cdot P_3^{in} + w_2 \cdot \mathrm{Resize}\!\left(P_4^{td}\right)}{w_1 + w_2 + \varepsilon}\right) \tag{2}$$

$$P_3^{out} = \mathrm{DSConv}\!\left(\frac{w_1' \cdot P_3^{in} + w_2' \cdot P_3^{td} + w_3' \cdot \mathrm{Resize}\!\left(P_2^{out}\right)}{w_1' + w_2' + w_3' + \varepsilon}\right) \tag{3}$$
where $P_3^{td}$ and $P_4^{td}$ are the intermediate feature maps of the 3rd and 4th feature levels in the top-down fusion path, and $P_3^{out}$ and $P_2^{out}$ are the output feature maps of the 3rd and 2nd feature levels in the bottom-up fusion path. Resize denotes the up-sampling or down-sampling operation, $w$ denotes the connection weights between feature maps, and DSConv is depthwise separable convolution, employed primarily to lower the network's parameter count.
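A minimal PyTorch sketch of this fast normalized fusion for two inputs; the DSConv module applied after fusion is passed in (`nn.Identity()` works for a quick shape test):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Fast normalized fusion (Eq. (1)) for two inputs already resized to a
# common shape; dsconv is any depthwise separable convolution module.
class FastFusion2(nn.Module):
    def __init__(self, dsconv):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))  # learnable fusion weights
        self.dsconv = dsconv
        self.eps = 1e-4

    def forward(self, x1, x2):
        w = F.relu(self.w)             # keep weights non-negative
        w = w / (w.sum() + self.eps)   # fast normalization into (0, 1)
        return self.dsconv(w[0] * x1 + w[1] * x2)
```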

Fig.5 Structure diagram of BiFPN
2.3 Efficient Multi-Scale Attention
Since apple orchards are typically outdoor environments, most of the apple leaf disease images gathered for the experiment contain complex backgrounds that obstruct disease recognition and impair the model's capacity to detect smaller disease details. In this work, an Efficient Multi-scale Attention (EMA)[20] module is integrated to strengthen the model's robustness against background interference, allowing it to focus more effectively on disease regions. The network architecture is illustrated in Fig.6.
EMA reorganizes part of the input feature map's channel dimension into the batch dimension by grouping features. The input feature map is evenly divided into $G$ sub-feature maps along the channel dimension, ensuring that spatial semantic features are evenly distributed among the sub-features; this can be represented as $X = [X_0, X_1, \ldots, X_{G-1}]$ with $X_i \in \mathbb{R}^{(C//G) \times H \times W}$, where $G \ll C$ is generally used. The symbol "//" in "C//G" denotes integer division (rounding down).
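A one-line view of this grouping step in PyTorch (sizes are illustrative):

```python
import torch

# Fold part of the channel dimension into the batch dimension so that each
# of the G groups is processed independently; shapes follow the C//G notation.
x = torch.randn(8, 64, 56, 56)           # (B, C, H, W), illustrative sizes
G = 8
b, c, h, w = x.shape
groups = x.reshape(b * G, c // G, h, w)  # (B*G, C//G, H, W)
```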

Fig.6 Structure diagram of EMA
A multi-scale parallel network substructure is used to process the sub-features, model cross-channel information interaction, and derive attention weight descriptors from the sub-feature maps. In this parallel structure, the subnetwork containing the X Avg pool and Y Avg pool paths is labeled the 1×1 branch, and the subnetwork containing the 3×3 convolution is labeled the 3×3 branch. The 1×1 branch encodes the channels using one-dimensional global average pooling along the X-axis and Y-axis directions of the sub-feature map. The encoded channel information from the two directions is concatenated along the spatial dimension and passed through a shared 1×1 convolution. The output of the 1×1 convolution is split back into two feature tensors, and two channel attention maps are generated using the Sigmoid activation function. The original input sub-feature maps are then reweighted by multiplication to produce the output tensors of the 1×1 branch. This cross-channel interaction between the two pathways of the 1×1 branch lets the model focus on important channel features while minimizing the loss of significant channel information. The 3×3 branch employs a single 3×3 convolution to capture a larger receptive field, thereby boosting the ability to extract multi-scale features.
EMA uses cross-spatial learning to model both short-range and long-range dependencies between features. The output tensor of the 1×1 branch, after group normalization, is encoded with two-dimensional global average pooling, and the pooled descriptor $O_1 \in \mathbb{R}^{1 \times (C//G)}$ is activated by the Softmax function. The matrix dot product of $O_1$ with the 3×3 branch output tensor, reshaped to $\mathbb{R}^{(C//G) \times HW}$, produces the first spatial attention map, capturing spatial information at various scales. Symmetrically, the 3×3 branch output is passed through 2-D global average pooling and the Softmax function to produce the descriptor $T_3 \in \mathbb{R}^{1 \times (C//G)}$, and the 1×1 branch output tensor, reshaped to $\mathbb{R}^{(C//G) \times HW}$, is multiplied by $T_3$ via the matrix dot product to generate a second spatial attention map, collecting more precise spatial information. The two spatial attention maps are added and activated with the Sigmoid function, and the result is multiplied by the original sub-feature map to yield EMA's final output. Cross-spatial learning integrates global context information and local features, enabling EMA to highlight global context relationships while capturing pixel-level correlations.
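For concreteness, the following is a minimal PyTorch sketch of an EMA block following the structure in Fig.6 and the publicly released EMA reference implementation; the group count and layer hyperparameters are illustrative assumptions rather than the exact configuration used in this paper.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        self.g = groups
        self.softmax = nn.Softmax(dim=-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))            # 2-D global avg pool
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))      # pool along W (Y path)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))      # pool along H (X path)
        self.gn = nn.GroupNorm(channels // groups, channels // groups)
        self.conv1x1 = nn.Conv2d(channels // groups, channels // groups, 1)
        self.conv3x3 = nn.Conv2d(channels // groups, channels // groups, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.size()
        gx = x.reshape(b * self.g, c // self.g, h, w)      # fold groups into batch
        # 1x1 branch: directional pooling, shared 1x1 conv, channel attention
        xh = self.pool_h(gx)                               # (bg, c/g, h, 1)
        xw = self.pool_w(gx).permute(0, 1, 3, 2)           # (bg, c/g, w, 1)
        hw = self.conv1x1(torch.cat([xh, xw], dim=2))      # concat spatially
        xh, xw = torch.split(hw, [h, w], dim=2)
        out1 = self.gn(gx * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch: larger receptive field for multi-scale features
        out3 = self.conv3x3(gx)
        # Cross-spatial learning: two attention maps from pooled descriptors
        t1 = self.softmax(self.agp(out1).reshape(b * self.g, -1, 1).permute(0, 2, 1))
        m1 = torch.matmul(t1, out3.reshape(b * self.g, c // self.g, -1))
        t3 = self.softmax(self.agp(out3).reshape(b * self.g, -1, 1).permute(0, 2, 1))
        m3 = torch.matmul(t3, out1.reshape(b * self.g, c // self.g, -1))
        weights = (m1 + m3).reshape(b * self.g, 1, h, w).sigmoid()
        return (gx * weights).reshape(b, c, h, w)
```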
2.4 MobileNetV3-Small Model Improvement
To improve MobileNetV3-Small's feature extraction performance, this study uses MobileNetV3-Small as the backbone network and presents an enhanced MobileNet-BiFPN-EMA model that integrates BiFPN with EMA. Its structure is shown in Fig.7.

Fig.7 Structure diagram of MobileNet-BiFPN-EMA
Since nine categories (eight diseases plus healthy leaves) need to be identified in this paper, the number of classification outputs of the improved model is set to 9. The feature map output by the Bneck11 module is downsampled with adaptive average pooling to obtain the feature map C5, which is fed into the BiFPN module through lateral connections together with C1 to C4.
To strengthen local texture extraction within the BiFPN module, a parameter-efficient 3×3 group convolution is used in the lateral connections to uniformly adjust the channel dimension of the C1 through C5 feature maps extracted from the backbone network to 64. Because the EMA module performs cross-channel information interaction and captures pixel-level fine-grained image features through cross-spatial learning, the model can more precisely localize the various features of the target object, reducing the impact of background interference. Therefore, the EMA module is applied to the BiFPN output feature maps Pout1 to Pout4, further strengthening the model's ability to distinguish disease regions from intricate backgrounds across multiple scales and enhancing overall performance.
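A minimal sketch of such a lateral connection; the group count is an assumption, since the paper does not report it:

```python
import torch.nn as nn

# Lateral connection: a 3x3 group convolution maps each backbone feature map
# C1..C5 to 64 channels before it enters BiFPN. groups=4 is illustrative and
# must divide both the input and output channel counts.
def lateral_conv(in_ch, out_ch=64, groups=4):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1,
                  groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.Hardswish(),
    )
```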
3 Experiment and Result Analysis
3.1 Experimental Environment and Parameter Setting
The experiments ran on a 64-bit Windows 10 operating system, using the PyTorch 2.3.1 deep learning framework, the CUDA 11.8 parallel computing architecture, and the cuDNN 8.7.0 library. PyCharm served as the development environment, and Python 3.8.18 was the programming language. The CPU was an Intel® Core™ i5-10200H @ 2.40 GHz, and the GPU was an NVIDIA GeForce GTX 1650.
The initial learning rate was set to 0.001, with the model trained using the Adam optimizer. The batch size was established at 16, the input image size was adjusted to 224×224, and the number of training epochs was specified as 160. The dropout layer's deactivation rate was set to 0.2 to prevent overfitting.
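A minimal sketch of this training configuration in PyTorch is given below; `train_loader` is an assumed DataLoader yielding 224×224 image batches of size 16, and torchvision's stock mobilenet_v3_small stands in for the improved model.

```python
import torch
from torch import nn, optim
from torchvision import models

# Stand-in for MobileNet-BiFPN-EMA: baseline with 9 output classes.
model = models.mobilenet_v3_small(num_classes=9)
optimizer = optim.Adam(model.parameters(), lr=0.001)  # initial LR 0.001
criterion = nn.CrossEntropyLoss()

for epoch in range(160):                  # 160 training epochs
    model.train()
    for images, labels in train_loader:   # assumed: batch size 16, 224x224
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```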
3.2 Experimental Results of the Improved Model
The confusion matrix in Fig.8 presents the model's test results. The rows of the confusion matrix represent the true labels of the samples, while the columns represent the model's predictions. The diagonal values in Fig.8 give the number of correct identifications for each disease category. The confusion matrix shows that the improved model has lower recognition accuracy for alternaria leaf spot and gray spot but higher recognition accuracy for the other diseases. Specifically, the improved model misidentifies 6 alternaria leaf spot samples as rust, 4 alternaria leaf spot samples as gray spot, and 5 gray spot samples as alternaria leaf spot. This is likely because the lesions of these three diseases have similar contours and sizes, which increases the likelihood of misidentification.
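For reference, a confusion matrix such as Fig.8 can be accumulated on the test set as in the following sketch; `model` and `test_loader` are assumed to be defined as in the training setup above.

```python
import numpy as np
import torch

# Accumulate a 9x9 confusion matrix: rows are true labels, columns predictions.
num_classes = 9
cm = np.zeros((num_classes, num_classes), dtype=int)
model.eval()
with torch.no_grad():
    for images, labels in test_loader:
        preds = model(images).argmax(dim=1)
        for t, p in zip(labels.tolist(), preds.tolist()):
            cm[t, p] += 1
```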

Fig.8 Confusion matrix of the predictions by the improved model
3.3 Ablation Experiment
To assess the effectiveness of the proposed improvements, the BiFPN module is first incorporated into the MobileNetV3-Small network, creating the MobileNet-BiFPN model. To further evaluate the EMA attention mechanism adopted in this paper, the SE attention mechanism and the Coordinate Attention (CA) mechanism were substituted for EMA in the MobileNet-BiFPN-EMA model, yielding the MobileNet-BiFPN-SE and MobileNet-BiFPN-CA models. Table3 reports the test results for each model, where the inference time denotes the time required to process a single image. After adding the BiFPN module to the benchmark model, accuracy improves by 0.58 percentage points, and the average recall and average F1-score rise by 2.08 and 0.22 percentage points, respectively. The parameter count increases by 0.19 M and the model size by 0.82 MB, which is relatively small and remains within the memory constraints of agricultural electronic equipment. The FLOPs increase by 41.6 M and the inference time by 6.96 ms.
Table3 Results of the ablation experiments

The MobileNet-BiFPN-EMA model achieves an accuracy of 95.98%, surpassing the MobileNet-BiFPN model by 0.31 percentage points. The average recall and average F1-score increase by 2.01 and 0.83 percentage points, respectively, demonstrating that the EMA attention mechanism enhances model accuracy without adding extra parameters. Compared with the MobileNetV3-Small model, the MobileNet-BiFPN-EMA model shows an increase of 61.98 M in FLOPs, a 9 ms rise in inference time, and a 0.84 MB increase in model size.
Compared with the improved model using the SE attention mechanism and the CA attention mechanism, the improved model using the EMA attention mechanism has fewer parameters and a smaller model size. Additionally, the model accuracy is increased by 0.38 percentage points compared with the SE attention mechanism and by 0.2 percentage points compared with the CA attention mechanism. This indicates that the EMA attention mechanism is more suitable for the recognition task than the SE and CA attention mechanisms.
The loss curve of the training set for the model before and after improvement is shown in Fig.9. The training loss value of the improved model decreases the fastest in the early stages and is lower than that of the MobileNetV3-Small and MobileNet-BiFPN models in the later stages, gradually stabilizing at 0.13.
This paper uses Grad-CAM to visualize the effect of the improvements. Grad-CAM creates a heatmap by merging gradient information with the feature maps of a convolutional neural network, clearly identifying the image regions the model focuses on when making decisions. The redder a region in the heatmap, the more attention the model gives it; the bluer the region, the less attention it receives[21]. Fig.10 displays the heatmaps before and after the model improvements. As shown in Fig.10, when the disease features are dispersed across the leaf, the improved model captures the disease region more completely; when the disease features are concentrated, the improved model locates the feature region more accurately. This is mainly because the BiFPN module improves the model's capacity to capture features at multiple scales, while the EMA attention mechanism enables the model to focus precisely on key disease features.
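The following is a minimal hook-based Grad-CAM sketch (not the authors' exact visualization code); `model`, `image` (a preprocessed 1×3×224×224 tensor), and `target_layer` (the last convolutional layer to visualize) are assumed to be defined.

```python
import torch
import torch.nn.functional as F

# Capture the target layer's activations and their gradients via hooks.
feats, grads = {}, {}
def fwd_hook(_, __, out): feats["a"] = out.detach()
def bwd_hook(_, gin, gout): grads["a"] = gout[0].detach()

h1 = target_layer.register_forward_hook(fwd_hook)
h2 = target_layer.register_full_backward_hook(bwd_hook)

model.eval()
logits = model(image)
logits[0, logits.argmax()].backward()       # gradient of the top-scoring class

weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # GAP over the gradients
cam = F.relu((weights * feats["a"]).sum(dim=1))      # weighted feature sum
cam = F.interpolate(cam[None], size=image.shape[-2:], mode="bilinear")[0]
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]

h1.remove(); h2.remove()
```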

Fig.9 Comparison of the training set loss curves of the model before and after improvement
3.4 Comparative Experiment
To assess the improved model's performance in recognizing apple leaf diseases, eight lightweight neural network models (ShuffleNet_v2_1.5×, MobileNetV2, ResNet50, MobileNetV3-Large, EfficientNet-B0, MobileNetV3-Small, MobileNetV4-Conv-Small, and MobileNetV4-Conv-Medium) were selected for comparison experiments. Each model was trained on the same training dataset for 160 epochs with identical hyperparameter settings and in the same hardware environment. Table4 presents the experimental results on the test set. As shown in Table4, the accuracy of all comparison models exceeded 94%, with EfficientNet-B0 achieving the highest accuracy among the eight, at 95.90%, albeit with higher FLOPs and a larger model size. The FLOPs of the benchmark MobileNetV3-Small were 14.87% of those of EfficientNet-B0, and its model size was 38.10% of EfficientNet-B0's, while still achieving an accuracy of 95.05%, demonstrating that MobileNetV3-Small delivers high accuracy at low resource consumption. The improved model achieves the highest accuracy, 95.98%, with a lower parameter count, FLOPs, inference time, and model size than all other models except the benchmark. It also surpasses the other models in average precision and F1-score. Compared with MobileNetV2, it reduces the parameter count and FLOPs by 23.21% and 62.25%, respectively; compared with EfficientNet-B0, the reductions are 57.21% and 70.07%, demonstrating its suitability for memory-constrained device environments. Furthermore, the model's inference time is only 14.10 ms, allowing swift and precise identification of apple leaf diseases.

Fig.10 Comparison of the heatmaps before and after model improvement
Table4 Results of the comparison experiments

3.5 Data Augmentation Experiment
Plant disease identification often faces the challenge of limited sample data. This study employs six online data augmentation techniques (random cropping, random brightness and contrast adjustment, random rotation, random Gaussian blur, horizontal flipping, and vertical flipping) to process the original images in multiple ways; simulating different acquisition conditions in this way increases the diversity of the training samples. These augmentations enable the model to learn more comprehensive features during training, thereby improving its generalization ability. To evaluate the effectiveness of data augmentation, we compared the results of the benchmark MobileNetV3-Small model and the improved MobileNet-BiFPN-EMA model trained on the non-augmented dataset with those trained on the augmented dataset. Fig.11 displays the training loss curves, and Table5 presents the test-set results. As shown in Fig.11, on the augmented dataset the loss curves of both the benchmark and improved models converge faster during the first 40 epochs, converge more smoothly in later stages, and reach lower loss values than on the unaugmented dataset, which converges more slowly. As shown in Table5, using augmented data improves the accuracy of the benchmark model by 4.56 percentage points, with average precision, recall, and F1-score improving by 6.56, 5.82, and 6.29 percentage points, respectively. The improved model gains 3.16 percentage points in accuracy, with average precision, recall, and F1-score improving by 5.16, 4.49, and 4.83 percentage points, respectively. These results demonstrate that data augmentation significantly enhances the model's disease identification capability.

Fig.11 Comparison of the training set loss curves with and without data augmentation
Table5 Classification results of the data augmentation experiment

4 Conclusions
This paper presents a lightweight improved model, MobileNet-BiFPN-EMA, based on the MobileNetV3-Small architecture and aimed at enhancing the recognition accuracy of apple leaf diseases. By incorporating the BiFPN module and the EMA attention mechanism, the model effectively leverages both shallow and deep features, thereby improving its feature extraction performance. Comparative experiments on the AppleLeaf9 dataset reveal that the improved model outperforms the original MobileNetV3-Small model, with gains of 0.93, 0.29, 1.73, and 1.05 percentage points in accuracy, average precision, average recall, and average F1-score, respectively. Furthermore, the model's parameter count and size are only 1.72 M and 7.07 MB, respectively, indicating a good balance between model size and recognition performance and making it well suited to memory-constrained environments. The ablation experiments showed that the improved model identifies scattered disease features better after integrating the BiFPN module, and that adding the EMA attention mechanism further improves the identification of disease features. The data augmentation experiment showed that recognition accuracy can be effectively improved by training on the augmented dataset.
This study has several limitations:
1) First, the model still has difficulty accurately distinguishing three diseases with similar appearances: alternaria leaf spot, gray spot, and rust. Additionally, the dataset is limited in the diversity of apple leaf diseases it includes. To address these limitations, we plan to expand the dataset by collecting disease images from both online and offline sources.
2) While the current dataset covers diseases of only one crop, future efforts will focus on acquiring images of various crop diseases from natural environments, which will enhance and validate the model's generalization capability across different agricultural diseases.
3) The size and computational complexity of the improved model can still be further reduced. In subsequent stages, optimization techniques, including model pruning and quantization, will be explored to refine the model. Additionally, the optimized model will be deployed in agricultural electronic equipment to better identify and address challenges encountered in production practices.