than ConvNet-based models at modelling local information. At the same time, the model complexity of ViT is high, which leads to low detection efficiency. No researcher has yet used ViT as a backbone network to improve the detection of local features in industrial defect images. It is therefore important to improve ViT so that defect detection accuracy increases while computational complexity is reduced. The DTN model presented in this paper can effectively address this problem.
In this article, we fuse pooling attention and window attention into a novel hybrid attention approach and employ DTN instead of layer normalization (LN) to form a simple and efficient transformer-based backbone network, the Dynamic Vision Transformer Using Hybrid Window Attention (DHT), for industrial image defect classification. DHT reduces model complexity and improves detection efficiency while capturing more local defect information, further improving detection accuracy; this gives it practical engineering value in the field of industrial defect detection.
The next section reviews the main image processing models and methods currently used in the field of vision. We then describe normalization methods and present the theory behind the attention mechanism module. Next, we introduce two representative industrial defect datasets and describe the evaluation metrics and parameter settings. Ablation experiments on the attention module and normalization methods are presented, and performance is compared with representative ConvNet-based and transformer-based models. Finally, we summarize the experimental results.
Related Work
ConvNet-based Models
Recently, with the rapid development of deep learning and the increase in computing power, new algorithmic models have emerged, and deep learning-based convolutional neural networks (CNNs) have made great strides in image processing. AlexNet, proposed by Krizhevsky et al. [4], used a CNN for image recognition and achieved excellent results on the large-scale ImageNet dataset [8], attracting wide attention from researchers. GoogLeNet [5] increases the width and depth of the network while keeping the computing budget constant, improving the utilization of computing resources within the network. ResNet [6] proposes a residual learning framework that reduces model complexity and shows better generalization. DenseNet [7] connects each layer to every other layer in a feed-forward manner, alleviating the vanishing-gradient problem and enhancing feature propagation while significantly reducing the number of parameters. In addition to the constructive work mentioned above, there have been other efforts [15],[16] to improve accuracy by refining model structure.
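To illustrate the residual idea behind ResNet [6], the following is a minimal sketch of a residual block; the layer sizes and activation choices are assumptions for illustration, not the exact configuration used in the original network.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the stacked layers learn only the residual
    F(x), and the identity shortcut adds x back to the output."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # identity shortcut

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))  # output has the same shape as the input
```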
Vision Transformer-based Models
ViT [9] has generated much enthusiasm: it splits the original image into a sequence of patches that are fed into transformer encoders, and it shows very competitive results in various visual tasks.
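To make the patch-sequence idea concrete, here is a minimal sketch of ViT-style patch embedding; the 224x224 input, 16x16 patches, and 768-dimensional embedding are the commonly reported ViT-Base settings and are used here only for illustration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch
    to a token embedding, producing the sequence fed to the encoders."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and
        # applying a shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, 3, H, W)
        x = self.proj(x)                                  # (B, D, H/16, W/16)
        return x.flatten(2).transpose(1, 2)               # (B, N, D) token sequence

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```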
Recently, convolution-based ideas have been introduced into ViT. For example, DeiT [10] introduces a teacher-student strategy specific to transformers, in which the student learns from the teacher through attention, typically with a ConvNet teacher. PVT [11] introduces a pyramid structure into the transformer and provides a pure transformer backbone for dense prediction tasks. Transformer in Transformer (TNT) [12] encodes the input data into powerful features by means of an attention mechanism: it first divides the input image into several local blocks and then computes two levels of representation and their relationships. CvT [13] introduces convolution into ViT to improve the efficiency and performance of the original ViT, taking advantage of the strengths of both designs. Swin [14] proposes a hierarchical transformer whose representation is computed with shifted windows; limiting attention to local windows brings greater efficiency.
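As a sketch of how window attention limits the scope of self-attention, the helper below partitions a feature map into non-overlapping windows so that attention can be computed inside each one; the 7x7 window size is the value commonly used by Swin and is only an illustrative choice here.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """Reshape a (B, H, W, C) feature map into (num_windows * B, ws*ws, C)
    so self-attention is computed only among tokens in the same window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# A 56x56 feature map with 7x7 windows gives 64 windows of 49 tokens each.
windows = window_partition(torch.randn(1, 56, 56, 96), window_size=7)
print(windows.shape)  # torch.Size([64, 49, 96])
```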
In contrast to previous work, we approach training from the perspective of normalization by adopting DTN. We show that DTN enables the model to capture both global and local information. Meanwhile, we adopt a hybrid window attention module to reduce computational complexity.
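The sketch below illustrates the pooling-attention idea used in our hybrid window attention: keys and values are down-sampled by local aggregation before a standard attention computation, which shrinks the attention matrix while keeping its scope global. The use of average pooling, the stride, and the single-head formulation are simplifying assumptions for illustration and not the exact DHT design.

```python
import torch
import torch.nn as nn

class PoolingAttention(nn.Module):
    """Single-head self-attention in which key and value tokens are reduced
    by average pooling (local aggregation), so the N x N attention matrix
    becomes N x (N / stride**2) while still covering the whole feature map."""
    def __init__(self, dim: int, stride: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride)
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        B, N, C = x.shape                       # N == H * W tokens
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def aggregate(t):                       # (B, N, C) -> (B, N', C)
            t = t.transpose(1, 2).reshape(B, C, H, W)
            t = self.pool(t)                    # local aggregation
            return t.flatten(2).transpose(1, 2)

        k, v = aggregate(k), aggregate(v)       # fewer key/value tokens
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)

out = PoolingAttention(dim=96)(torch.randn(1, 56 * 56, 96), H=56, W=56)
print(out.shape)  # torch.Size([1, 3136, 96])
```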
Attention Module
Attention mechanisms can build long-term dependencies and improve feature representation, and they have become an effective approach for many visual tasks. Attention in the current field of visual recognition falls into three categories: temporal, spatial and gated attention [17]. SENet [18] adaptively recalibrates channel-wise feature responses by exploring the importance of each channel of the convolutional network, thus explicitly modelling the interdependencies between channels. Non-local operations [19] compute the response at a position as a weighted sum of all positions in the feature map, establishing contextual dependencies between pixels. Gated Attention Networks [20] use convolutional sub-networks to control the importance of each attention head. Self-attention, as used in Swin transformers [14], has quadratic complexity with respect to the number of tokens it attends over. This problem is more severe for industrial detection work, as it directly affects inspection efficiency.
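For reference, the complexity analysis reported in the Swin paper [14] for an h x w feature map with C channels and M x M windows makes this trade-off explicit; the window-based term grows linearly rather than quadratically in the number of tokens hw.

```latex
% Global multi-head self-attention vs. window-based self-attention (Swin [14])
\Omega(\mathrm{MSA})        = 4\,hwC^{2} + 2\,(hw)^{2}C   % quadratic in hw
\Omega(\text{W-MSA})        = 4\,hwC^{2} + 2\,M^{2}hwC    % linear in hw for fixed M
```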
These methods achieve significant performance improvements in vision tasks by modelling global contextual information and capturing long-range dependencies. However, for industrial defect image detection, where local information is important, it is not easy to reduce computational complexity while improving detection accuracy.
In our model, we propose a hybrid window attention that combines pooling attention and window attention. Introducing hybrid window attention reduces computational complexity: it controls the cost of self-attention by reducing the size of the query, key and value. Pooling attention keeps a global self-attention computation by downsampling features through local aggregation, while window attention