Hardware configuration

The primary hardware is a dataglove consisting of three units: a sensing unit, a processing unit, and an onboard power regulation unit. The sensing unit comprises five 2.2″ flex sensors (SEN-10264) and an IMU (MPU-6050) that has a triaxial accelerometer and a triaxial gyroscope. The overall hardware configuration is illustrated in Fig. 1.

Figure 1 The dataglove architecture: On the left, we have the glove with all the mounted sensors and electronics. A flex sensor is shown in the top right corner. The components of the main controller board are shown in the bottom right corner. It consists of an ESP32 microcontroller, an MPU-6050 IMU, and some complementary electronics.

Sensing unit

The flex sensors are variable resistors with a flat resistance of \(25\;{\text{k}}\Omega \;\left( { \pm \;30\% } \right)\), placed above the five fingers of the dataglove in fabric pockets to sense finger flexion. Each flex sensor forms a voltage divider with a \(0.25\;{\text{W}}\), \(100\;{\text{k}}\Omega\) (\(\pm \;5\%\)) resistor, which converts the change in resistance during finger flexion into a change in voltage across the sensor that the processing unit can sample51.
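As a rough illustration of the divider described above, the following sketch computes the voltage across a flex sensor and the corresponding ADC code. The 3.3 V divider supply, the ideal linear 12-bit converter referenced to that supply, and the function names are assumptions made for illustration; the text specifies only the resistor values.

```python
def divider_voltage(r_flex_ohm, r_fixed_ohm=100e3, v_supply=3.3):
    """Voltage across the flex sensor in a two-resistor divider.

    r_fixed_ohm follows the 100 kOhm resistor in the text; v_supply = 3.3 V
    is an assumption (the regulated rail), not a value stated for the divider.
    """
    return v_supply * r_flex_ohm / (r_flex_ohm + r_fixed_ohm)


def adc_counts(voltage, v_ref=3.3, bits=12):
    """Ideal ADC code, assuming a linear converter with full scale v_ref."""
    full_scale = (1 << bits) - 1  # 4095 for a 12-bit converter
    clamped = max(0.0, min(voltage, v_ref))
    return round(clamped / v_ref * full_scale)


flat = divider_voltage(25e3)  # nominal flat resistance from the text
print(f"{flat:.2f} V -> {adc_counts(flat)} counts")
```

Under these assumptions, a larger sensor resistance during flexion raises the voltage across the sensor, so the ADC code increases with bending.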

The accelerometer and gyroscope of the IMU are configured to track linear acceleration within \(\pm \;19.6\;{\text{m}}\;{\text{s}}^{-2}\) and angular velocity within \(\pm \;4.36\;{\text{rad}}\;{\text{s}}^{-1}\), respectively, so that human hand motion falls well within the measurable range. Moreover, the IMU contains a Digital Motion Processor (DMP) that derives quaternions on-chip from the accelerometer and gyroscope data and thus provides hand orientation data along with the motion information52.
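The configured ranges correspond to the ±2 g and ±250 °/s full-scale settings of the MPU-6050. The sketch below converts raw 16-bit readings to SI units using the sensitivity figures for those settings (16384 LSB/g and 131 LSB per °/s). The function names and the value used for standard gravity are illustrative choices, and the text does not state how the firmware performs this conversion.

```python
import math

G0 = 9.80665               # standard gravity in m/s^2 (assumed constant)
ACCEL_LSB_PER_G = 16384.0  # MPU-6050 sensitivity at the +/-2 g setting
GYRO_LSB_PER_DPS = 131.0   # MPU-6050 sensitivity at the +/-250 deg/s setting


def accel_raw_to_ms2(raw):
    """Convert a raw accelerometer count to metres per second squared."""
    return raw / ACCEL_LSB_PER_G * G0


def gyro_raw_to_rads(raw):
    """Convert a raw gyroscope count to radians per second."""
    return math.radians(raw / GYRO_LSB_PER_DPS)


# Largest positive 16-bit reading, i.e. the edge of the configured range.
print(accel_raw_to_ms2(32767), gyro_raw_to_rads(32767))
```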

Processing unit

The processing unit is a WiFi-enabled development module, the DOIT ESP32 Devkit V1, which has a Tensilica Xtensa LX microprocessor with a maximum clock frequency of \(240\;{\text{MHz}}\). Its 12-bit analog-to-digital converter (ADC), with a maximum sampling rate of 200 kilosamples per second, samples the flex sensors’ analog data with sufficient resolution. Moreover, the module can communicate with external computers via USB, which enables wired data communication53.

Onboard power regulation

The ESP32 module and the IMU have an operating voltage of \(3.3\;{\text{V}}\)52,53. On the other hand, the flex sensors do not have a strict operating voltage51. Hence, we used an LM1117 low-dropout (LDO) 3.3 V linear voltage regulator to regulate the supply voltage from the \(3.7\;{\text{V}}\) single-cell LiPo battery. Moreover, we used \(10\) μF and \(100\) μF filtering capacitors to filter out the supply noise.

Dataset

Overview

We explored 40 signs from the standard ASL dictionary, comprising 26 letters and 14 words. Among these signs, 24 require only a certain finger flexion and no hand motion; hence, they are referred to as static signs or gestures. Conversely, the remaining 16 signs need hand motion alongside finger flexion to portray a meaningful expression according to the ASL dictionary. We collected the signs from 25 subjects (19 male and 6 female) in separate data recording sessions with a consistent protocol. Overall, the dataset records three acceleration channels in each of the body and earth axes, three angular velocity channels, four quaternion channels, and five flex sensor channels.

The data were recorded by the dataglove processing unit, which was connected via USB to a laptop for data storage. The sampling frequency was set to 100 Hz and each gesture was repeated 10 times to capture the performance variability of each subject. However, during a few sessions, which are identified in the dataset supplementary information, the laptop charger was connected, which introduced AC-induced noise throughout the data recorded in those sessions.
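The nominal size of the dataset follows from the figures quoted in this section. The sketch below works through that arithmetic, assuming that every subject recorded every sign for all repetitions and that each 1.5 s recording window (from the recording protocol) yields exactly 150 samples at 100 Hz. The channel names are illustrative labels, and the actual files may deviate from these nominal counts.

```python
SUBJECTS, SIGNS, REPETITIONS = 25, 40, 10
SAMPLE_RATE_HZ, WINDOW_S = 100, 1.5

# Channel groups as listed in the overview; the key names are illustrative.
CHANNELS = {
    "body_acceleration": 3,
    "earth_acceleration": 3,
    "angular_velocity": 3,
    "quaternion": 4,
    "flex": 5,
}

samples_per_window = round(SAMPLE_RATE_HZ * WINDOW_S)
recordings = SUBJECTS * SIGNS * REPETITIONS
channels_total = sum(CHANNELS.values())

print(samples_per_window, recordings, channels_total)
```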

Data recording protocol

Before starting the recording process, each subject signed an approval form for the use of their data in this research and was briefed about the data recording steps. As the subjects were not familiar with the signs before the study, they were taught each sign before the data recording using online video materials54. The data were recorded by the dataglove and stored on the laptop at the same time. Hence, a Python script on the laptop performed the handshake between the two devices and stored the data in separate folders by sign and subject.

At the beginning of each data recording session, the subjects were prompted to declare their subject ID and the gesture name. Afterward, a five-second countdown was displayed on the laptop screen for preparation. Each instance of the gesture data was recorded in a 1.5 s window, which was long enough for the subjects to perform their gesture once. In a single gesture recording session, this process was repeated 10 times. The gesture recording flow for each session is shown in Fig. 2. All methods were carried out following the relevant guidelines, and the correctness of gestures was evaluated by visual inspection. All experimental protocols were approved by the University of Dhaka, Dhaka, Bangladesh. Informed consent was obtained from all subjects.

Figure 2 The flowchart showing the data collection protocol. The diagram shows all the different steps of the data collection process. This protocol was followed during the data collection for all the subjects.

Data preprocessing

Gravity compensation

The triaxial accelerometer of the IMU records acceleration in the body axis, which includes the effect of gravity. Hence, the gravity component has to be removed from the recorded raw acceleration to interpret the actual motion characteristics of the dataglove. The gravity vector can be derived from the orientation of the dataglove. Quaternions express the 3D orientation of an object and are a robust alternative to Euler angles, which are often affected by gimbal lock55. The digital motion processor (DMP) of the MPU-6050 processes the raw acceleration and angular velocity internally and produces quaternions. A quaternion can be expressed by Eq. (1).

$${\varvec{Q}} = q_{w} + {\mathbf{q}} = q_{w} + q_{x} \hat{i} + q_{y} \hat{j} + q_{z} \hat{k}$$ (1)

where \({\varvec{Q}}\) stands for a quaternion that contains a scalar, \({q}_{w}\), and a vector, \(\mathbf{q}\left({q}_{x},{q}_{y},{q}_{z}\right)\). The overall gravity compensation process is described in Eqs. (2) and (3)56.

$$\left[ {\begin{array}{*{20}c} {g_{x} } \\ {g_{y} } \\ {g_{z} } \\ \end{array} } \right] = \left\| g \right\|\left[ {\begin{array}{*{20}c} {2(q_{x} q_{z} - q_{w} q_{y} )} \\ {2(q_{w} q_{x} + q_{y} q_{z} )} \\ {q_{w}^{2} - q_{x}^{2} - q_{y}^{2} + q_{z}^{2} } \\ \end{array} } \right]$$ (2)

$$\left[ {\begin{array}{*{20}c} {la_{x} } \\ {la_{y} } \\ {la_{z} } \\ \end{array} } \right] = \left[ {\begin{array}{*{20}c} {a_{x} } \\ {a_{y} } \\ {a_{z} } \\ \end{array} } \right] - \left[ {\begin{array}{*{20}c} {g_{x} } \\ {g_{y} } \\ {g_{z} } \\ \end{array} } \right]$$ (3)

where \({\varvec{g}}\left({g}_{x},{g}_{y},{g}_{z}\right)\), \({\varvec{Q}}\left({q}_{w}, {q}_{x},{q}_{y},{q}_{z}\right),\) \({\varvec{l}}{\varvec{a}}\left({la}_{x},{la}_{y},{la}_{z}\right)\), and \({\varvec{a}}\left({a}_{x},{a}_{y},{a}_{z}\right)\) denote the gravity vector, quaternion, linear acceleration vector, and raw acceleration vector, respectively. The resultant linear acceleration (\({\varvec{l}}{\varvec{a}}\)) represents the body axis acceleration compensated for the gravity offset. This step was done in the processing unit of the dataglove.
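Eqs. (2) and (3) can be sketched in a few lines. The example below assumes a unit quaternion ordered as (w, x, y, z) and a gravity magnitude expressed in the same units as the raw acceleration; the function names and the default magnitude are illustrative, and this is not the firmware code that ran on the dataglove.

```python
def gravity_from_quaternion(q, g_norm=9.80665):
    """Eq. (2): body-axis gravity vector from a unit quaternion (w, x, y, z)."""
    qw, qx, qy, qz = q
    return (
        g_norm * 2.0 * (qx * qz - qw * qy),
        g_norm * 2.0 * (qw * qx + qy * qz),
        g_norm * (qw * qw - qx * qx - qy * qy + qz * qz),
    )


def linear_acceleration(a, q, g_norm=9.80665):
    """Eq. (3): subtract the gravity vector from the raw acceleration."""
    g = gravity_from_quaternion(q, g_norm)
    return tuple(ai - gi for ai, gi in zip(a, g))


# A glove at rest in its reference orientation reads only gravity on z,
# so the compensated acceleration should vanish.
print(linear_acceleration((0.0, 0.0, 9.80665), (1.0, 0.0, 0.0, 0.0)))
```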

Axis rotation

The recorded raw acceleration and the gravity-compensated linear acceleration were both in the body axis of the dataglove, and the body axis depends on the initial orientation of the dataglove when it powers up. This dependency on the initial orientation is problematic for real-world applications. Hence, we converted the triaxial acceleration vector from the body axis to the North-East-Down (NED) coordinate system, whose directions are defined relative to the earth itself57. First, a rotation matrix was calculated using the quaternions. Afterward, the NED linear acceleration was derived by multiplying the rotation matrix with the body axis linear acceleration. Equations (4) and (5) show this axis transformation process using quaternions58.

$${\mathbf{R}} = \left[ {\begin{array}{*{20}c} {1 - 2(q_{y}^{2} + q_{z}^{2} )} & {2(q_{x} q_{y} - q_{w} q_{z} )} & {2(q_{x} q_{z} + q_{w} q_{y} )} \\ {2(q_{x} q_{y} + q_{w} q_{z} )} & {1 - 2(q_{x}^{2} + q_{z}^{2} )} & {2(q_{y} q_{z} - q_{w} q_{x} )} \\ {2(q_{x} q_{z} - q_{w} q_{y} )} & {2(q_{y} q_{z} + q_{w} q_{x} )} & {1 - 2(q_{x}^{2} + q_{y}^{2} )} \\ \end{array} } \right]$$ (4)

$$\left[ {\begin{array}{*{20}c} {LA_{x} } \\ {LA_{y} } \\ {LA_{z} } \\ \end{array} } \right] = {\mathbf{R}}\left[ {\begin{array}{*{20}c} {la_{x} } \\ {la_{y} } \\ {la_{z} } \\ \end{array} } \right]$$ (5)

where \(\mathbf{R}\), \({\varvec{Q}}\left({q}_{w}, {q}_{x},{q}_{y},{q}_{z}\right)\), \({\varvec{L}}{\varvec{A}}\left({LA}_{x},{LA}_{y},{LA}_{z}\right)\), and \({\varvec{l}}{\varvec{a}}\left({la}_{x},{la}_{y},{la}_{z}\right)\) stand for the rotation matrix, quaternion, NED linear acceleration, and the body axis linear acceleration, respectively. Similar to the previous step, this axis transformation is also done in the processing unit of the dataglove. Figure 3 illustrates the axial diagram of the dataglove and the axis rotation.
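The axis transformation can be illustrated with the standard rotation matrix of a unit quaternion. The sketch below assumes the quaternion is normalized and ordered as (w, x, y, z); the function names are illustrative. Whether the resulting frame is exactly NED depends on how the reference frame of the DMP is aligned, which the code does not settle.

```python
def rotation_matrix(q):
    """Standard rotation matrix for a unit quaternion (w, x, y, z)."""
    qw, qx, qy, qz = q
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qw * qz), 2 * (qx * qz + qw * qy)],
        [2 * (qx * qy + qw * qz), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qw * qx)],
        [2 * (qx * qz - qw * qy), 2 * (qy * qz + qw * qx), 1 - 2 * (qx * qx + qy * qy)],
    ]


def rotate(q, v):
    """Eq. (5): multiply the rotation matrix by a 3-vector."""
    r = rotation_matrix(q)
    return tuple(sum(r[i][j] * v[j] for j in range(3)) for i in range(3))


# A 90 degree rotation about z maps the x axis onto the y axis.
s = 0.5 ** 0.5
print(rotate((s, 0.0, 0.0, s), (1.0, 0.0, 0.0)))
```

The third row of this matrix, scaled by the gravity magnitude, reproduces the gravity vector of Eq. (2), which is a quick consistency check between the two steps.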

Figure 3 The IMU orientation diagram: On the left, we have the X, Y, and Z coordinates of the MPU-6050. Along these 3 axes, the accelerometer and gyroscope values are recorded. The figure on the right shows the body axis to earth axis conversion diagram.

Rolling filters

On closer inspection, we found a few random spikes in the IMU data. Hence, we first removed these spikes using a rolling median filter of 10 data points. After the spike removal, we applied an additional moving average filter only to the specific sessions where the recordings were subjected to AC-induced noise, which resulted in comparable waveforms for all data recordings. The implementation of the moving average filter is shown in Eq. (6)59:

$$y\left[ n \right] = \frac{1}{N}\mathop \sum \limits_{k = 0}^{N - 1} x\left[ {n - k} \right]$$ (6)

where \(x\left[n\right]\) is the input signal, \(N\) stands for the number of data points, and \(y\left[n\right]\) denotes the output signal. After applying the rolling average, there were a few null values at the end of each signal frame, which were replaced by the nearest values in that signal. According to the data recording protocol, the gestures were performed in the middle of each 1.5-s window. Hence, replacing the few terminal data points with the nearest available valid data point does not change the signal morphology. Lastly, we applied another rolling average filter of 10 data points, this time to the whole dataset, to further smooth the signal, and again replaced the terminal null values with the nearest valid data point in each frame.
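The filtering chain can be sketched as follows. The example implements Eq. (6) as a trailing window, so samples without a full window are undefined and are filled with the nearest valid value, as described in the text. The window alignment, the function names, and the use of a single averaging pass are assumptions for illustration; the original processing may have used a library routine with a different alignment.

```python
from statistics import fmean, median


def rolling(values, window, reducer):
    """Apply reducer over a trailing window; incomplete windows give None."""
    out = []
    for n in range(len(values)):
        if n + 1 < window:
            out.append(None)
        else:
            out.append(reducer(values[n + 1 - window : n + 1]))
    return out


def fill_nearest(values):
    """Replace None entries with the nearest valid value in the same frame."""
    valid = [i for i, v in enumerate(values) if v is not None]
    if not valid:
        raise ValueError("frame is shorter than the filter window")
    return [values[min(valid, key=lambda j: abs(j - i))] for i in range(len(values))]


def despike_and_smooth(frame, window=10):
    """Rolling median to remove spikes, then a moving average (Eq. 6)."""
    despiked = fill_nearest(rolling(frame, window, median))
    return fill_nearest(rolling(despiked, window, fmean))


frame = [1.0] * 20
frame[10] = 50.0  # a single spike
print(despike_and_smooth(frame))
```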

Normalization

The processed acceleration and flex sensor data are not in the same range. Data normalization is widely practiced before AI-based classification because it improves the convergence of the loss function60. Hence, we used min–max scaling as the normalization technique with a range of \(\left[ {0,1} \right]\). It is shown in Eq. (7)61:

$$x_{normalized} = \frac{{x - x_{\min } }}{{x_{\max } - x_{\min } }}$$ (7)

where \(x\) is the input and \({x}_{normalized}\) is the normalized output. \({x}_{\mathrm{max}}\) and \({x}_{\mathrm{min}}\) respectively denote the maximum and minimum values of the input.
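A minimal sketch of Eq. (7) is given below. The text does not state whether the minimum and maximum are taken per frame, per channel, or over the whole dataset, so the example simply scales one sequence; mapping a constant sequence to zeros is an assumption added to avoid division by zero.

```python
def min_max_scale(values):
    """Scale a sequence to [0, 1] using its own minimum and maximum (Eq. 7)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant input: Eq. (7) is undefined, so return zeros (assumption).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


print(min_max_scale([2.0, 4.0, 6.0]))
```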

Spatial projection images generation

There are several challenges associated with dynamic sign language recognition. In our case, the temporal dependency and the size of the hand were the most challenging issues. A signer can perform a sign at many different speeds. Moreover, the speed does not match up from signer to signer. To successfully recognize signs from all the subjects, first, this temporal dependency needs to be removed from the signals. The second challenge was the hand size of the signer which introduced variability in the gestures performed by different signers. In the proposed method, we tried to eliminate these two issues by utilizing the Spatial Projection Images of the dynamic gestures. However, the static gestures do not generate a meaningful pattern in the projections due to their stationary nature. Hence, this step is omitted for static signs.

When interpreting a sign, the speed of performing the sign and the signer’s hand size do not matter. The spatial pattern created by the motion of the signer’s hand defines the sign. As long as the pattern is correct, the sign will be considered valid regardless of its temporal and spatial states. To capture this pattern of sign language gestures, we utilized the accelerometer sensor data from our device. Using Eqs. (8) and (9), we converted the 3D acceleration into 3D displacement vectors. These vectors represent the path followed by the hand in 3D space during the performance of the gesture.

$$\mathop \smallint \limits_{{t_{1} }}^{{t_{2} }} a\left( t \right)dt = v\left( {t_{2} } \right) - v\left( {t_{1} } \right)$$ (8)

$$\mathop \smallint \limits_{{t_{1} }}^{{t_{2} }} v\left( t \right)dt = x\left( {t_{2} } \right) - x\left( {t_{1} } \right)$$ (9)

These 3D displacement vectors were then projected onto the XY, YZ, and ZX 2D planes. If the vectors are projected onto these planes for the entire timeframe of the sign, the projections form a 2D path that captures the pattern of the sign in the 3 planes as shown in Fig. 4. No matter at which speed the gesture was performed, these 2D projections of the gesture always provide similar patterns. Hence the temporal dependency is eliminated in this process.
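The double integration and the plane projections can be sketched as below. The example integrates numerically with the trapezoidal rule and assumes zero initial velocity and position at the start of each window, with a 0.01 s sample period from the 100 Hz sampling rate. The integration rule, the initial conditions, and the function names are assumptions; the text does not specify them.

```python
def cumulative_trapezoid(samples, dt):
    """Running integral of uniformly sampled data, starting from zero."""
    if not samples:
        return []
    out = [0.0]
    for prev, cur in zip(samples, samples[1:]):
        out.append(out[-1] + 0.5 * (prev + cur) * dt)
    return out


def displacement(acc_xyz, dt=0.01):
    """Integrate each acceleration axis twice (Eqs. 8 and 9)."""
    return [cumulative_trapezoid(cumulative_trapezoid(axis, dt), dt) for axis in acc_xyz]


def project(disp_xyz):
    """Pair the displacement axes into the three 2D projection planes."""
    x, y, z = disp_xyz
    return {"XY": (x, y), "YZ": (y, z), "ZX": (z, x)}


# Constant 2 m/s^2 along x for one second gives roughly 1 m of displacement.
acc = [[2.0] * 101, [0.0] * 101, [0.0] * 101]
planes = project(displacement(acc))
print(planes["XY"][0][-1])
```

Because only the traced path is kept, performing the same gesture faster or slower changes how the samples are spaced along the curve but not the curve itself.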

Figure 4 Spatial projection generation process. We start with the 3-axis acceleration and then convert them into 3-axis displacement vectors. These vectors are projected onto the 2D spatial planes to generate the projection images.

After capturing the pattern of a particular gesture, we normalize the projections using the maximum and minimum values along each axis. In this way, the projections from different signers result in similar patterns regardless of hand size.

The projections were generated using the Python Matplotlib62 library: the components of the displacement were calculated along the 3 axes and plotted 2 at a time for the three planes (XY, YZ, and ZX). We used a line plot with the “linewidth” parameter set to 7 and the line color set to black. This resulted in 3 grayscale images, one per projection plane, for each gesture. The images were then resized to 224 × 224 pixels and used as the input to our proposed model.
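A projection image of this kind can be rendered off-screen with Matplotlib and NumPy, as sketched below. The example scales each axis to [0, 1] with its own minimum and maximum, draws a black line of width 7, and renders directly at 224 × 224 pixels instead of resizing afterwards, which is a shortcut. The figure size, margins, grayscale conversion, and function name are assumptions, not the settings used to build the dataset.

```python
import matplotlib

matplotlib.use("Agg")  # off-screen rendering, no display required
import matplotlib.pyplot as plt
import numpy as np


def render_projection(u, v, size_px=224, dpi=112):
    """Render one 2D projection as a size_px x size_px grayscale array."""
    def scale(a):
        a = np.asarray(a, dtype=float)
        span = a.max() - a.min()
        # A flat axis has no extent; centre it instead of dividing by zero.
        return (a - a.min()) / span if span > 0 else np.full_like(a, 0.5)

    fig = plt.figure(figsize=(size_px / dpi, size_px / dpi), dpi=dpi)
    ax = fig.add_axes([0, 0, 1, 1])  # axes fill the whole canvas
    ax.plot(scale(u), scale(v), color="black", linewidth=7)
    ax.set_xlim(-0.1, 1.1)
    ax.set_ylim(-0.1, 1.1)
    ax.axis("off")
    fig.canvas.draw()
    rgba = np.array(fig.canvas.buffer_rgba())  # copy before closing the figure
    plt.close(fig)
    return rgba[..., :3].mean(axis=2).astype(np.uint8)


image = render_projection([0, 1, 2, 3], [0, 1, 0, 1])
print(image.shape)
```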

The proposed architecture

In this section, we present the network architecture of our proposed framework (Fig. 5). We have used two variations of the architecture for static and dynamic signs.

Figure 5 Proposed network architectures: (a) Overall diagram of the proposed architecture. For static gestures, the sensor channels are processed by parallel 1D ConvNet blocks. For dynamic gestures, the accelerations are first converted into spatial projection images and features are extracted from them using the pre-trained MobileNetV2 network, (b) the architecture of the 1D ConvNet Blocks, and (c) the architecture of MobileNetV2.

Architecture for static gestures

As mentioned in the Data Preprocessing subsection, Spatial Projection Images are not used for static gestures. The normalized time series channels are passed to separate 1D ConvNet blocks to produce embeddings. These embeddings are afterward concatenated in a fully connected layer, which in turn makes the prediction. Figure 5a shows the stacked 1D ConvNet block architecture for static gesture detection.

Architecture for dynamic gestures

We have utilized two different types of signals for the input to our model. First, we have the 3 spatial projection images generated from the acceleration data. Then we also have the 1D time-series signals from the flex sensors. So, in total, we have 8 channels of input data with 3 image channels and 5 time-series signal channels. Each of these channels was processed using separate ConvNet blocks to produce the embeddings from that particular channel. For the static gestures, the 8 time-series signals were processed using the parallel path ConvNet architecture shown in Fig. 5b. On the other hand, the projection images were processed by a 2D ConvNet architecture (MobileNetV263) as shown in Fig. 5c. The architectural details of these two ConvNet blocks are discussed below.

1D ConvNet block

The 1D ConvNet blocks are composed of 4 convolution layers. Each pair of convolution layers is followed by a BatchNormalization layer and a MaxPooling layer. In the convolution layers, the kernel size was set to 3, the stride to 1, and the padding to 1. The MaxPooling kernel size was set to 2 and the ReLU activation function was used. After the 4 convolution layers, a fully connected layer with 50 neurons was used to extract the embeddings.
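The effect of these settings on the sequence length can be traced with a short calculation. The sketch below assumes an input of 150 samples (a 1.5 s window at 100 Hz), a pooling stride equal to the pooling kernel size, and floor rounding; these details and the function names are assumptions, since the text does not give the input length or the pooling stride.

```python
def conv1d_length(length, kernel=3, stride=1, padding=1):
    """Output length of a 1D convolution."""
    return (length + 2 * padding - kernel) // stride + 1


def pool1d_length(length, kernel=2):
    """Output length of max pooling, assuming stride equals kernel size."""
    return length // kernel


def block_lengths(length=150):
    """Sequence length at the input and after each conv-conv-pool stage."""
    trace = [length]
    for _ in range(2):  # two pairs of convolution layers
        length = conv1d_length(conv1d_length(length))
        length = pool1d_length(length)
        trace.append(length)
    return trace


print(block_lengths())
```

With kernel size 3, stride 1, and padding 1, the convolutions preserve the length, so only the two pooling layers shorten the sequence before the 50-neuron fully connected layer.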

2D ConvNet block

The 2D ConvNet blocks are constructed using the MobileNetV264 architecture. MobileNet is an efficient architecture for mobile and embedded vision applications. It utilizes depthwise separable convolutions65 to significantly reduce the computational burden compared to regular convolution. In depthwise separable convolution, each of the channels is processed with the convolution filters separately and the resultants are combined using a 1 × 1 pointwise convolution. This is known as factorization and it drastically reduces the computation and model size.
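The saving from this factorization can be checked with a multiply-accumulate count. The sketch below compares a standard convolution with a depthwise separable one for a square feature map; the layer sizes in the example are arbitrary illustrative values, not taken from the network used here.

```python
def standard_conv_cost(k, c_in, c_out, feature_size):
    """Multiply-accumulates of a standard k x k convolution."""
    return k * k * c_in * c_out * feature_size * feature_size


def separable_conv_cost(k, c_in, c_out, feature_size):
    """Multiply-accumulates of a depthwise k x k plus a pointwise 1 x 1 convolution."""
    depthwise = k * k * c_in * feature_size * feature_size
    pointwise = c_in * c_out * feature_size * feature_size
    return depthwise + pointwise


k, c_in, c_out, size = 3, 32, 64, 112  # illustrative layer dimensions
ratio = separable_conv_cost(k, c_in, c_out, size) / standard_conv_cost(k, c_in, c_out, size)
print(f"separable / standard = {ratio:.4f}")
```

The ratio simplifies to 1/c_out + 1/k², so with 3 × 3 kernels the separable form needs roughly one eighth to one ninth of the computation once the number of output channels is large.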

The MobileNetV263 is the result of improvements made to the original MobileNet architecture. It uses an inverted residual structure66 where the skip connections are between the thin bottleneck layers, which improves the performance compared to the classical structure. The MobileNetV2 architecture starts with a regular convolution layer with 32 filters followed by 19 residual bottleneck layers. The kernel size was set to 3 × 3 and ReLU6 was used as the activation function64.

We used the TensorFlow67 Python library to implement the proposed network. For the loss function, we used the Sparse Categorical Cross-Entropy loss. The loss was minimized using the Adam68 optimizer with a learning rate of 0.0001. The network was trained for a maximum of 300 epochs with an early stopping criterion set on the validation loss with a tolerance of 30 epochs.
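The loss and the stopping rule can be illustrated without the deep learning framework. The sketch below computes the sparse categorical cross-entropy for one sample from raw logits and applies a patience-based stopping rule to a list of validation losses. It interprets the tolerance of 30 epochs as a patience of 30 epochs without improvement, which is an assumption, and the function names are illustrative; it is not the training code of the proposed network.

```python
import math


def sparse_categorical_cross_entropy(logits, label):
    """Negative log-softmax probability of the integer class label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def stopping_epoch(val_losses, patience=30, max_epochs=300):
    """Epoch (1-indexed) at which training ends under early stopping."""
    best, wait = math.inf, 0
    losses = val_losses[:max_epochs]
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(losses)


print(sparse_categorical_cross_entropy([0.0, 0.0, 0.0, 0.0], 2))
print(stopping_epoch([1.0, 0.9, 0.8] + [0.85] * 100))
```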

Ethical approval

We obtained written consent from all the subjects participating in the data collection process. The consent form stated that the data would be used only for research purposes. Moreover, the dataset does not contain any personal information about the subjects other than their sex and age.