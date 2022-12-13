Model selection

Many machine learning methods can be used to solve classification, clustering, and regression problems. The challenge is to determine which combination of model and hyperparameters performs best on a particular dataset. In this setting, the optimization ranges over several learning algorithms (models) and their hyperparameters: many hyperparameter combinations must be generated and evaluated, and the combination that yields the best predictive accuracy is retained. Grid search finds this combination by exhaustively evaluating every combination on a predefined grid of values. The scikit-learn library’s “GridSearchCV” function performs this exhaustive search: the candidate values of all hyperparameters to be tuned are passed to GridSearchCV, which then fits a model with the best hyperparameter combination for the given input and output variables31. Seven models are used in this study, and each is briefly described below: random forest, support vector machine (SVM), gradient boosting, extra trees, extreme gradient boosting (XGB), and two ANNs (MLP and RBF).
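The grid search described above can be sketched as follows. This is a minimal illustration, not the configuration used in the study: the synthetic data, the grid values, the 3-fold cross-validation, and the choice of RandomForestRegressor as the estimator are all assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data (assumption: stands in for the study's dataset).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(120, 3))
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + 0.05 * rng.normal(size=120)

# Hypothetical grid: every combination of these values is evaluated (2 x 2 = 4).
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_grid=param_grid,
    scoring="r2",  # R2 is the selection criterion named in the text
    cv=3,          # 3-fold cross-validation (assumed)
)
search.fit(X, y)

best_params = search.best_params_    # best combination found on the grid
best_model = search.best_estimator_  # refitted on all data with best_params
```

With the default `refit=True`, `best_estimator_` is already trained on the full data and can be used directly for prediction.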

Isolation forest

This model is an efficient algorithm for outlier detection. The algorithm builds a random forest in which each decision tree is grown randomly: at each node, it picks a feature at random and then picks a random threshold value (between the minimum and maximum values of that feature) to split the dataset in two. The dataset is gradually chopped into pieces in this way until every instance is isolated from the others. Anomalies usually lie far from other instances, so on average (over all the decision trees) they tend to be isolated in fewer steps than normal instances.
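As a hedged illustration of this idea, the sketch below applies scikit-learn's IsolationForest to synthetic data. The data, the two planted outliers, and the `contamination` value are assumptions for the example; the text does not state how the model was configured in the study.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (assumption): a Gaussian cloud plus two far-away points.
rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
planted_outliers = np.array([[8.0, 8.0], [-9.0, 7.5]])
X = np.vstack([inliers, planted_outliers])

# contamination is the assumed fraction of outliers in the data.
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = forest.fit_predict(X)    # +1 for inliers, -1 for flagged outliers
scores = forest.score_samples(X)  # lower score = isolated in fewer splits
```

Instances that are isolated after few random splits receive low scores, so the planted points should score well below the bulk of the data.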

Support vector machine (SVM) regression

SVM is a supervised machine learning technique that can be used for classification and regression tasks. In contrast to many ML algorithms, whose goal is to minimize a cost function, the primary goal of SVM is to maximize the margin that a separating hyperplane leaves between the support vectors32. It covers linear and nonlinear classification as well as linear and nonlinear regression. The key to using SVMs for regression rather than classification is to reverse the goal: instead of keeping instances out of the margin, the model tries to fit as many instances as possible inside it. In this work, SVM regression was performed with the SVR class from the svm module of the scikit-learn API.
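A minimal sketch of SVM regression with the SVR class is given below. The synthetic data and the values of `kernel`, `C`, and `epsilon` are assumptions for illustration and are not the settings reported in the study.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic one-dimensional regression problem (assumption).
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0.0, 5.0, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + 0.05 * rng.normal(size=80)

# epsilon is the half-width of the margin ("tube") around the prediction;
# errors smaller than epsilon are not penalised.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)
svr.fit(X, y)

y_pred = svr.predict(X)
inside_tube = np.abs(y - y_pred) <= svr.epsilon + 1e-6
n_support = len(svr.support_)  # instances on or outside the tube
```

Only the instances on or outside the tube become support vectors, which is the regression counterpart of the margin in classification.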

Random forest

Random forest is a simple machine learning algorithm that typically produces excellent results even when its hyperparameters are not tuned. Because of its simplicity and broad applicability, it is among the most widely used ML algorithms for both regression and classification. The algorithm divides the input features into random subsets, each of which is used to grow a decision tree; a fitting function is then developed for each tree on the randomly chosen features. The random forest model is the collection of trees obtained at the end of the training procedure. Every tree is built from randomly chosen input vectors during training, hence the name “random” forest33. This model was implemented with the RandomForestRegressor class from the ensemble module of the scikit-learn API. Figure 5 illustrates schematically how the random forest model works.

Figure 5 Schematic diagram of random forest procedure.

In Fig. 5, \({\hat{\text{r}}}\left( {X,V} \right)\) is the representative tree at the end of the training phase, X is the set of input feature vectors, V is the collective set of input–output pairs \(V_i = (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\), and k is the number of trees.
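The sketch below illustrates how the k trees of a random forest are combined for regression, using the RandomForestRegressor class named above. The synthetic data and the values of `n_estimators` and `max_features` are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(150, 4))
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + 0.1 * rng.normal(size=150)

k = 50  # number of trees
forest = RandomForestRegressor(n_estimators=k, max_features=0.5, random_state=0)
forest.fit(X, y)

# Each fitted tree predicts on its own; the forest output is their average.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
forest_prediction = forest.predict(X)
```

Averaging many randomized trees is what reduces the variance of the individual trees.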

Extra trees regressor

Extra trees is a supervised machine learning technique that is similar to random forest and can be used for regression and classification. In a random forest, only a random subset of the features is considered for splitting at each node. The trees can be made even more random by using random thresholds for each feature instead of searching for the best possible thresholds. A forest of such highly random trees is called an extremely randomized trees (extra-trees) ensemble. This strategy trades more bias for lower variance. It also makes extra-trees much faster to train than standard random forests, since finding the best threshold for every feature at each node is one of the most time-consuming parts of growing a tree34.
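The difference in split selection can be seen directly in scikit-learn. The sketch below assumes the ExtraTreesRegressor class (the text does not name the class that was used) and synthetic data; it only inspects how the trees of each ensemble choose their thresholds.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor

# Synthetic data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(150, 4))
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=150)

extra = ExtraTreesRegressor(n_estimators=50, random_state=0).fit(X, y)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Extra-trees draw thresholds at random; random-forest trees search for the best one.
extra_splitter = extra.estimators_[0].splitter
forest_splitter = forest.estimators_[0].splitter

extra_prediction = extra.predict(X)
```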

Gradient boosting

Gradient boosting is a supervised ensemble ML method that can be used for regression and classification. The term “ensemble” refers to methods, such as random forest, extra trees, and gradient boosting, that build a final model from several individual models. Gradient boosting trains the individual models sequentially, fitting each new model to the errors left by the models before it, so that training concentrates on the instances that are hard to predict. In this way the sequence of models gradually reduces a loss function, which is minimized in a similar way as in an ANN model35. Gradient boosting regression (GBR) offers several advantages, notably high prediction accuracy and stable output. The additive training of the boosted model can be written in forward stagewise form as:

$$\begin{aligned} & \hat{y}^{(0)} = 0 \\ & \hat{y}^{(1)} = \nu f_{1}\left(x;\Theta_{1}\right) = \hat{y}^{(0)} + \nu f_{1}\left(x;\Theta_{1}\right) \\ & \hat{y}^{(2)} = \nu \sum_{j=1}^{2} f_{j}\left(x;\Theta_{j}\right) = \hat{y}^{(1)} + \nu f_{2}\left(x;\Theta_{2}\right) \\ & \ldots \\ & \hat{y}^{(T)} = \nu \sum_{j=1}^{T} f_{j}\left(x;\Theta_{j}\right) = \hat{y}^{(T-1)} + \nu f_{T}\left(x;\Theta_{T}\right) \end{aligned}$$ (2)

where T is the number of regression trees (RTs) used for boosting; \(\Theta_j\) is the structure of the jth RT; ν is the shrinkage parameter (also known as the learning rate, which satisfies 0 < ν < 1 and shrinks the contribution of each RT); \({\widehat{y}}^{(j)}\) is the estimate of the target variable given by the first j RTs; and \({f}_{j}\) is the output of the jth RT without shrinkage, which uses the predictor variables x to approximate \(y-{\widehat{y}}^{(j-1)}\) (i.e., the residuals) with tree structure \(\Theta_j\). As the number of RTs grows, the residuals normally decrease. Figure 6 shows a schematic diagram of the gradient boosting procedure36.

Figure 6 Schematic diagram of gradient boosting procedure.
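The additive scheme of Eq. (2) can be sketched in a few lines, assuming squared-error loss so that each tree is fitted to the current residuals. The synthetic data, the tree depth, and the values of T and ν are assumptions for the example; this is not the library implementation used in the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

T = 50    # number of regression trees
nu = 0.1  # shrinkage parameter, 0 < nu < 1

y_hat = np.zeros_like(y)  # y_hat^(0) = 0
trees, mse_history = [], []
for j in range(T):
    residuals = y - y_hat  # y - y_hat^(j-1)
    f_j = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    y_hat = y_hat + nu * f_j.predict(X)  # y_hat^(j) = y_hat^(j-1) + nu * f_j
    trees.append(f_j)
    mse_history.append(float(np.mean((y - y_hat) ** 2)))

def predict(X_new):
    """Final model: nu times the sum of all tree outputs, as in Eq. (2)."""
    return nu * sum(tree.predict(X_new) for tree in trees)
```

The training error recorded in `mse_history` shrinks as trees are added, which mirrors the statement that the residuals decrease as the number of RTs grows.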

Extreme gradient boosting (XGB)

Extreme gradient boosting, often called XGBoost, was developed by Tianqi Chen as an ML method that can be used for regression and classification. XGBoost is a gradient boosting approach that differs from a standard gradient boosting model in several ways: (1) because tree construction is multithreaded, XGBoost is generally faster than gradient boosting, and (2) because it can handle missing values in the dataset, data preprocessing takes less time37. The XGBRegressor class from the xgboost package was used to implement this model.

ANN-MLP

The network approach to analyzing data dates back to the early 1940s, when the ANN structure was first applied to a variety of problems. Scientists are still working to improve their understanding of how the human brain works in order to create the next generation of neuroscience-inspired machine learning38. One benefit of neural networks is that they need less time to solve complicated problems. When no explicit relation between the data is known, ANNs, which are patterned after the biological human brain, are used to discover one. Neural networks have the following characteristics: highly parallel computation, nonlinear calculation, generality, mapping between input and output data, adaptability, the capacity to handle large data, error tolerance, and the ability to be trained39. The approach is modeled on human neural anatomy: McCulloch and Pitts devised the ANN based on the activity of the actual elements of the brain, and the processing in a neural network resembles the operation of neurons in the human brain40. In an ANN, the functioning of these neurons is represented quantitatively. The terms neural network (NN) and ANN are used interchangeably from here on.

NNs have two main applications: discovering a relationship between a group of quantitative inputs (features) and outputs (targets), and clustering. In general, an NN is made up of a set of “neurons” arranged in a layered architecture. Every input and output variable corresponds to a node, which functions similarly to a real neuron. Nodes are organized into layers, and the input and output layers are linked through hidden layers. The architecture of the NN specifies the number of hidden layers that link the input layer to the output layer and the number of nodes in each. A weight \(w_{ij}\) describes the link between two nodes, where i and j denote the nodes in the source and destination layers, respectively41.

The ANN approach is also one of the most widely used techniques in nonlinear applications. Its strengths include nonlinearity, classification, identification, data analysis, and optimization. The network is trained on experimental data, and all parameters of the network model are optimized to achieve the best result. The aim in an ANN is to obtain the proper weights (w) for a specific function (f). Every input \(x_i\) is multiplied by the relevant weight, the products are added together, and the threshold or bias (b) is then added to the sum, as the following equation shows for the input data:

$$sum=\left(\sum \limits_{i=1}^{N}{w}_{i}{x}_{i}\right)+b$$ (3)

The output, y, is obtained by passing this sum through a transfer function, f, as given in Eq. (4):

$$y=f\left(sum\right)$$ (4)

The common transfer functions are step, ReLU, Leaky ReLU, hyperbolic tangent, and sigmoid (S-shaped).
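The weighted sum of Eq. (3) followed by a transfer function can be sketched for a single neuron as follows. The input, weight, and bias values are arbitrary, and the slope `alpha` of the Leaky ReLU is an assumed, commonly used default.

```python
import numpy as np

def step(z):
    return np.where(z >= 0.0, 1.0, 0.0)

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):  # alpha: assumed slope for negative inputs
    return np.where(z > 0.0, z, alpha * z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b, f):
    """Weighted sum of the inputs plus bias (Eq. 3), passed through f."""
    total = np.dot(w, x) + b
    return f(total)

# Arbitrary example values (assumption).
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.2, 0.4, -0.1])
b = 0.1

outputs = {
    "step": neuron_output(x, w, b, step),
    "relu": neuron_output(x, w, b, relu),
    "leaky_relu": neuron_output(x, w, b, leaky_relu),
    "tanh": neuron_output(x, w, b, np.tanh),
    "sigmoid": neuron_output(x, w, b, sigmoid),
}
```

For these values the weighted sum is negative, so the step and ReLU outputs are zero while the Leaky ReLU, hyperbolic tangent, and sigmoid still pass a nonzero signal.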

Optimization algorithms, or optimizers, are critical components in improving the performance of an NN. They adjust the trainable parameters of a model according to its design. An optimizer is defined by its update rule together with the hyperparameters, such as the learning rate, that control the behavior of that rule; this combination of hyperparameters and update rule is what distinguishes any two optimizers. Throughout the training phase, an optimizer adjusts the weights of the model’s nodes, and in adaptive methods the effective learning rate as well, to minimize the loss function. In short, the primary aim of an optimizer is to minimize the training error42. The procedure for optimizing the ANN models is summarized in Fig. 7.

Figure 7 Different stages for optimizing the ANN models.

Overfitting and long training times are two significant difficulties in training multi-layered neural networks, especially in deep learning. Overfitting occurs when a model performs well on the training data but poorly on the test data; in other words, the model has a low training error but a high test error. Regularization is a collection of approaches for reducing overfitting. Dropout counters overfitting in deep learning by randomly changing the network architecture during training, which lowers the risk that the learned weight values are excessively tailored to the underlying training data and consequently do not generalize well to the test data. Dropout thus simulates model ensembling without the need for several networks43.
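A minimal sketch of dropout on a layer of activations is given below. It uses the common "inverted dropout" formulation, in which the surviving activations are rescaled during training; this formulation and the rate of 0.5 are assumptions, since the text does not give the dropout settings used.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return activations  # at test time the full network is used
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the kept units so the expected activation is unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((1000, 64))  # dummy hidden-layer activations (assumption)
dropped = dropout(hidden, rate=0.5, rng=rng)
```

Because a different random mask is drawn at every training step, each step effectively trains a different thinned network, which is the sense in which dropout imitates an ensemble.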

The Adam optimizer was used to train the network. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited to problems with large amounts of data and/or parameters. Its hyperparameters have intuitive interpretations and usually need little tuning44.
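The update rule behind Adam can be sketched as below, applied to a simple quadratic objective rather than to a neural network. The objective, the learning rate of 0.1, and the number of steps are assumptions for illustration; `beta1`, `beta2`, and `eps` are set to the commonly quoted default values.

```python
import numpy as np

def adam_minimize(grad, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimise a function given its gradient, using the Adam update rule."""
    w = np.asarray(w0, dtype=float).copy()
    m = np.zeros_like(w)  # first-moment estimate (mean of gradients)
    v = np.zeros_like(w)  # second-moment estimate (mean of squared gradients)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero start
        v_hat = v / (1.0 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Toy objective (assumption): f(w) = ||w - target||^2, gradient 2 (w - target).
target = np.array([3.0, -2.0])
w_final = adam_minimize(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])
```

Dividing by the square root of the second-moment estimate is what makes the step size insensitive to a rescaling of the gradients.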

Radial basis function (RBF)

The radial basis function (RBF) neural network is a feedforward network with a single hidden layer, first proposed by Broomhead and Lowe45. When the nonlinearities of an RBF network are fixed in advance, training reduces to solving an overdetermined system of linear equations, for which highly stable methods exist. RBF networks have a solid theoretical foundation because they are closely related to the well-studied regularization theory of linear models46. The hidden layer receives the data from the input layer and passes them through a Gaussian transfer function, which transforms them nonlinearly; the nonlinear transfer functions thus link the input and hidden layers. Each hidden neuron computes the Euclidean distance between its center and the input vector. Equation (6) gives the output layer of the RBF network, which combines the hidden-neuron outputs linearly:

$$f(x)=\sum \limits_{i=1}^{N}{w}_{ij}G\left(\Vert x-{c}_{i}\Vert *b\right)$$ (6)

where N is the number of training data sets, \(w_{ij}\) is the weight assigned to each hidden neuron, x is the input vector, \(c_i\) are the center points, and b is the bias. The Gaussian function in Eq. (7) gives the response of a hidden neuron as a function of the distance from its center:

$$G\left(\Vert x-{c}_{i}\Vert *b\right)=\exp\left(-\frac{1}{2{\sigma }_{i}^{2}}{\left(\Vert x-{c}_{i}\Vert *b\right)}^{2}\right)$$ (7)

Here \(\sigma_i\) is the spread of the Gaussian function; it sets the range of \(\Vert x-{c}_{i}\Vert\) within the input domain over which the RBF neuron responds. The number of neurons in an RBF network is typically chosen by trial and error: the algorithm begins with a large number of neurons in the single hidden layer, and this number is then reduced as far as possible while the MSE stays at its minimum.
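The sketch below evaluates the Gaussian hidden layer of Eq. (7) and the linear output of Eq. (6) with NumPy. The fixed, evenly spaced centers, the common spread, b = 1, and the least-squares fit of the output weights are all assumptions for this example; the study itself trained the network with a gradient-based optimizer.

```python
import numpy as np

def rbf_design_matrix(X, centers, sigmas, b=1.0):
    """G[n, i] = exp(-(||x_n - c_i|| * b)^2 / (2 * sigma_i^2)), as in Eq. (7)."""
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return np.exp(-((distances * b) ** 2) / (2.0 * sigmas[None, :] ** 2))

# Synthetic data (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(100, 1))
y = np.sin(X).ravel()

centers = np.linspace(-3.0, 3.0, 10).reshape(-1, 1)  # assumed fixed centers
sigmas = np.full(10, 0.8)                            # assumed common spread

G = rbf_design_matrix(X, centers, sigmas)
# With fixed nonlinearities, the output weights solve a linear least-squares problem.
weights, *_ = np.linalg.lstsq(G, y, rcond=None)
y_pred = G @ weights  # Eq. (6): weighted sum of the Gaussian responses
mse = float(np.mean((y - y_pred) ** 2))
```

In a trial-and-error search, the number of centers would be lowered step by step while watching `mse`.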

In this work, the RMSprop optimizer was used to train the network. RMSprop and Adadelta were developed concurrently but independently, both with the aim of overcoming Adagrad’s diminishing learning rates. RMSprop is a gradient-based optimizer that, rather than treating the learning rate as a fixed hyperparameter, uses an adaptive learning rate that varies over time47.
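The adaptive step of RMSprop can be sketched as follows on a toy quadratic objective. The objective and the values of `lr`, `rho`, `eps`, and `steps` are assumptions for illustration and are not the settings used to train the RBF network.

```python
import numpy as np

def rmsprop_minimize(grad, w0, lr=0.01, rho=0.9, eps=1e-8, steps=2000):
    """Minimise a function given its gradient, using the RMSprop update rule."""
    w = np.asarray(w0, dtype=float).copy()
    avg_sq = np.zeros_like(w)  # running average of squared gradients
    for _ in range(steps):
        g = grad(w)
        avg_sq = rho * avg_sq + (1.0 - rho) * g ** 2
        # Each parameter gets its own effective learning rate lr / sqrt(avg_sq).
        w = w - lr * g / (np.sqrt(avg_sq) + eps)
    return w

# Toy objective (assumption): f(w) = ||w - target||^2, gradient 2 (w - target).
target = np.array([3.0, -2.0])
w_final = rmsprop_minimize(lambda w: 2.0 * (w - target), w0=[0.0, 0.0])
```

Because the running average forgets old gradients, the effective learning rate does not shrink toward zero as it does in Adagrad; with a constant `lr` the iterate ends up hovering close to the minimum rather than settling on it exactly.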

Error metric

The performance of the models is compared using the following metrics: RMSE, \(R^2\), MSE, and MAE. Ultimately, \(R^2\) is the criterion used to select the best model.

Mean absolute error (MAE) is the mean of the absolute differences between the estimated and actual data, and is calculated as follows:

$$MAE= \frac{1}{n}{\sum }_{i=1}^{n}\left|{y}_{i}-{\widehat{y}}_{i}\right|$$ (8)

Mean squared error (MSE) is, as the name implies, the mean of the squared errors. MSE can also serve as a loss function to be minimized. It is often used in real-world machine learning applications because large errors are penalized more heavily when MSE rather than MAE is the objective function35.

$$MSE=\frac{1}{n}\sum_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}$$ (9)

Root mean square error (RMSE) is the square root of MSE35.

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}}$$ (10)

Coefficient of determination (\(R^2\)) measures how well the model fits the experimental results. The closer \(R^2\) is to 1, the better the predictions fit the experimental data. \(R^2\) is calculated as follows48:

$${R}^{2}=1-\frac{\sum_{i=1}^{n}{({y}_{i}-{\widehat{y}}_{i})}^{2}}{\sum_{i=1}^{n}{({y}_{i}-\overline{y})}^{2}}$$ (11)

where \(\overline{y}\) is the mean of the actual values \({y}_{i}\).
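The four metrics can be computed together as in the sketch below. MAE, MSE, and RMSE follow Eqs. (8) to (10); \(R^2\) is computed in its standard form, one minus the ratio of the residual sum of squares to the total sum of squares about the mean of the actual values, which is an assumption about the intended definition. The sample values are made up for the example.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for paired actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    mse = np.mean(errors ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up values for illustration.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 9.0])
# MAE = 0.25, MSE = 0.125, RMSE ≈ 0.354, R2 = 0.975
```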