IBM SPSS Modeler is a visual data science and machine learning tool. It lets users mine data and build models with ready-to-use algorithms. Figure 4 shows the steps performed in IBM SPSS Modeler; how each step was carried out and the results obtained are explained in the following sections.

Figure 4 Steps performed in IBM SPSS Modeler for the prediction of the external corrosion rate of carbon steel in soil.

Data preparation

Data preparation consists of three steps: collecting data, classifying data, and dividing the dataset for training, testing, and validation. Before data are collected, the purpose of the study must be defined. In this study, the objective is to predict the corrosion rate, and the predictors are the factors affecting underground corrosion. Table 1 summarizes the predictors, the target, and the investigated range of each predictor. Many factors affect soil corrosion, and it is difficult to identify all of them. A manageable number of factors was chosen for this study because they are known to affect corrosion and are easy to control in experiments.

Table 1 Defined predictors and targets to predict the corrosion rate of carbon steel in soil.

In machine learning, more data generally helps, because additional training examples tend to yield a better model. If the data are well prepared according to a basic data preparation checklist, they are ready for machine learning and more accurate results can be expected. In this study, some data were collected from previous studies and supplemented with electrochemical experiments using the setup in Fig. 5; the combined dataset was considered sufficient for an accurate result.

Figure 5 Three-electrode setup for collecting data.

Carbon steel SPW400 was used as the working electrode, with a composition of 0.04 wt.% S, 0.04 wt.% P, 0.25 wt.% C, and balance Fe (Korean Standard). This material is commonly used in soil applications. The test environment was deionized water with variations in chemical composition and pH. Chloride, bisulfide, and pH were adjusted using NaCl, Na2S, NaOH, and saturated boric acid, as listed in Table 1. Temperature was controlled using a heating plate. The cell consisted of carbon steel SPW400 as the working electrode, a saturated calomel electrode as the reference electrode, and two pure graphite electrodes as the counter electrode. Samples were polished with SiC from 200 to 600 grit, and the surface was covered with silicone paste to leave 1 cm² of carbon steel exposed. After the sample was dry, it was held at open-circuit potential (OCP) for 3 h, and a potentiodynamic scan was then run from −0.25 V vs. OCP to 1 V vs. OCP at a scan rate of 0.166 mV/s. The resulting potentiodynamic polarization curves (Fig. 6) were analyzed with the Tafel method to find the corrosion current density of each experiment. All collected data are given in Table 2; experiments 1–13 were run in this study, and experiments 14–43 were collected from our previous studies10,25.

Figure 6 Potentiodynamic polarization curves of carbon steel with variation in pH, chloride, and bisulfide.

Table 2 Corrosion current density as affected by pH, chloride, bisulfide, sulfate, and temperature.

After the factors were collected and classified, the dataset was split into a training set, a testing set, and a validation set in the partition step. The training set is used to train the model; the algorithms learn the model from it. The validation set is used to evaluate the model periodically during training, and the model parameters are adjusted based on these evaluations. To judge whether the trained model is good, it must be evaluated on a separate test set. In general, validation data help tune the algorithms, and testing data provide the final assessment. In this study, 70% of the dataset was used as the training set, 15% as the testing set, and 15% as the validation set.
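The 70/15/15 partition can be illustrated with a minimal Python sketch. This is not the SPSS Modeler partition node: the function name `partition`, the seed, and the rounding rule are assumptions made for illustration, and a tool that assigns records randomly by probability may produce slightly different set sizes.

```python
import random

def partition(records, train=0.70, test=0.15, seed=0):
    """Shuffle records and split them into training, testing, and validation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    # Whatever remains after training and testing becomes the validation set
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

# The 43 experiments, identified here only by their number in Table 2
experiments = list(range(1, 44))
train_set, test_set, valid_set = partition(experiments)
print(len(train_set), len(test_set), len(valid_set))
```

With 43 records the shares cannot be met exactly, so the remainder after rounding is assigned to the validation set in this sketch.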

Implementation and evaluation

The selected algorithms (ANN, CHAID decision tree, linear regression, and stacking ensemble) were run after the data preparation step. The results of each single algorithm and each ensemble learning method are described below.

Artificial neural network (ANN) algorithms

ANNs are mathematical models inspired by biological neurons. An ANN consists of groups of artificial neurons (nodes) that are connected to each other and process information by passing it along the connections and calculating new values at the nodes. ANNs are also tools for modeling nonlinear statistical data.

The two main types of ANN architectures are feed-forward and feedback networks. In a feed-forward network, signals flow in one direction only, whereas in a feedback network signals can loop back and be processed repeatedly. Feed-forward networks are less computationally complicated and are considered less accurate than feedback networks. The traditional feed-forward network is suitable for modeling relationships between input data and one or more output responses, especially for soil35,36.

The network architecture contains three types of layers: the input layer, the hidden layer, and the output layer. After the type of network architecture is selected, the number of hidden layers and the number of units in each layer must be determined. In this study, the input layer has five units, one for each factor (temperature, chloride, sulfate, bisulfide, pH), and the output layer has one unit for predicting the corrosion current density of carbon steel. One hidden layer was chosen because this is sufficient for most problems. The number of units in the hidden layer can vary, and several empirical rules exist; a common one is that “the optimal size of the hidden layer is usually between the size of the input and the size of the output”37. Because the input size is five and the output size is one, hidden layers with one to five units were tested to determine the best model.

After the complete network architecture shown in Fig. 7 is built, the weights and thresholds of all neurons must be determined. Each node \({x}_{i}\) in the input layer is connected to each node \({H}_{j}\) in the hidden layer, and each connection is assigned a weight \({w}_{ij}\). At each hidden node, the weighted sum of the inputs is calculated as \({F}_{j}=\sum_{i}{w}_{ij}{x}_{i}\). The value \({F}_{j}\) is then transformed by an activation function, such as a sigmoid function. This process is repeated layer by layer until the output layer is reached, and the connection weights are adjusted until the mean squared error is minimized. Backpropagation, Levenberg–Marquardt, and the conjugate gradient method are three common learning algorithms. Backpropagation (BP) is the method used in this study.
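The forward computation described above can be sketched in a few lines of Python for a network with five inputs, two hidden units, and one output. The weights below are random placeholders, not the trained values from this study; the linear output unit, the zero thresholds, and the example input vector are likewise assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a network with one hidden layer and a single output."""
    # F_j = sum_i w_ij * x_i (plus a threshold), followed by the sigmoid activation
    hidden = [sigmoid(sum(w * xi for w, xi in zip(w_j, x)) + b_j)
              for w_j, b_j in zip(w_hidden, b_hidden)]
    # Linear output unit (an assumption) for a continuous target
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out

rng = random.Random(0)
n_inputs, n_hidden = 5, 2
# Untrained placeholder weights; backpropagation would adjust these
w_hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_hidden)]
b_hidden = [0.0] * n_hidden
w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
b_out = 0.0

# Placeholder scaled inputs: temperature, chloride, sulfate, bisulfide, pH
x = [0.5, 0.2, 0.1, 0.0, 0.7]
print(forward(x, w_hidden, b_hidden, w_out, b_out))
```

Training would repeat this forward pass over the training set and update `w_hidden` and `w_out` to reduce the mean squared error.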

Figure 7 ANN architecture for predicting external corrosion current density of carbon steel in soil.

Three variants of the ANN algorithm were run in IBM SPSS Modeler (the standard model and the homogeneous ensembles, bagging and boosting), each with one to five units in the hidden layer. ANN boosting with two units in the hidden layer achieved the highest accuracy. The minimum error, maximum error, mean error, mean absolute error (MAE), standard deviation, and linear correlation were used to evaluate accuracy on the validation data (Fig. 8a).
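As a minimal sketch, the six evaluation measures listed above can be computed from paired measured and predicted values as follows. The sign convention (predicted minus measured), the use of the sample standard deviation, and the four data points are assumptions for illustration; they are not values from this study.

```python
import math

def error_metrics(measured, predicted):
    """Return the error summary for paired measured and predicted values."""
    errors = [p - m for m, p in zip(measured, predicted)]  # predicted minus measured
    n = len(errors)
    mean_error = sum(errors) / n
    mae = sum(abs(e) for e in errors) / n
    # Sample standard deviation of the errors
    std = math.sqrt(sum((e - mean_error) ** 2 for e in errors) / (n - 1))
    # Pearson linear correlation between measured and predicted values
    mean_m = sum(measured) / n
    mean_p = sum(predicted) / n
    cov = sum((m - mean_m) * (p - mean_p) for m, p in zip(measured, predicted))
    var_m = sum((m - mean_m) ** 2 for m in measured)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "min_error": min(errors),
        "max_error": max(errors),
        "mean_error": mean_error,
        "mae": mae,
        "std": std,
        "r": cov / math.sqrt(var_m * var_p),
    }

# Made-up values for illustration only
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
metrics = error_metrics(measured, predicted)
print(metrics)
```

A low MAE with a mean error near zero indicates predictions that are close to the measurements without a systematic bias in either direction.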

Figure 8 Performance evaluation and comparison of (a) validation data and (b) prediction models for the single, boosting, and bagging models of ANN for modeling \({i}_{corr}\) of carbon steel in soil with variations in the number of hidden-layer units.

The accuracy metrics for each number of units show that boosting with two hidden units performs best on the validation data. Therefore, two hidden units and the boosting learning method were chosen for predicting the corrosion rate. However, in Fig. 8b the R, R², and adjusted R² values are highest for five hidden units, not two. This is not surprising: the five-unit model may fit the training data more closely than the two-unit model. On the validation data, however, the two-unit model prevails. The two-unit model is therefore still chosen, because the validation data were not used in training and thus measure how well the model generalizes.

As shown in Fig. 9, a sensitivity analysis of the ANN model with two units in the hidden layer indicates that the corrosion rate is most sensitive to temperature, chloride, and sulfate, whereas bisulfide and pH appear to have a very low effect10,38,39,40.

Figure 9 Chart of the standardized effects of temperature, chloride, sulfate, bisulfide, and pH on the corrosion current density of carbon steel in soil, as predicted by ANN with two units in the hidden layer.

Decision tree algorithms

The second algorithm chosen in this study is the decision tree. Decision tree learning is one of the earliest and most prominent machine learning algorithms. Decision trees use a tree structure to predict the value of an outcome variable. The output of a decision tree is easy to grasp, even for people without an analytical background, because no statistical knowledge is required to read and interpret it.

As illustrated in Fig. 10, the essential components of a decision tree model are nodes and branches, and the most significant steps in model construction are splitting, stopping, and pruning. There are three basic types of nodes: root nodes, internal nodes, and leaf nodes. The root node, also known as the decision node, represents a choice that splits the data into two or more subsets along its branches. Internal nodes, often called chance nodes, represent the options available at that point in the tree structure. A leaf node, also known as an end node, represents the result. The tree begins with the root node, which contains all the data, and then divides the data into branches according to the splitting criteria.

Figure 10 Simple decision tree structure for predicting external corrosion rate of carbon steel in soil.

In addition to the structure of the tree, building the model involves three steps: splitting, stopping, and pruning. When creating a model, the most important input variables are identified first, and the records at the root node and at subsequent internal nodes are then divided into two or more categories or buckets based on the values of these variables. This splitting is repeated until the stopping or homogeneity conditions are reached. In most cases, not all candidate input variables are used to construct the decision tree, and in some cases a single input variable is used several times at different levels of the tree. Several algorithms have been developed to build decision trees. Common ones are classification and regression trees (CART), iterative dichotomiser 3 (ID3), C4.5, and the Chi-square automatic interaction detector (CHAID). CHAID was used in this study.

The Chi-square automatic interaction detector (CHAID) is an algorithm that generates a decision tree using Chi-square statistics to determine the optimal splits. Continuous predictors are divided into categories with an approximately equal number of observations, whereas categorical predictors keep their original categories, which are merged when they do not differ significantly with respect to the target. For each categorical predictor, CHAID performs all potential cross-tabulations until the best result is obtained and no further splitting is possible. The CHAID approach can be used to visualize the relationships between the split variables and the associated factors within the tree.
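The role of the Chi-square statistic in choosing a split can be illustrated with a small Python sketch. It computes the Pearson Chi-square statistic of a cross-tabulation and compares two hypothetical candidate predictors. The counts are made up, the target is assumed to be discretized into two classes for illustration, and a full CHAID implementation compares adjusted p-values and merges categories, which this sketch does not do.

```python
def chi_square(table):
    """Pearson Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the predictor and the target were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: two categories of a candidate predictor; columns: low / high corrosion class.
# Hypothetical counts for illustration only.
candidates = {
    "predictor_a": [[18, 2], [3, 17]],
    "predictor_b": [[10, 10], [11, 9]],
}

scores = {name: chi_square(table) for name, table in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because both tables have the same degrees of freedom, the larger statistic corresponds to the smaller p-value, so `predictor_a` would be preferred as the split variable in this example.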

The standard model and the two homogeneous ensemble learning methods (bagging and boosting) were run with different maximum tree depths. For maximum tree depths of 5 and above, the predicted values were identical, indicating that the trees did not grow beyond a depth of 5. Table 4 summarizes the evaluation of the three models in terms of minimum error, maximum error, mean error, mean absolute error, standard deviation, and linear correlation. On the validation dataset (Fig. 11a), the boosting method with a tree depth of 3 had the lowest MAE and the best values of all other parameters. The CHAID decision tree model with a tree depth of 3 also has the highest R, R², and adjusted R² values in Fig. 11b. Therefore, the boosting ensemble learning method with a tree depth of 3 was chosen as the optimal CHAID decision tree model.

Figure 11 Performance evaluation and comparison of (a) validation data and (b) prediction models for the single, boosting, and bagging models of the CHAID decision tree for modeling \({i}_{corr}\) of carbon steel in soil with variations in tree depth.

In the sensitivity analysis of the CHAID decision tree model, temperature remains the dominant factor affecting the rate of soil corrosion. Chloride and sulfate rank next, while pH and bisulfide remain the two least influential factors, as shown in Fig. 12.

Figure 12 Chart of the standardized effects of temperature, chloride, sulfate, bisulfide, and pH on the corrosion current density of carbon steel in soil, as predicted by the CHAID decision tree with a tree depth of three.

Linear regression algorithms

The final algorithm applied in this study is linear regression (LR). Simple linear regression uses one predictor variable (X) to model the response variable (Y). In this case, however, the response variable, the corrosion current density of carbon steel, is affected by more than one predictor variable, so multiple linear regression must be used. Multiple linear regression is a method for studying the relationship between several predictor variables and one response variable, and it is often used for prediction in supervised machine learning. Given the data points, its main objective is to find the best-fit line (for several predictors, a hyperplane). The general formula of the multiple linear regression model is:

$$Y={\beta }_{0}+\sum_{i=1}^{n}{\beta }_{i}{X}_{i}+\varepsilon$$ (1)

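In this study, \(Y\) (corrosion current density) is the output (response) variable, \({\beta }_{0}\) is the intercept, \({\beta }_{i}\) are the coefficients of the model, and \(\varepsilon\) is the error term. \({X}_{1}\) (temperature), \({X}_{2}\) (chloride), \({X}_{3}\) (pH), \({X}_{4}\) (sulfate), and \({X}_{5}\) (bisulfide) are the independent variables. After the labeled data were entered into the software, the following equation was obtained: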

$$Y=0.036 \, pH+0.489\, Chloride+0.078\, Bisulfide-0.179\, Sulfate+0.652\, Temperature$$ (2)
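The coefficients of Eq. (2) can be inspected with a short Python sketch. It assumes that they are standardized (beta) coefficients, which the absence of an intercept suggests, so that the predictors enter as z-scores and the result is a standardized response. The function name and the example input are hypothetical.

```python
# Coefficients of Eq. (2); assumed here to be standardized (beta) coefficients
coefficients = {
    "pH": 0.036,
    "Chloride": 0.489,
    "Bisulfide": 0.078,
    "Sulfate": -0.179,
    "Temperature": 0.652,
}

def predict_standardized(z_scores):
    """Evaluate Eq. (2) for predictors given as z-scores; returns a standardized response."""
    return sum(coefficients[name] * z for name, z in z_scores.items())

# Rank the predictors by the magnitude of their coefficient
ranking = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)
print(ranking)

# Example: temperature one standard deviation above its mean, all other predictors at their mean
z = {name: 0.0 for name in coefficients}
z["Temperature"] = 1.0
print(predict_standardized(z))
```

Under this assumption, the magnitudes rank temperature first and chloride second, with sulfate, bisulfide, and pH well behind.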

Because linear regression fits a single equation to the data, resampling the same dataset yields essentially the same equation, so homogeneous ensemble learning is not effective for this algorithm. The accuracy of the equation is listed in Table 3.

Table 3 Summary of linear regression model for predicting \({i}_{corr}\) of carbon steel in soil.

The R-value represents the correlation between the predicted and measured values; it was 0.928 (92.8%), indicating a very high degree of correlation. The R² value indicates the percentage of the total variation in the response variable that is explained by the predictors.

As shown in Table 4, ANOVA reports how well the regression equation fits the data, and it shows that the regression model predicts the response variable well. The Regression row and the Sig. column give the statistical significance of the regression model. Here, p < 0.05 indicates that the regression model is a statistically significant predictor of the response variable (it fits the data). To confirm the statistical significance of the overall fit, the obtained F-value was compared with the F-critical value. The F-critical value is read from the F-distribution table at the intersection of the degrees of freedom (df) of the F numerator and the df of the F denominator (error).

df of F numerator = number of beta parameters in the regression model − 1 = 6 − 1 = 5

df of F denominator = n − number of beta parameters in the regression model = 43 − 6 = 37.

Table 4 ANOVA of linear regression model for predicting \({i}_{corr}\) of carbon steel in soil.

From the F-distribution table at the 5% significance level, the F-critical value for (5, 37) degrees of freedom lies between 2.450 and 2.534, the tabulated values for denominator df of 40 and 30. The F-value from the ANOVA table was 45.881, which is much higher than F-critical. This indicates that the overall regression model is statistically significant, that is, pH, chloride, bisulfide, sulfate, and temperature are jointly significant predictors of the corrosion rate.
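As a consistency check, the degrees of freedom and an approximate F-value can be recomputed from quantities stated in the text alone: R = 0.928, n = 43 experiments, and five predictors. The sketch below uses the standard relations between R², adjusted R², and the overall F statistic. The function name is hypothetical, and because R is rounded to three decimals the result only approximates the values in Tables 3 and 4.

```python
def regression_summary(r, n, p):
    """Derive R^2, adjusted R^2, degrees of freedom, and the overall F statistic."""
    r2 = r ** 2
    df_num = p            # number of beta parameters minus one
    df_den = n - p - 1    # n minus number of beta parameters
    adj_r2 = 1 - (1 - r2) * (n - 1) / df_den
    f_value = (r2 / df_num) / ((1 - r2) / df_den)
    return r2, adj_r2, df_num, df_den, f_value

r2, adj_r2, df_num, df_den, f_value = regression_summary(r=0.928, n=43, p=5)
print(round(r2, 3), round(adj_r2, 3), df_num, df_den, round(f_value, 1))

# Upper end of the tabulated 5% critical range quoted in the text
f_critical_upper = 2.534
print(f_value > f_critical_upper)
```

The recomputed F-value is about 45.9, close to the reported 45.881; the small difference is consistent with rounding of R.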

Table 5 provides the information needed to predict the corrosion rate from the five factors and, through the Sig. column, to determine whether each explanatory variable contributes to the model in a statistically significant manner. The results in Table 5 show that three coefficients, those of chloride, temperature, and sulfate, are statistically significant (p < 0.05). The values in column B under the unstandardized coefficients can be used to write the regression equation. However, since the variables in this study have different units, it is most appropriate to use the standardized coefficients. The MAE of the linear regression model is 3.696, which is a reasonably good value.

Table 5 Coefficients of linear regression model for predicting \({i}_{corr}\) of carbon steel in soil.

In the LR model, temperature is still the factor to which the corrosion rate of carbon steel is most sensitive. In Fig. 13, chloride has a greater influence on the corrosion rate of carbon steel than in the two algorithms above, and the remaining three factors are still considered to have a minor influence on the corrosion rate.

Figure 13 Chart for the standardized effects of temperature, chloride, sulfate, bisulfide, and pH, as predicted by LR on the corrosion current density of carbon steel in soil.

Heterogeneous ensemble learning—stacking

The three individual algorithms and their homogeneous ensembles in this study are simple and easy to implement, and their predictions are quite accurate. Nevertheless, the heterogeneous ensemble learning method was applied in this section to improve the accuracy as much as possible.

According to Fig. 14a, ANN boosting with two units in the hidden layer gives the best ANN result, with an MAE of 3.797. Boosting with a tree depth of 3 gives the best CHAID decision tree result, with an MAE of 3.457, and linear regression gives an MAE of 3.696. Among the three algorithms, the CHAID decision tree therefore gave the best predictions. Heterogeneous ensemble learning was then applied to improve the model further. Combining the three best models above gave an MAE of 3.259, the smallest of all models run in this study. The stacking ensemble is also the most accurate model on the other metrics in Fig. 14b and in the summary of MAE performance matrices in Fig. 15, and its R, R², and adjusted R² values are the best. The training results of the four models and the predictions checked against the validation dataset are shown in Fig. 16.
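The idea of stacking can be illustrated with a minimal Python sketch in which a linear meta-learner combines the predictions of three base models. How SPSS Modeler combines the models is not described here, so the least-squares meta-learner, the function names, and all numbers below are assumptions made for illustration. In practice, the meta-learner should be fitted on predictions for data the base models did not see during training.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_meta_learner(base_predictions, targets):
    """Least-squares weights (intercept first) that combine base-model predictions."""
    rows = [[1.0] + list(p) for p in base_predictions]
    k = len(rows[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def stack_predict(weights, prediction_row):
    """Combine one row of base-model predictions into the stacked prediction."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], prediction_row))

# Made-up predictions of three base models (ANN, CHAID, LR) for five samples
base_predictions = [(10, 12, 9), (20, 18, 23), (31, 29, 28), (40, 43, 41), (52, 49, 50)]
measured = [11.0, 19.0, 30.0, 42.0, 50.0]  # made-up measured values

weights = fit_meta_learner(base_predictions, measured)
print(weights)
print(stack_predict(weights, (25, 27, 24)))
```

A simpler alternative is to average the base predictions with equal weights; the fitted weights let the meta-learner favor whichever base model is more reliable.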

Figure 14 Performance evaluation and comparison (a) validation data (b) model of methods.

Figure 15 Mean absolute error performance matrices for predicting carbon steel corrosion in soil.

Figure 16 Predicted value versus the measured external corrosion rate (a) training data (b) testing data of carbon steel in soil.

Finally, the effects of the two major factors, temperature and chloride, on the corrosion rate of carbon steel in soil are shown as a response surface and a contour plot according to the stacking ensemble model in Fig. 17.