### Clinical data sources

After obtaining the data access authorization from the Information Network Center of the Third Xiangya Hospital, we retrospectively collected the EHRs of 51,866 hospitalized patients from the Hospital Information System (HIS). The Third Xiangya Hospital of Central South University is located in Changsha City, Hunan Province. It is a grade A tertiary general hospital serving the middle reaches of the Yangtze River. This study was approved by the IRB of the Third Xiangya Hospital of Central South University (No: 21112). All data had been completely anonymized prior to our access to the HIS. Because the patients’ personally identifiable information had been effectively protected, the IRB of the Third Xiangya Hospital of Central South University waived the requirement for informed consent. The present study was conducted in accordance with the guidelines of the Declaration of Helsinki.

### Study population

The EHRs of 51,866 hospitalized patients with T2DM in the Third Xiangya Hospital from January 1, 2013 to December 31, 2020 were collected. Exclusion criteria: (1) under 18 years of age (n = 30), (2) diseases affecting Mb (n = 7457), (3) infectious diseases (n = 9258), (4) organ dysfunction (n = 4169), (5) special body status (n = 4810), and (6) missing urine microalbumin (UMALB), eGFR and/or serum Mb data (n = 25,387). Finally, 728 patients (DKD = 286, non-DKD = 442) were eligible for the study. Eligible patients were divided into training set (60%, earlier than March 9, 2016) and testing set (40%, later than March 9, 2016), including 437 (DKD = 158, non-DKD = 279) and 291 patients (DKD = 128, non-DKD = 163), respectively, based on hospitalization date. The inclusion and exclusion criteria of the study population are presented in Fig. 1, and the demographic and clinical features are available in Table 1.

### Data preparation

From the EHRs of T2DM, we extracted a total of 157 features in the following 7 domains: demographic information, hospitalization-related information, general information, laboratory tests, diagnosis, medication, and others. The data domain categories and descriptions are listed in Table 2. The details of feature calculation are shown in Table S1.

Diagnoses were organized using the ICD -9 and ICD-10 hierarchies. These diseases were diagnosed by experienced doctors who were licensed to practice medicine. T2DM was diagnosed according to the World Health Organization (WHO) criteria (1999)^{24}. DKD was diagnosed by the NKF/KDOQI (2007) classification^{25}. Serum Mb was tested by Mb-LATEX CN from Denka Seiken (Reference range 0–70 ng/mL).

The components of MetS have the following three sets of features: IR and β-cell function were evaluated by fasting C-peptide, 120 min C-peptide, Homeostatic Model Assessment of Insulin Resistance^{26} (HOMA-IR), Gutt index^{27}, triglyceride-glucose (TyG) index^{28}, fasting insulin to glucose ratio^{29} (I/G 0 min), 120 min insulin to glucose ratio^{29} (I/G 120 min), insulinogenic index^{30} (IGI), IGI to HOMA-IR ratio^{31} (IGI/HOMA-IR), and Homeostasis Model of Assessment for beta-cell function^{32} (HOMA-BETA). Glucose metabolism indicators included fasting glucose, 120 min glucose, glycated hemoglobin A1c (GHBA1C), and hemoglobin glycation index^{33} (HGI). Lipid metabolism was assessed by total cholesterol (TC), triglycerides (TG), low-density lipoprotein cholesterol (LDL-C), high-density lipoprotein cholesterol (HDL-C), HDL-C/TC, and TG/HDL-C ratio^{34}.

### ML algorithms

Two popular ML methods, extreme gradient boosting (XGBoost) and random forest (RF), were used to build DKD classification models based on EHR data. RF, first proposed by Breiman in 2001^{35}, is an ensemble of unpruned classification or regression trees created using bootstrap samples of training data and random feature selection in tree induction. XGBoost is an efficient and scalable implementation of the gradient boosting framework^{36}. It is used to develop models in a sequential stagewise fashion like other boosting methods and generalize them by allowing optimization of arbitrary differentiable loss functions. RF and XGBoost were applied using the randomForest (4.6–14)^{37} and xgboost (1.3.2.1)^{38} R packages, respectively. All ML methods used grid search and fivefold cross-validation for parameter adjustment. The final hyperparameters for the DKD models are listed in Table S2.

### Model validation

The evaluation indicators of the confusion matrix, including sensitivity (SE), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), recall, F1, and balanced accuracy (BA), were used to analyze the discrimination ability of DKD ML models. The probability scores output by the models were used to calculate the area under the receiver operating characteristic (ROC) curve (AUC). These statistical parameters are defined as follows:

$$SE=\frac{\mathrm{TP}}{TP+FN},$$

$$SP=\frac{\mathrm{TN}}{TN+FP},$$

$$\mathrm{PPV}=\frac{\mathrm{T}P}{TP+FP},$$

$$\mathrm{NPV}=\frac{\mathrm{T}N}{TN+FN},$$

$$\mathrm{Recall}=\frac{\mathrm{T}P}{TP+FN},$$

$$\mathrm{F}1=\frac{2\times PPV\times \mathrm{recall}}{PPV+recall},$$

$$BA=\frac{\mathrm{SE}+\mathrm{SP}}{2},$$

where true positive (TP) is the number of correctly classified objects, true negative (TN) is the number of correctly misclassified objects, false positive (FP) is the number of incorrectly classified objects, and false negative (FN) is the number of incorrectly misclassified objects.

### Feature evaluation and model interpretation

Feature importance of RF was estimated by (1) mean decrease in prediction accuracy without the feature in the model and (2) mean decrease in the Gini index, a measure of impurity of the dataset (i.e. risk of misclassifying data), by including the feature. For both parameters, the higher the score, the more important the feature.

We evaluated feature contributions to model predictions using an interpretability method called SHapley Additive exPlanation (SHAP). SHAP is based on the Shapley value, a concept introduced in the 1950s to measure each player’s contribution to a collaborative game^{39}. In 2017, Lundberg and Lee proposed a broadly applicable SHAP method to interpret a variety of models, including regression and classification^{40,41}. For binary classification in XGBoost, the output of the model is the log odds ratio, that is, the SHAP values sum up to the log loss of the model for each sample. The function is defined as follows

$$z={\varphi }_{0}+\sum_{i=1}^{M}{\varphi }_{i}{x}_{i}^{^{\prime}},$$

where \(z\) is the log odds ratio, \({x}^{^{\prime}}\) is a simplified binary feature vector containing values 1 or 0 for observed and missing features, M is the number of simplified input features, and \({\varphi }_{i}\) is the attribution value of the ith molecular descriptor or the SHAP value. In probability theory, \(z\) is the corresponding value of probability p after a certain transformation, and the transformation function is defined as follows

$$z=\mathrm{log}\frac{p}{1-p},$$

where \(p\) varies monotonically in the range (0, 1) and \(z\) varies from infinitesimal to infinity. A feature is predicted to be a risk factor if its \(z\)-value is greater than 0. The larger the z-value, the more likely it is a risk factor rather than a protective factor. Therefore, the SHAP method can help us understand the model more intuitively. The SHAPforxgboost (0.1.1) and fastshap (0.0.5) R packages were applied in the analysis.

### Decision curve analysis (DCA)

We used DCA to assess the clinical utility of ML models. A decision curve is generated by plotting net benefit against a range of threshold probabilities. Decision curves include benefits and harms on the same scale, so they can be directly compared, supporting comparisons between models. The net benefit is calculated as

$$Net benefit=\frac{True positive}{N}-\frac{True negtive}{N}\times \frac{ {P}_{t}}{1-{P}_{t}},$$

where N is the total sample size and \({P}_{t}\) represents the threshold probability. Two extreme strategies, classifying for all and classifying for none, were added for reference. The R package rmda (1.6) was used.

### RCS models

The association between serum Mb and DKD was evaluated on a continuous scale with RCS curves based on logistic regression models. To balance best fit and overfitting, the number of knots was chosen empirically to be 4. The serum Mb associated with the lowest risk of DKD was the concentration with the lowest odds ratio on the spline curve. Odds ratios were adjusted for gender, age, BMI, hyperlipidemia, hypertension, age-adjusted Charlson comorbidity index (ACCI), and hospitalization date.

### Causal mediation analysis

Causal mediation analysis identifies potential pathways to explain the observed association between a given exposure and a particular outcome^{42}, examining how a third intermediate variable, the mediator, is related to the observed outcome relationship. By using a counterfactual framework, causal mediation analysis allows us to decompose the total effect (TE) into direct effects (DE) and indirect effects (IE). In our hypothetical model, we defined TE as the effect of MetS components on renal function impairment, DE as the effect of MetS components on renal function impairment after removing the individual effect of Mb on renal function impairment, and IE as the effect of MetS components on renal function impairment specially associated with Mb. A directed acyclic graph (DAG) illustrating mediation process is shown in Fig. 5A. Through the causal mediation analysis, we estimated the mediation effect of serum Mb based on logistic regression in all eligible patients, DKD or non-DKD subgroups. According to previous DKD investigators^{43,44}, we adjusted and standardized the MetS components, Mb, UMALB, and eGFR for 7 variables: gender, age, BMI, hyperlipidemia, hypertension, ACCI, and hospitalization date. The proportion of serum Mb-mediated renal function impairment due to the MetS components was calculated as

$$Prop.Mediated\left(average\right)=\frac{\mathrm{log}\left(IE\right)}{\mathrm{log}\left(IE\right)+\mathrm{log}\left(DE\right)},$$

Causal mediation analysis was performed using the R package mediation (4.5.0).

### Statistical analysis

R software (version 4.0.3) was used as a statistical analysis tool in this study. Data are presented as median [interquartile range] for continuous variables or number (%) for categorical variables. Differences between groups were based on the Wilcoxon test for continuous variables and the chi-square test for categorical variables. Relationships between MetS components, serum Mb, and DKD were assessed with the Spearman’s correlation coefficient (r_{s}), controlling for 7 variables: gender, age, BMI, hyperlipidemia, hypertension, ACCI, and admission date. All statistical significance was defined as P < 0.1. Missingness was addressed by setting missing categorical data to 0, whereas for numerical data, imputation was completed using the predictive mean matching (PMM) strategy.

### Bias

Observer bias may arise from the data on patients hospitalized in a grade A tertiary general hospital, as these patients may have more severe than average DM or DKD. Besides, selection bias caused by exclusion of UMALB missing individuals brings DKD rate elevation but comparable cardiovascular risks, leading to limited representation (Table S3). However, the significance and possible mediator role of serum Mb in DKD, as a common rule, was mainly focused in our study, and we believe that the bias did not greatly affect our conclusion.

Source link