Date: 2022-12-05

Data source and clinical outcomes

Data source

A total of 1383 patients with basal ganglia hemorrhage (BGH) who were first diagnosed by computed tomography (CT) in the neurosurgery and emergency departments of our hospital between 1 August 2005 and 1 August 2021, and who had complete clinical data, were selected (Supplementary Table 1). The first issue was how to choose the time point of death that separates the high-risk group from the low-risk group. By examining the survival time of all patients who met the inclusion criteria, we found that the mortality rate dropped abruptly on the 5th and 7th days after BGH (Fig. 1). We then divided the observation period at each of these two time points and performed a two-sample independent t-test; both candidate cut-offs were significant, with day 5 after BGH (P < 0.001, t = 5.7789) giving a smaller t statistic than day 7 after BGH (P < 0.001, t = 6.4059). Therefore, we used survival of more than seven days after BGH as the criterion to separate low risk from high risk (Fig. 2). Patients were also assigned to the conservative treatment group (CTG) or the surgical treatment group (STG) according to whether surgery was chosen.

All procedures in this study conformed to the ethical standards of the institutional and national medical ethics committees and to the 1964 Declaration of Helsinki and comparable ethical standards. The study was approved by the Medical Ethics Committee of Sichuan Provincial People’s Hospital (approval number: 2022-154), which waived the requirement for written informed consent because of the retrospective nature of the study.

Figure 1 Distribution of time to death in patients with basal ganglia hemorrhage. The upper right corner shows deaths among study subjects within 30 days.

Figure 2 Flow diagram of study recruitment and exclusion.

Evaluation of clinical outcomes

Demographic data and clinical information were gathered retrospectively from medical records, including gender, age, body type, prehospital time, smoking history, stroke history, complications (hypertension, hyperlipidemia, diabetes mellitus, epilepsy, brain edema, brain herniation, subarachnoid hemorrhage (SAH), lobar hemorrhage, fracture), pupil status, hemorrhage breaking into ventricles (HBIV), hematoma volume measured by CT, and laboratory examination indexes (hemoglobin, high-sensitivity C-reactive protein [hsCRP], consciousness at admission, blood pressure, pulse, body temperature, intraoperative hematoma volume, bleeding volume, blood transfusion/fluid volume, and operation time). Laboratory indexes were taken from the first measurement at admission. Hematoma volume was calculated with the ABC/2 method, where A is the longest diameter (cm), B is the widest diameter (cm), and C is the sum of the thicknesses of the CT slices containing hematoma [15]. The patients were dichotomized into a “high-risk” group, with brain death declared within 7 days after admission, and a “low-risk” group, with survival longer than 7 days. Patients were also assigned to the CTG or STG according to whether a surgical decision (including decompressive craniectomy, external ventricular drainage, craniotomy evacuation of hematoma, and micro-invasive hematoma removal) was made. The relevant prognostic risk factors were then analyzed.
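As an illustration of the ABC/2 estimate described above, a minimal sketch is given below. The function names and the slice-thickness helper are hypothetical and are not taken from the study; the formula itself is the one stated in the text.

```python
def abc2_volume_ml(a_cm: float, b_cm: float, c_cm: float) -> float:
    """Estimate hematoma volume (mL) with the ABC/2 formula.

    a_cm: longest hematoma diameter (cm)
    b_cm: widest hematoma diameter (cm), as defined in the text
    c_cm: summed thickness of the CT slices containing hematoma (cm)
    """
    for name, value in (("a_cm", a_cm), ("b_cm", b_cm), ("c_cm", c_cm)):
        if value < 0:
            raise ValueError(f"{name} must be non-negative, got {value}")
    # 1 cm^3 equals 1 mL, so a product of three lengths in cm is already in mL.
    return a_cm * b_cm * c_cm / 2.0


def slice_thickness_sum_cm(n_slices: int, slice_thickness_mm: float) -> float:
    """Hypothetical helper: convert a slice count and thickness (mm) to C in cm."""
    return n_slices * slice_thickness_mm / 10.0
```

For example, with A = 5 cm, B = 4 cm, and hematoma visible on six 5 mm slices (C = 3 cm), the estimate is 5 × 4 × 3 / 2 = 30 mL.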

Hypertension was defined as a clear history of hypertension or multiple blood pressure measurements greater than 140/90 mmHg during this admission. According to the American Diabetes Association standard (2014), diabetes was diagnosed if any of the following applied: (1) fasting blood glucose above 7 mmol/L; (2) blood glucose above 11.1 mmol/L 2 h after an oral glucose tolerance test; (3) diabetes symptoms with random blood glucose of at least 11.1 mmol/L; (4) a history of diabetes mellitus or current use of hypoglycemic drugs. Abnormal liver function was defined as alanine aminotransferase > 50 IU/L. Hyperlipidemia was defined as triglyceride > 1.7 mmol/L or total cholesterol > 5.2 mmol/L, with or without LDL cholesterol > 3.1 mmol/L. Smoking history was defined as a reported history of more than two pack-years together with current smoking. Cardiovascular disease included a history of myocardial ischemia, a history of myocardial infarction, arrhythmia (atrial fibrillation, ventricular fibrillation, bundle branch block above grade II, etc.), and heart failure. Renal insufficiency was defined as eGFR < 60 mL/min/1.73 m² or a serum creatinine clearance rate ≤ 104 mmol/L within the latest half-year. Infection included pulmonary infection, urinary infection, HIV, HBV, and HCV.

Model algorithms

Overview of the framework

We propose an analytical framework for the BGH risk stratification problem (Fig. 3). It comprises three steps: data preprocessing, model construction, and interpretability analysis. Data cleaning, vector coding, and data resampling were first applied to the raw data. The dataset was divided into training and testing sets at a ratio of 7:3, and the pre-processed, balanced training data served as input for model construction. We built standard ML models using Boosting [16] and Bagging [17] methods, as well as fusion ML models that integrate the standard ML models, to improve risk stratification performance and provide effective clinical guidance. Finally, we chose the two best-performing models for the interpretability analysis of risk stratification.
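The 7:3 division can be sketched as below. This is a minimal illustration, not the study code: splitting within each class (stratification), the fixed random seed, and the function name are all assumptions, since the text only states the ratio.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_train_test(
    labels: Sequence[int], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for a 7:3 style split.

    Assumption: the split is done separately within each class so that the
    share of high-risk patients is similar in both parts.
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train: List[int] = []
    test: List[int] = []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)
```

Any resampling used to balance the classes would then be applied to the training indices only, leaving the test part untouched.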

Figure 3 Overall workflow summarizing the model algorithm.

Data preprocessing

We first cleaned the data using clinical a priori knowledge: samples that were contradictory or violated common sense were removed, and features that were irrelevant to the outcome or had very small sample sizes were excluded. We also standardized the coding of several variables. Binary features such as gender, smoking, and brain herniation were coded 0–1, with 0 for negative and 1 for positive. Among the multi-category features, blood type was coded 1–5 (four conventional blood types versus unchecked), and pupils were coded 1–6 according to whether they were equal and whether they were normal, reduced, dilated, etc. Prehospital time and operative time were standardized to units of days or hours. In addition, we used the K-nearest neighbor (KNN) algorithm to fill in missing values for some features [18]. A total of 40 features were selected, and each sample was classified as CTG or STG according to whether surgery was performed. Finally, we built the models and implemented a binary classification to predict patient risk. After data preprocessing, we obtained a dataset of 1383 records, of which 743 belonged to the CTG and 640 to the STG.
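To make the KNN filling step concrete, a minimal pure-Python sketch follows. The number of neighbors, the distance definition, and the function names are assumptions for illustration; the text does not state which settings or library were used.

```python
import math
from typing import List, Optional, Sequence

Row = Sequence[Optional[float]]


def _distance(a: Row, b: Row) -> float:
    """Root mean squared difference over the features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    # Averaging over shared features keeps rows with more overlap comparable.
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows: Sequence[Row], k: int = 3) -> List[List[Optional[float]]]:
    """Fill each missing value (None) with the mean of that feature among
    the k nearest rows in which the feature is observed (k is an assumption)."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )
            donors = [d for d in donors if d[0] < math.inf][:k]
            if not donors:
                raise ValueError(f"no donor rows for row {i}, feature {j}")
            filled[i][j] = sum(v for _, v in donors) / len(donors)
    return filled
```

Distances are computed on the original rows, so earlier imputations do not influence later ones. In practice the features would be put on a common scale first, otherwise features with large numeric ranges dominate the distance.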

ML models

We chose a traditional statistical method, logistic regression (LR), as a performance reference; three standard machine learning models based on Bagging or Boosting, namely Random Forest (RF) [19] for Bagging and XGBoost [20] and LightGBM [21] for Boosting; and three fusion ML methods for Electronic Health Records (EHR) data: the Weight, Stack [22], and Weight-Stack models.

1. Weight. An averaging-like approach assigns different weights to the prediction results of multiple models, aggregating their predicted probabilities to improve overall performance. Because both the detection rate of high-risk patients and the misclassification rate must be taken into account, we chose each model’s F1 score as its weight.
2. Stack. A model layering approach is used. Multiple lower-layer models first produce prediction probability vectors for the training set and for the test set; these are integrated separately and passed to the upper-layer model, which is retrained on them to obtain the final prediction labels. The upper-layer model thus takes prediction probabilities rather than the original data as input, and K-fold cross-validation is used internally to address possible overfitting.
3. Weight-Stack. This combines the Weight and Stack methods: the better-performing model is selected from the base models and weighted together with the Stack model, again using the F1 score as the weight.
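The Weight idea can be sketched as an F1-weighted average of the predicted probabilities. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the weights are normalized to sum to one, and the F1 scores are assumed to come from held-out validation data, none of which is specified in the text.

```python
from typing import List, Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive (high-risk, label 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def weighted_fusion(
    model_probs: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted average of per-model high-risk probabilities.

    model_probs[m][i] is model m's probability for patient i;
    weights[m] is that model's weight (here assumed to be its F1 score).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_patients = len(model_probs[0])
    return [
        sum(w * probs[i] for w, probs in zip(weights, model_probs)) / total
        for i in range(n_patients)
    ]
```

A model with a higher F1 score therefore pulls the fused probability more strongly toward its own prediction, and a fixed threshold on the fused probability gives the final label.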

Weight and Stack use different approaches to integrate the advantages of heterogeneous models and improve predictive performance.
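The lower layer of the Stack method can be illustrated by the out-of-fold step below: each base model is trained on K-1 folds and predicts the held-out fold, so the upper-layer model is trained on probabilities that the base models did not see during fitting. The interface (trainer functions returning predictors), the interleaved folds, and the two toy base learners are assumptions made for this sketch and stand in for RF, XGBoost, or LightGBM.

```python
from typing import Callable, List, Sequence

Features = Sequence[Sequence[float]]
Predictor = Callable[[Features], List[float]]
Trainer = Callable[[Features, Sequence[int]], Predictor]


def out_of_fold_probabilities(
    X: Features, y: Sequence[int], trainers: Sequence[Trainer], k: int = 5
) -> List[List[float]]:
    """Build upper-layer inputs: one column of held-out probabilities per base model."""
    n = len(X)
    meta = [[0.0] * len(trainers) for _ in range(n)]
    # Interleaved folds; assumes the rows were shuffled beforehand.
    folds = [list(range(start, n, k)) for start in range(k)]
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in range(n) if i not in held]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        X_held = [X[i] for i in held_out]
        for j, trainer in enumerate(trainers):
            predict = trainer(X_train, y_train)
            for i, p in zip(held_out, predict(X_held)):
                meta[i][j] = p
    return meta


def prior_trainer(X: Features, y: Sequence[int]) -> Predictor:
    """Toy base learner: always predicts the training prevalence."""
    prior = sum(y) / len(y)
    return lambda rows: [prior for _ in rows]


def threshold_trainer(X: Features, y: Sequence[int]) -> Predictor:
    """Toy base learner: thresholds feature 0 at the midpoint of the class means."""
    pos = [row[0] for row, label in zip(X, y) if label == 1]
    neg = [row[0] for row, label in zip(X, y) if label == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda rows: [1.0 if row[0] > cut else 0.0 for row in rows]
```

The returned matrix, together with the labels, would be the training input of the upper-layer model; for the test set, the base models are typically refit on the full training set before predicting.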

We took three measures to avoid overfitting: (1) the prediction functions in our chosen Bagging model are generated in parallel, which reduces the variance of the model prediction; (2) we performed cross-validation within the training set and restricted a hyperparameter (the number of base learners) of the Bagging and Boosting models; (3) appropriate data cleaning, which improves the volume and comprehensiveness of the data distribution, helps to prevent overfitting. We also used a combination of under-sampling and over-sampling to balance the training set and to augment the data so that the model could be trained accurately.

Imbalanced data

In the dataset, the incidence of death within 7 days (high risk) is low, at 10.3%. This indicates class imbalance in our data, that is, a very large difference in sample size between the categories [23]. A model trained on an imbalanced dataset focuses on the majority class, which reduces its overall performance [24]. To decrease the cost of misclassification, we must address the misclassification of the high-risk category. We dealt with the imbalanced data in two ways: (1) we added hyperparameters that give higher class weights to the classes with few samples; (2) we introduced the SMOTEEN algorithm, based on the EE algorithm and the SMOTE algorithm, which combines oversampling and undersampling to address the shortage of minority class samples and noise interference [25]. We applied the SMOTEEN algorithm to the training set only, so that the model learns from balanced data and can cope with the imbalanced testing set. As a result, the number of training samples rose from 520 to 737 in the CTG and from 448 to 601 in the STG, and the ratio of high-risk to low-risk patients was approximately 1:1.
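Both remedies can be illustrated with a short sketch. The class-weight formula (inverse class frequency) is a common convention and an assumption here, as the text does not give one. The second function shows only the SMOTE-style oversampling step; the undersampling or cleaning stage of the combined algorithm is not reproduced. Function names and default parameters are hypothetical.

```python
import math
import random
from typing import Dict, List, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Assumed convention: weight = n_samples / (n_classes * class_count)."""
    counts: Dict[int, int] = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n, n_classes = len(labels), len(counts)
    return {label: n / (n_classes * count) for label, count in counts.items()}


def smote_oversample(
    minority: Sequence[Sequence[float]], n_new: int, k: int = 5, seed: int = 0
) -> List[List[float]]:
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic: List[List[float]] = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (other for other in minority if other is not base),
            key=lambda other: math.dist(base, other),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbour)])
    return synthetic
```

Libraries such as imbalanced-learn provide a combined SMOTE and Edited Nearest Neighbours resampler (`SMOTEENN`); whether that implementation corresponds to the algorithm named in the text is an assumption.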

Features and interpretability

ML is often a “black box”: clinical managers can see a model’s prediction results but not how they arise, which makes the models difficult for physicians to accept in clinical practice. To address the poor feature referencing and interpretability, we used the SHAP algorithm to interpret the risk stratification results and to provide feature importance and correlation analysis for the whole cohort or for individual patients [26]. To avoid conclusions that depend on one particular model, we applied the SHAP algorithm to both a standard ML model and a fusion model.
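SHAP attributions are based on Shapley values. The sketch below computes them exactly for a single sample by enumerating all feature orderings, which is feasible only for a handful of features; the SHAP library relies on faster model-specific estimators. The toy risk function, its coefficients, and the all-zero baseline are hypothetical and serve only to show the calculation.

```python
from itertools import permutations
from typing import Callable, List, Sequence


def exact_shapley_values(
    predict: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values for one sample.

    Features not yet added to the coalition keep their baseline value.
    The cost grows factorially with the number of features.
    """
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for feature in order:
            current[feature] = x[feature]
            value = predict(current)
            totals[feature] += value - previous  # marginal contribution
            previous = value
    return [total / len(orderings) for total in totals]


def toy_risk(features: Sequence[float]) -> float:
    """Hypothetical linear risk score with made-up coefficients."""
    return 0.1 + 0.5 * features[0] + 0.2 * features[1] - 0.3 * features[2]
```

For a linear score like `toy_risk`, each attribution equals the coefficient times the difference between the sample and the baseline, and the attributions always sum to the difference between the prediction for the sample and the prediction for the baseline.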

Statistical analysis

We used receiver operating characteristic (ROC) curves and precision-recall (PR) curves to evaluate the performance of each model. Because ROC curves do not reflect the true binary classification performance when the data are imbalanced, we combined them with the PR curves to assess the results for the minority class more comprehensively [27]. The area under the ROC curve (AUC/ROC) is reported as the summary measure of the ROC curve, and the area under the PR curve (AUC/PR) is approximated by the average precision score [28]. We also report the sensitivity (recall), specificity, accuracy, precision (PPV), negative predictive value (NPV), and F1 score for each model. All results were obtained with the hyperparameters recommended by fivefold cross-validation and by repeating the experiments. Descriptive statistics were computed for categorical and numerical variables. Numerical variables are presented as means (standard deviation [SD]) and categorical variables as numbers (%). Two groups were compared by the Wilcoxon rank sum test for paired data with a non-normal distribution and by a paired-sample t-test for data with a normal distribution. The significance level was set at P < 0.05.
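The reported metrics can be written out as follows. This is a minimal sketch with hypothetical function names: the AUC is computed as the probability that a high-risk patient receives a higher score than a low-risk patient, and the average precision formula assumes that no two patients share the same score.

```python
from typing import Dict, Sequence


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, accuracy, PPV, NPV and F1 for label 1 = high risk."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator: float, denominator: float) -> float:
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Mean precision at the rank of each positive (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")
```

With a 10.3% high-risk rate, a model can reach high accuracy and specificity while missing most high-risk patients, which is why sensitivity, PPV, and the PR-based summary are read alongside the ROC-based one.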