Contact strength

There are 94,733 Telecom base stations in Shanghai, each covering 0.0669 square kilometers on average. We say that two individuals’ trajectories coincide if their mobile phone signals interact with the same base station within the same time slice τ. In this paper, τ is set to 1/12 hour (5 minutes). If an individual coincides with a member of the high-risk group, that individual's risk of infection increases, and we call such contacts effective contacts. Mutual contacts within the general group do not generate new risks of infection, and we call such contacts invalid. To simplify the contact analysis, we concentrate on the effective contacts when identifying the regional transmission risks of infectious diseases.

Furthermore, we constructed the contact strength to quantify the influence of effective contacts. The frequency of effective contact is one determinant of the transmission risk: the longer an individual has been exposed to the high-risk group, measured here by the number of time slices with an effective contact, the more likely that individual is to be infected. Nevertheless, considering only the duration of effective contact is not enough. Because the individuals in the high-risk group have been to different epidemic hot zones, their likelihoods of carrying the virus differ, and we use a dynamic virus carrying risk coefficient to distinguish one from another. Thus, the contact strength is the product of the effective contact frequency and the virus carrying risk coefficient of high-risk individual \(h\),

$$\omega_{h \to i,d} = t_{h,i,d} \times \gamma_{h,d}$$ (1)

where \(\omega_{h \to i,d}\) represents the contact strength between individual \(i\) and high-risk individual \(h\) on day \(d\), which is the weight of the corresponding edge in the \(d\)th contact network, \(t_{h,i,d}\) is the number of effective contacts between individuals \(i\) and \(h\) on day \(d\), and \(\gamma_{h,d}\) is the virus carrying risk coefficient of individual \(h\) on day \(d\).
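Equation (1) can be illustrated with a minimal Python sketch. It assumes one day of signaling data is available as hypothetical `(user, station, slice_index)` records and that \(\gamma_{h,d}\) is already known for each high-risk user. The function names, the record layout and the choice to also count contacts between two high-risk users as effective are assumptions made for illustration, not details given in the text.

```python
from collections import defaultdict

def count_effective_contacts(records, high_risk):
    """Count t_{h,i,d} for one day.

    records   : iterable of (user, station, slice_index) tuples (assumed layout)
    high_risk : set of high-risk user ids
    Returns {(h, i): number of time slices in which h and i shared a station}.
    """
    # Group the users seen at each base station in each time slice.
    present = defaultdict(set)
    for user, station, slice_index in records:
        present[(station, slice_index)].add(user)

    counts = defaultdict(int)
    for users in present.values():
        # Only coincidences involving a high-risk user are effective contacts.
        for h in users & high_risk:
            for i in users:
                if i != h:
                    counts[(h, i)] += 1
    return dict(counts)

def contact_strength(counts, gamma):
    """Eq. (1): omega_{h->i,d} = t_{h,i,d} * gamma_{h,d}."""
    return {(h, i): t * gamma[h] for (h, i), t in counts.items()}

# Toy data (made up): h1 is high-risk; a, b and c belong to the general group.
records = [
    ("h1", "s1", 0), ("a", "s1", 0), ("b", "s1", 0),
    ("h1", "s1", 1), ("a", "s1", 1),
    ("b", "s2", 1), ("c", "s2", 1),  # general group only: invalid contact
]
counts = count_effective_contacts(records, high_risk={"h1"})
strength = contact_strength(counts, gamma={"h1": 3.0})
print(counts)
print(strength)
```

In this toy example the contact between `b` and `c` produces no edge, because neither user is in the high-risk group.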

The virus carrying risk coefficient \(\gamma_{h,d}\) of individual \(h\) is determined by the epidemic hot zone with the highest risk coefficient that \(h\) visited during the most recent viral incubation period \(T_{virus}\). First, we define the epidemic infection density \(\rho_{c}\) of epidemic hot city \(c\) as the ratio of the cumulative number of confirmed cases to the permanent resident population,

$$\rho_{c} = \frac{{Nc_{c} }}{{Np_{c} }}$$ (2)

where \(Nc_{c}\) is the cumulative number of confirmed cases in city \(c\), and \(Np_{c}\) is the permanent resident population of city \(c\) (unit: 10,000 people). We take the infection density \(\rho_{o}\) of the city whose transmission risk is to be estimated as the baseline and scale the other cities’ infection densities by it, which gives the risk coefficient for travelling to or living in city \(c\),

$$r_{c} = \frac{{\rho_{c} }}{{\rho_{o} }}$$ (3)

where \(r_{c}\) is the risk coefficient of travelling to or living in city \(c\), and \(r_{o}\) is the risk coefficient of the city to be estimated. By construction, \(r_{o} = 1\). The coefficient \(\gamma_{h,d}\) is then the maximum value of \(r_{c}\) over the cities in the historical trajectory of individual \(h\) during the \(T_{virus}\) days counting back from day \(d\).
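The chain from Eq. (2) to \(\gamma_{h,d}\) can be sketched in a few lines of Python. The case counts, populations, city names and the 14-day window below are made-up illustration values, and the exact window boundaries (days \(d - T_{virus} + 1\) through \(d\)) as well as the fallback of 1 for an empty history are assumptions rather than choices stated in the text.

```python
def infection_density(confirmed, population_10k):
    """Eq. (2): rho_c = Nc_c / Np_c, with Np_c in units of 10,000 people."""
    return {c: confirmed[c] / population_10k[c] for c in confirmed}

def risk_coefficients(density, baseline_city):
    """Eq. (3): r_c = rho_c / rho_o, so the baseline city has r_o = 1."""
    rho_o = density[baseline_city]
    return {c: rho / rho_o for c, rho in density.items()}

def carrying_risk(visits, r, day, t_virus):
    """gamma_{h,d}: largest r_c among cities visited in the last t_virus days.

    visits : list of (day_index, city) pairs for individual h (assumed layout)
    """
    window = [r[city] for d, city in visits if day - t_virus < d <= day]
    # Assumption: with no recorded visit, fall back to the baseline value 1.
    return max(window, default=1.0)

# Made-up numbers: "O" is the city to be estimated, "X" and "Y" are hot zones.
confirmed = {"O": 100, "X": 400, "Y": 200}
population_10k = {"O": 200, "X": 200, "Y": 200}
r = risk_coefficients(infection_density(confirmed, population_10k), "O")

visits = [(1, "X"), (10, "Y"), (20, "O")]
gamma_day_12 = carrying_risk(visits, r, day=12, t_virus=14)  # X still in window
gamma_day_20 = carrying_risk(visits, r, day=20, t_virus=14)  # X has dropped out
print(r, gamma_day_12, gamma_day_20)
```

The coefficient is dynamic in the sense that a visit to a high-density city stops contributing once it falls outside the window.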

Contact networks

To better simulate the risk of infectious diseases spreading through the crowd, we proposed a growing network based on the microscopic spatiotemporal contact details among individuals, which we call the contact network. In this contact network, every mobile phone user is a node. Two nodes are joined by an edge only when an effective contact occurs between them, and the weight of the edge is their contact strength. As shown in Fig. 1a, the red dots indicate individuals of the high-risk group and the green dots indicate individuals of the general group. At time T, there are two high-risk individuals under Station 1, and they have effective contacts (red lines) with the other people under the same station, while the other contacts are invalid (green lines). Under Station 2, all people belong to the general group, so there is no effective contact. Thus, each base station forms a sub-network. After a time slice τ, some individuals move from one station to another, and each base station then generates a new sub-network from the latest contacts. As people move across the base stations during the day, such sub-networks are generated continuously. At the end of the day, all of the effective contacts and the nodes they connect form a daily contact network. People who have had no contact with the high-risk group are therefore not included in the contact network.

Figure 1 Contact network structure. (a) Schematic diagram of sub-networks and the contact network, taking two base stations as an example. Note that this figure only shows the simulated trajectories of two high-risk individuals and seven general individuals during two time slices; in fact, each daily contact network is composed of the \(24/\tau\) sub-networks of every base station. (b) Visualization of the individual-centered contact feature sequence transformation. Before learning the transmission risk, the model takes each individual as an observation object and extracts the contact information of the adjacent nodes from the contact networks within \(T_{virus}\).

Because the contact network describes the possible paths of epidemic spreading in detail, we can further learn the transmission risk with an artificial neural network. The purpose of transmission risk learning is to identify individuals with higher potential infection risk and to estimate the corresponding probabilities. Here we mainly consider the first layer of virus transmission risk, that is, infection between adjacent nodes in the contact network. Therefore, as shown in Fig. 1b, all contact networks within the most recent \(T_{virus}\) days are transformed into individual-centered single-layer networks. \(T_{virus}\) is the latent period of the infection, so selecting the contact networks within \(T_{virus}\) takes the potential risk of carrying the virus into account. We then extract contact feature sequences from those single-layer networks as the input of the artificial neural network. Each contact feature sequence consists of two element sequences: \(TF\), which represents the total contact strength, and \(K\), which indicates whether the individual has been in contact with confirmed cases,

$$TF_{i,j,d} = \mathop \sum \limits_{{h \in H_{i,j,d} }} \omega_{h \to i,d}$$ (4)

$$K_{i,j,d} = \left\{ {\begin{array}{*{20}c} 1 & {if\ there\ are\ confirmed\ cases\ in\ H_{i,j,d} } \\ 0 & {otherwise} \\ \end{array} } \right.$$ (5)

where \(i\) is an individual, \(j\) is a municipal district of the city to be estimated and \(d\) is the day. Thus, \(TF_{i,j,d}\) indicates the contact intensity between individual \(i\) and the high-risk group in area \(j\) on day \(d\), which is the sum of the edge weights of the corresponding nodes in the contact network. \(H_{i,j,d}\) is the subset of the high-risk group that had effective contact with individual \(i\) in area \(j\) on day \(d\). If there is a confirmed case in subset \(H_{i,j,d}\), then \(K_{i,j,d}\) equals 1; otherwise it is 0.
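Equations (4) and (5) amount to a sum and a membership test over \(H_{i,j,d}\). The sketch below assumes the edge weights of one day are stored in a dictionary keyed by `(h, i)` and that the sets \(H_{i,j,d}\) are stored in a dictionary keyed by `(i, j)`. Both layouts, the function name and the toy values are hypothetical.

```python
def contact_features(omega_d, high_risk_by_area, confirmed):
    """Compute (TF_{i,j,d}, K_{i,j,d}) for one day d.

    omega_d           : {(h, i): contact strength omega_{h->i,d}}
    high_risk_by_area : {(i, j): set H_{i,j,d} of high-risk contacts of i in j}
    confirmed         : set of confirmed cases
    """
    features = {}
    for (i, j), contacts in high_risk_by_area.items():
        # Eq. (4): total contact strength over the high-risk contacts.
        tf = sum(omega_d[(h, i)] for h in contacts)
        # Eq. (5): 1 if any of those contacts is a confirmed case, else 0.
        k = 1 if contacts & confirmed else 0
        features[(i, j)] = (tf, k)
    return features

# Toy values (made up): h2 is a confirmed case, h1 is not.
omega_d = {("h1", "a"): 6.0, ("h2", "a"): 1.5, ("h1", "b"): 3.0}
high_risk_by_area = {
    ("a", "Pudong"): {"h1", "h2"},
    ("b", "Xuhui"): {"h1"},
}
features = contact_features(omega_d, high_risk_by_area, confirmed={"h2"})
print(features)
```

Repeating this for each of the \(T_{virus}\) daily networks would give the feature sequences described above.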

Artificial neural network of extreme events

An artificial neural network is used to learn the epidemic transmission risk. After completing the feature transformation of the contact network nodes, we calculate the cross term of contact intensity \(TF\) and contact tag \(K\). The three variables (\(TF\), \(K\) and their cross term) are standardized and then used as the input variables of the neural network. We then label the high-risk people by the source of their potential risk. Isolated people are divided into two categories according to whether they had a sojourn in an epidemic hot zone. For people who have not been to an epidemic hot zone, the infection risk comes from contacts in the observed area. In contrast, for people who have been to an epidemic hot zone, the regional transmission risk comes from the inflow of people from epidemic hot zones. The remaining individuals of the high-risk group are labeled as the third category.
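The construction of the three input variables can be sketched as follows. The sketch uses max–min scaling, which is the form of standardization named in the caption of Fig. 2; mapping a constant column to zeros and the function names are assumptions made for illustration.

```python
def min_max_scale(values):
    """Rescale a column linearly to the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant column carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def build_inputs(tf, k):
    """Return one row [TF, K, TF*K] per individual, each column scaled."""
    cross = [t * flag for t, flag in zip(tf, k)]  # cross term of TF and K
    columns = [min_max_scale(col) for col in (tf, k, cross)]
    return [list(row) for row in zip(*columns)]

# Toy values (made up) for three individuals.
rows = build_inputs(tf=[0.0, 2.0, 4.0], k=[0, 1, 0])
print(rows)
```

The cross term is non-zero only for individuals whose high-risk contacts include a confirmed case, so it lets the network weight contact strength differently for those individuals.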

As shown in Fig. 2, the basic framework of the network is fully connected and adopts leaky ReLU as the activation function to reduce the number of silent neurons. However, isolation is an extreme event, that is, the proportion of positively labeled data in the dataset is very low. The high-risk group accounts for a small share of the total population, and those who are isolated account for even less. Because the three classes are imbalanced, the neural network must be adjusted for multi-class training (refs. 61, 62, 63). Therefore, to avoid prediction errors on the true positive cases caused by training on imbalanced data, the neural network adopts a weighted cross entropy \(L\left( {Y,P} \right)\) as the loss function for extreme event learning,

$$L\left( {Y,P} \right) = - \frac{1}{N}\mathop \sum \limits_{i} \mathop \sum \limits_{k} w_{k} y_{i,k} \log p_{i,k}$$ (6)

where \(N\) is the size of the training sample and \(k\) indexes the classes. \(y_{i,k}\) indicates whether individual \(i\) belongs to class \(k\): it is 1 if so and 0 otherwise. \(p_{i,k}\) is the predicted probability that individual \(i\) belongs to class \(k\), and \(w_{k}\) is the weight of class \(k\).

Figure 2 Artificial neural network structure. Visualization of neural network learning. After max–min scaling of the input variables, the contact features are learned by a fully connected neural network. During model training, some neurons (drawn dotted) are temporarily dropped from the network with a certain probability, so that the network avoids overfitting and generalizes better.

This loss function gives larger weights to the rarer categories: the \(w_{k}\) of the two isolated groups are larger, which raises the cost of misclassifying these rare categories, so that the neural network learns useful information more effectively and achieves better prediction results.

After the raw outputs of the neural network are normalized by Softmax, the probability that individual \(i\) belongs to each class is obtained. The class with the largest probability is the predicted class of individual \(i\).
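The Softmax normalization, the weighted cross entropy of Eq. (6) and the final class prediction can be written out in plain Python as a minimal sketch. The class weights and probabilities below are arbitrary illustration values, since the text does not state how \(w_{k}\) is chosen, and the labels are given as class indices rather than one-hot vectors, which yields the same sum.

```python
import math

def softmax(logits):
    """Normalize raw network outputs into class probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_cross_entropy(labels, probs, weights):
    """Eq. (6) with y_{i,k} one-hot: only the true class of each sample counts.

    labels  : list of true class indices, one per individual
    probs   : list of per-class probability lists, one per individual
    weights : list of class weights w_k
    """
    total = 0.0
    for y, p in zip(labels, probs):
        total += weights[y] * math.log(p[y])
    return -total / len(labels)

def predict(p):
    """Return the index of the class with the largest probability."""
    return max(range(len(p)), key=lambda k: p[k])

# Arbitrary example: classes 0 and 1 stand for the two rare isolated groups
# and receive larger weights than class 2.
weights = [5.0, 5.0, 1.0]
probs = [softmax([2.0, 0.5, 0.1]), softmax([0.0, 0.0, 3.0])]
labels = [0, 2]
loss = weighted_cross_entropy(labels, probs, weights)
print(loss, [predict(p) for p in probs])
```

With these weights, assigning low probability to a true member of class 0 or 1 costs five times as much as the same error on class 2.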

Estimation of regional transmission risk

The main residence of each individual is taken to be the region where their mobile phone signal is most frequently located during the night. We can thus divide people into groups according to their residences. The risk of infectious disease transmission in a region comes from the activities of the people living there.
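The residence rule is a most-frequent-value computation. The sketch below assumes each user's night-time records have already been reduced to a list of region names; which hours count as night is not stated in the text, so that filtering step is left out, and the function name is hypothetical.

```python
from collections import Counter

def main_residence(night_regions):
    """Return the region observed most often at night, or None if no data.

    night_regions : list of region names, one per night-time signal record
                    (assumed to be filtered to night hours beforehand)
    """
    if not night_regions:
        return None
    # most_common(1) returns a list holding the single (region, count) pair
    # with the highest count.
    return Counter(night_regions).most_common(1)[0][0]

# Toy records (made up) for one user.
home = main_residence(["Pudong", "Xuhui", "Pudong", "Pudong", "Minhang"])
print(home)
```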

Since we have labeled the high-risk people with three categories and used multi-class learning to estimate how likely each person is to belong to each category, the risk due to the inflow of people from epidemic hot zones and the risk due to close contacts are defined as the average probabilities of the corresponding labels over the individuals residing in the region,

$$TR_{s}^{{\left( {inflow} \right)}} = \frac{1}{{N_{s} }}\mathop \sum \limits_{i = 1}^{{N_{s} }} p_{i,s}^{{\left( {ehz} \right)}}$$ (7)

$$TR_{s}^{{\left( {contact} \right)}} = \frac{1}{{N_{s} }}\mathop \sum \limits_{i = 1}^{{N_{s} }} p_{i,s}^{{\left( {non} \right)}}$$ (8)

where \(s\) is the region whose risk is to be assessed and \(N_{s}\) is the number of individuals residing in \(s\). \(p_{i,s}^{{\left( {ehz} \right)}}\) is the probability of disease transmission from epidemic hot zones caused by individual \(i\), and \(TR_{s}^{{\left( {inflow} \right)}}\) represents the risk caused by the inflow of people from epidemic hot zones. Similarly, \(p_{i,s}^{{\left( {non} \right)}}\) is the probability of disease transmission caused by individual \(i\) who has not been to the epidemic hot zones, and \(TR_{s}^{{\left( {contact} \right)}}\) represents the risk caused by close contacts within the observed region. Both \(TR_{s}^{{\left( {inflow} \right)}}\) and \(TR_{s}^{{\left( {contact} \right)}}\) lie between 0 and 1, because they are averages of probabilities, and larger values mean higher regional transmission risks.

Because of the properties of the Softmax function, the probabilities of the no-risk class and the two risk classes are additive and sum to one for each individual. Thus, the total transmission risk can be derived from \(TR_{s}^{{\left( {inflow} \right)}}\) and \(TR_{s}^{{\left( {contact} \right)}}\),

$$TR_{s} = TR_{s}^{{\left( {inflow} \right)}} + TR_{s}^{{\left( {contact} \right)}} = \frac{1}{{N_{s} }}\mathop \sum \limits_{i = 1}^{{N_{s} }} \left( {p_{i,s}^{{\left( {ehz} \right)}} + p_{i,s}^{{\left( {non} \right)}} } \right)$$ (9)

\(TR_{s}\) likewise ranges from zero to one. Because this is a bottom-up indicator, the regional transmission risk rises if individuals are more likely to be classified into the potentially isolated groups. In contrast, if most individuals are predicted to belong to the non-isolated group, the regional transmission risk decreases.
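The regional indicators of Eqs. (7) to (9) are plain averages of the per-individual class probabilities. The sketch below assumes the model output for the residents of one region is available as a list of hypothetical `(p_ehz, p_non)` pairs; the probabilities are made-up illustration values.

```python
def regional_risk(resident_probs):
    """Return (TR_inflow, TR_contact, TR_total) for one region s.

    resident_probs : list of (p_ehz, p_non) pairs, one per resident of s
    """
    n_s = len(resident_probs)
    tr_inflow = sum(p_ehz for p_ehz, _ in resident_probs) / n_s    # Eq. (7)
    tr_contact = sum(p_non for _, p_non in resident_probs) / n_s   # Eq. (8)
    return tr_inflow, tr_contact, tr_inflow + tr_contact           # Eq. (9)

# Made-up probabilities for a region with four residents.
resident_probs = [(0.5, 0.25), (0.0, 0.25), (0.25, 0.0), (0.25, 0.5)]
tr_inflow, tr_contact, tr_total = regional_risk(resident_probs)
print(tr_inflow, tr_contact, tr_total)
```

Because each pair of probabilities sums to at most one, the total also stays within the unit interval.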

Data

We extracted China Telecom’s mobile signaling data in Shanghai from January 22 to February 4, 2020 to capture the users’ real-time trajectories. We divided these 7,451,621 mobile phone users into a high-risk group and a general group according to their epidemiological diagnoses and historical trajectories. The high-risk group includes four kinds of people: confirmed cases, suspected cases, other people under medical isolation, and people who had a sojourn in an epidemic hot zone. Considering the features of epidemic transmission and population flow in the early stage of COVID-19, forty-eight cities in China, including Wuhan and Wenzhou, were marked as high-risk epidemic hot zones (see more details in Supplementary Information). We then identified 735,546 high-risk users in Shanghai based on mobile phone tracking during this period. All remaining mobile phone users belonged to the general group.

As of February 4, 2020, there were 22,501 people on the isolation list provided by the Shanghai Center for Disease Control and Prevention, covering eight districts in Shanghai: Baoshan, Chongming, Hongkou, Huangpu, Minhang, Pudong, Songjiang and Xuhui. Among them, 2,459 isolated individuals on the list were effectively matched, accounting for only 0.3343% of the Telecom high-risk users. Of these matched individuals, 1,742 had a sojourn in an epidemic hot zone and 717 did not leave Shanghai during the observation period, accounting for 0.2368% and 0.0975% of the Telecom high-risk users respectively. For those users, being isolated was indeed a rare event.