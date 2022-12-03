This section discusses the case study area and sources of data, then provides a short review of aspatial and spatial linear models, random forests and neural networks. A new spatial random forest method is also introduced in this section.

Study area

Queensland is the second largest and third most populous Australian State or Territory, and is located in the northeast of the country. With strengths in mining, agriculture, tourism, international education, insurance, and banking, Queensland also has the third largest economy [16, 17]. The State is divided geographically into 528 non-overlapping statistical area level 2 (SA2) regions (according to the ASGS 2016 boundaries of the Australian Bureau of Statistics, ABS). SA2 regions are medium-sized general-purpose areas designed to represent a community that interacts together socially and economically (www.abs.gov.au). The SA2 is the smallest area for the release of ABS non-census and inter-censal statistics, including the estimated resident population and health data, and data from the 2016 Census of Population and Housing.

In this study, health and socio-demographic data are obtained at the SA2 level for 526 SA2s, excluding those with zero population and with offshore/migratory or undefined location.

The data repository

The outcome variables considered in this study were development vulnerabilities, provided by the Australian Early Development Census (AEDC). The AEDC takes place every three years and is the world’s most extensive collection of data on children. Classroom teachers complete the census for their students in their first year of full-time school, and their answers are used to construct domain scores. Each child is given a score between zero and ten for each of the AEDC domains. Using the cut-offs established as a baseline in 2009, children falling below the \(10^{\text{th}}\) percentile in a domain, taking age differences into account, are categorised as “developmentally vulnerable”. In Queensland, the percentage of children who were developmentally vulnerable in at least one domain in 2018 was around 26%, and the overall percentage of attendance at preschool was around 75.4%. These are the lowest rates among all states and territories of Australia. There is also substantial geographic variation in rates across the state.

In this study, the outcome variable of interest is the SA2-level development vulnerability score for each domain, which is the age-matched proportion of developmentally vulnerable children in the SA2. Five domain vulnerabilities were considered: physical health and well-being domain vulnerability (PHD), social competence domain vulnerability (SCD), emotional maturity domain vulnerability (EMD), language and cognitive skills domain vulnerability (LCS), and communication skills and general knowledge domain vulnerability (CS). Two summary indicators were also considered: vulnerable on one or more domains (VOD), and vulnerable on two or more domains (VTD).

The covariate information was extracted from the ABS and AEDC for each SA2. The covariates of interest obtained from the ABS included a geographic remoteness category, a Socio-Economic Indexes for Areas (SEIFA) score, specifically the Index of Relative Socio-Economic Disadvantage (IRSD), mother’s language, country of birth, Indigenous status, and attendance at preschool. These covariates are also gathered as part of the AEDC survey and aggregated for research purposes.

The ABS classifies geographical remoteness as major city, inner regional, outer regional, remote, or very remote. In Queensland, 294 SA2 areas are categorised as major cities, 113 as inner regional, 96 as outer regional, 11 as remote, and 14 as very remote [18].

The SEIFA score is a broad socioeconomic index that summarises a variety of data on individual and family economic and social conditions in a given area. This factor is coded from 1 to 10. A low score suggests that the area is, in general, disadvantaged; for example, it may have many low-income households, or many people without qualifications or in low-skill occupations.

Binary classifications were used for mother’s language (English, other), Indigenous status (Indigenous, non-Indigenous), country of birth (Australia, not Australia) and attendance at preschool (yes, no).

The data custodians listed the above data over different time periods. This study used only the latest publicly available data, from the 2018-2019 census. All count covariates have been transformed into proportions of children in an SA2 region with the feature of interest. Between 3% and 6% of values were missing across the variables in the dataset. Missing continuous data were imputed using spatial neighbourhood averages. For categorical data, the imputed value was instead the most frequent category among neighbouring regions. In two instances, missing values for two islands could not be filled, as the regions have no contiguous neighbours. As a result, the analysis carried out in this study was reduced to the remaining 526 SA2 regions.
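The neighbourhood-based imputation described above can be illustrated with a minimal sketch. The function name, the dictionary-based data layout, and the toy region identifiers are assumptions made for illustration only; they are not taken from the study's code, and the tie-breaking rule for categorical values is likewise an assumption.

```python
from collections import Counter
from statistics import mean


def impute_from_neighbours(values, neighbours, categorical=False):
    """Fill missing (None) entries from the observed values of contiguous neighbours.

    values:     dict mapping region id -> observed value, or None if missing
    neighbours: dict mapping region id -> list of contiguous region ids
    """
    filled = dict(values)
    for region, value in values.items():
        if value is not None:
            continue
        observed = [values[n] for n in neighbours.get(region, [])
                    if values.get(n) is not None]
        if not observed:
            # No contiguous neighbours with data (e.g. an island): stays missing.
            continue
        if categorical:
            # Most frequent neighbouring category; ties go to the first one seen.
            filled[region] = Counter(observed).most_common(1)[0][0]
        else:
            # Average of the neighbouring values.
            filled[region] = mean(observed)
    return filled


# Hypothetical toy example: region "D" is an island without neighbours.
toy_neighbours = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"], "D": []}
toy_score = {"A": None, "B": 4.0, "C": 6.0, "D": None}
toy_remoteness = {"A": None, "B": "remote", "C": "remote", "D": None}

filled_score = impute_from_neighbours(toy_score, toy_neighbours)
filled_remoteness = impute_from_neighbours(toy_remoteness, toy_neighbours,
                                           categorical=True)
```

In this sketch, regions that remain missing after imputation would be dropped from the analysis, mirroring the treatment of the two islands.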

Overall measures of spatial variation

Moran’s I [19] and Geary’s C [12] are popular measures for determining whether data are geographically clustered, randomly distributed, or uniformly distributed in space. Moran’s I ranges between -1 and 1, where -1 indicates perfect dispersion (neighbouring values are dissimilar), 0 indicates no spatial autocorrelation, and 1 indicates perfect clustering of similar values. The semi-variogram, which depicts the range and rate at which spatial autocorrelation decreases, is another tool for measuring spatial dependency [20]. The semi-variance of a dataset with spatial autocorrelation typically grows to a maximum value before levelling off.
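For concreteness, the two global statistics can be computed as in the following sketch, which uses the standard textbook definitions with a spatial weight matrix supplied as nested lists. The function names and the four-region chain example are illustrative assumptions, and no significance testing is included.

```python
def morans_i(x, w):
    """Global Moran's I for values x and spatial weight matrix w (list of lists)."""
    n = len(x)
    x_bar = sum(x) / n
    dev = [xi - x_bar for xi in x]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(d * d for d in dev)


def gearys_c(x, w):
    """Global Geary's C; values below 1 suggest positive spatial autocorrelation."""
    n = len(x)
    x_bar = sum(x) / n
    s0 = sum(sum(row) for row in w)
    sq_diff = sum(w[i][j] * (x[i] - x[j]) ** 2 for i in range(n) for j in range(n))
    return ((n - 1) / (2 * s0)) * sq_diff / sum((xi - x_bar) ** 2 for xi in x)


# Hypothetical example: four regions in a chain, binary contiguity weights.
chain_w = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
smooth = [1.0, 2.0, 3.0, 4.0]       # neighbouring values are similar
alternating = [1.0, 4.0, 1.0, 4.0]  # neighbouring values are dissimilar

i_smooth = morans_i(smooth, chain_w)
i_alternating = morans_i(alternating, chain_w)
c_smooth = gearys_c(smooth, chain_w)
```

The smooth gradient yields a positive Moran's I, whereas the alternating pattern yields a negative one, which matches the interpretation of the statistic given above.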

Tango’s maximized excess events test (MEET) [21] is another way to detect spatial variation in the data. The test depends on a weight function and considers a range of spatial scale parameters, adjusting for the resulting multiple testing. Tango [22] proposed a distance-based exponential weight function for MEET, but other choices of weights are also possible. MEET has been shown to have very good statistical power in detecting global disease clustering [21]. For more details see the Appendix.

Statistical machine learning algorithms

Random forests for spatial data

A number of approaches have been proposed for applying a random forest to spatial data. Longitude and latitude were introduced as covariates in several efforts to integrate a spatial context into machine learning [13, 23, 24]. For example, Behrens [13] used x- and y-coordinates and distances to the corners and center of a bounding box around the sampling locations as covariates. Random Forest for Spatial Prediction (RFsp) was developed by Hengl [9], and uses buffer distance maps from observation points as covariates. In the next section we discuss another popular approach, the geographical random forest (GRF).

Geographical random forests

The GRF is a disaggregation of the random forest (RF) into several local sub-models [14]. It uses a similar idea to geographically weighted regression (GWR) [25]. Here, a local RF is computed for each location i based only on nearby observations. Thus, an RF is developed for each training data point, each with its own efficiency, predictive ability, and feature importance. As a result, the stability of the RF is measured locally rather than globally.

A GRF can be used to achieve two goals: firstly to enhance predictions over a standard RF, and secondly to extract spatially differentiated model parameter inferences. The degree of spatial variation in the data and the required bandwidth selection determine the increase in efficiency. Moreover, a GRF model can be used as a simple guide to investigate the data’s local structure and improve our understanding of how spatial processes affect this structure. For more details see the Appendix.
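The local sub-model idea can be sketched as follows. This is only an outline of the bookkeeping, under several assumptions: the neighbourhood is an adaptive bandwidth of the k nearest observations, prediction uses the sub-model of the nearest training location, and the learner is passed in as a function. All names are hypothetical. To keep the sketch dependency-free, a local-mean learner stands in for the random forest; in a real GRF the learner would fit a random forest on each local subset.

```python
import math


def k_nearest(coords, i, k):
    """Indices of the k observations closest to location i (including i itself)."""
    order = sorted(range(len(coords)),
                   key=lambda j: math.dist(coords[i], coords[j]))
    return order[:k]


def fit_local_models(coords, X, y, k, fit):
    """Fit one local sub-model per training location on its k nearest observations."""
    models = []
    for i in range(len(coords)):
        idx = k_nearest(coords, i, k)
        models.append(fit([X[j] for j in idx], [y[j] for j in idx]))
    return models


def predict_local(models, coords, new_coord, new_x):
    """Predict with the sub-model of the training location nearest to new_coord."""
    nearest = min(range(len(coords)),
                  key=lambda j: math.dist(coords[j], new_coord))
    return models[nearest](new_x)


def fit_mean(X_local, y_local):
    """Placeholder learner (NOT a random forest): predicts the local mean response."""
    local_mean = sum(y_local) / len(y_local)
    return lambda x: local_mean


# Hypothetical toy data: two spatial clusters with different response levels.
toy_coords = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
toy_X = [[1.0], [2.0], [3.0], [4.0]]
toy_y = [1.0, 3.0, 10.0, 20.0]

local_models = fit_local_models(toy_coords, toy_X, toy_y, k=2, fit=fit_mean)
pred_south_west = predict_local(local_models, toy_coords, (0.0, 0.4), [1.5])
pred_north_east = predict_local(local_models, toy_coords, (9.0, 9.0), [3.5])
```

Because each sub-model sees only nearby observations, the two predictions differ even though a single global model would pool all four points; the choice of k plays the role of the bandwidth discussed above.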

Neural networks for spatial data

One way of using neural networks (NN) for spatial data is to use the longitude and latitude as covariates. We call this method a spatial neural network (SNN). Another recent extension of NNs for spatial data is the geographically weighted artificial neural network (GWANN) [26]. Each output neuron of a GWANN has a geographic location associated with it. This allows the spatial distances between the observations and the output neuron’s location to be calculated. As a result, the connection weights between the hidden and output layers can be understood as a geographically weighted regression (GWR) model when estimated using a geographically weighted error function.

Garson [27] devised a method for calculating the relative importance of each of the input variables based on the connection weights. In this algorithm each variable’s input is stored as a weight in the network model, and the contribution of each of these variables to the output is largely determined by the magnitude and direction of these link weights. A positive connection weight enhances the magnitude of the network output, whereas a negative weight suppresses the value of the response variable [28]. For more details see the Appendix.
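As an illustration, one commonly cited form of Garson's algorithm for a network with a single hidden layer and a single output can be sketched as below. Variants of the algorithm differ in how the hidden-to-output weights enter the normalisation, so this form is an assumption rather than the exact procedure used in this study. Note that this calculation works with absolute values of the weights, so it captures the magnitude of each input's contribution but not its direction. The function name and the example weights are hypothetical.

```python
def garson_importance(w_input_hidden, w_hidden_output):
    """Relative importance of each input from the connection weights.

    w_input_hidden[i][j]: weight from input i to hidden neuron j
    w_hidden_output[j]:   weight from hidden neuron j to the single output
    Returns a list of importances that sums to 1.
    """
    n_inputs = len(w_input_hidden)
    n_hidden = len(w_hidden_output)
    contributions = [0.0] * n_inputs
    for j in range(n_hidden):
        # Total absolute input weight arriving at hidden neuron j.
        column_total = sum(abs(w_input_hidden[i][j]) for i in range(n_inputs))
        if column_total == 0:
            continue  # hidden neuron receives no signal from any input
        for i in range(n_inputs):
            share = abs(w_input_hidden[i][j]) / column_total
            contributions[i] += share * abs(w_hidden_output[j])
    grand_total = sum(contributions)
    return [c / grand_total for c in contributions]


# Hypothetical weights: two inputs, two hidden neurons, one output.
toy_w_ih = [[1.0, -2.0],
            [3.0, 2.0]]
toy_w_ho = [0.5, -1.0]

toy_importance = garson_importance(toy_w_ih, toy_w_ho)
```

In this toy example the second input receives the larger share, because it dominates the weights into the first hidden neuron while both inputs contribute equally to the second.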

Linear models for spatial data

The linear model can be extended to include non-normal responses via a generalised linear model (GLM), or additive terms via a generalised additive model (GAM). A spatial GLM or a spatial GAM is another way to model spatial data. These regression techniques support non-Gaussian error distributions and non-linear relationships between response and predictor variables.

In the simplest form, latitude and longitude can be used as model inputs [29].

The spatial autoregressive (SAR) model proposed by Whittle [30] is a spatial approach for describing the connection between dependent and independent variables by taking the spatial effect into account. It features an autoregressive structure that represents the spatial dependency of the attributes using a precision matrix that is generally a function of the proximity between regions [31]. Moran’s I [19] can be used to confirm the presence of spatial variation before the SAR model is used. Weights are used to indicate the impact of location effects on the data [32]. For more details see the Appendix.

Conditional Autoregressive Model (CAR)

Bayesian models are especially well adapted to spatial modelling because the information particular to each region may be represented as priors, and both correlated and uncorrelated spatial effects can be investigated [33]. For more details see the Appendix.

None of the aforementioned algorithms can explain the spatial autocorrelation. Spatial autocorrelation in data can inflate bias in statistical analyses [15, 34, 35]. Failing to appropriately address this issue will likely lead to three major statistical problems. First, the standard errors might be underestimated, which makes the regression model itself unreliable [36, 37]. Second, parameter estimates, such as the regression coefficients, might be biased [38]. Inflation or deflation of predictors’ coefficients will lead to over- or under-estimation, respectively, of their predictive power [39].