Introduction

The mgwrsar package is designed to estimate linear and local linear models, with or without spatial autocorrelation. While the package handles complex Spatial Autoregressive (SAR) specifications, this vignette focuses on canonical Geographically Weighted Regression (GWR) and Mixed-GWR (where some coefficients are constant over space) without a spatial lag term.

The general formulations are:

GWR Model: \[ y_i = \beta_0(u_i,v_i) + \sum_{k=1}^p \beta_k(u_i,v_i) X_{ik} + \epsilon_i \]

Mixed-GWR Model: \[ y_i = \sum_{j=1}^q \gamma_j Z_{ij} + \beta_0(u_i,v_i) + \sum_{k=1}^p \beta_k(u_i,v_i) X_{ik} + \epsilon_i \]

Where: * \((u_i, v_i)\) are the coordinates of observation \(i\). * \(\beta_k(u_i,v_i)\) are spatially varying coefficients. * \(\gamma_j\) are constant (fixed) coefficients for variables \(Z\).

For a deeper understanding of the estimation methods (including SAR specifications), please refer to Geniaux, G. and Martinetti, D. (2018).

Key Features of the MGWRSAR function

The estimation is performed using the MGWRSAR wrapper function. Key capabilities include:

  • Input Data: Accepts data.frame with a coordinates matrix, or spatial objects like sf, SpatialPointsDataFrame, etc.
  • Bandwidth Optimization: The golden_search_bandwidth function identifies the optimal bandwidth using AICc (corrected Akaike Information Criterion) or CV (Cross-Validation).
  • Kernels: Supports various kernels (Gaussian, Bisquare, Tricube, etc.) for both spatial (Type='GD') and spatio-temporal (Type='GDT') data.
  • Scalability: For large datasets, the package employs Rcpp and RcppEigen for speed, supports parallel computing via doParallel, and offers “Target Points” estimation strategies (see vignette ‘Speeding_up_GWR_like_models’ and Geniaux (2024)).

Data Example

We use the mydatasf dataset included in the package. It contains real estate data (single houses) from the South of France with the following variables:

  • price: Price in euros (Dependent Variable).
  • year: Year of sale.
  • footage: Total living area.
  • land_area: Area of the land plot.
  • pbb45: Percentage of houses built before 1945 in the surrounding km².
## Loading data example
data(mydatasf)

# Quick visualization of the dependent variable
plot(mydatasf["price"], main = "House Prices Distribution", pch = 19, cex = 0.6)


# Prepare data for usage
mydata <- st_drop_geometry(mydatasf)
coords <- st_coordinates(mydatasf)

GWR: Bandwidth Optimization and Estimation

Estimating a GWR model requires selecting an optimal bandwidth. We use golden_search_bandwidth to find this optimum based on a specific criterion (AICc or CV).

1. Non-Adaptive Kernel (Fixed Distance)

Here we search for an optimal distance (in map units, e.g., meters) using a Bisquare kernel and AICc.

# Bandwidth search: Non-adaptive (fixed distance), Bisquare kernel, AICc
res_AIC_fixed <- search_bandwidths(formula ='price ~ year + footage + land_area',data = mydata,coords = coords,kernels = c("bisq"),Model = "GWR",control = list( adaptive = FALSE, criterion = 'AICc', verbose = FALSE),hs_range = c(100,50000))

cat("Optimal Bandwidth (Distance):", res_AIC_fixed$best_model@H, "\n")
#> Optimal Bandwidth (Distance): 735.4214
cat("Computation Time:", res_AIC_fixed$ctime, "s\n")
#> Computation Time: 3.946 s

# The function returns the best model directly
model_GWR_fixed <- res_AIC_fixed$best_model
summary(model_GWR_fixed)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = formula, data = data, coords = coords, fixed_vars = fixed_vars, 
#>     kernels = kernels, H = c(res, Ht), Model = Model, control = control_o)
#> 
#> Model: GWR 
#> ------------------------------------------------------
#> Kernels Configuration:
#>    Spatial Kernel  : bisq (Fixed / Distance) 
#> ------------------------------------------------------
#> Bandwidth Configuration:
#>    Spatial Bandwidth (H): 735.42 
#> ------------------------------------------------------
#> Model Settings:
#>    Computation time: 0.076 sec
#>    Use of Target Points: NO 
#>    Use of rough kernel approximation: NO 
#>    Parallel computing: (ncore = 1) 
#>    Number of data points: 1403 
#> ------------------------------------------------------
#> Coefficients Summary:
#>    [Varying Parameters]
#>    Intercept             year           footage        land_area      
#>  Min.   :-2303764   Min.   :-53026   Min.   :-5253   Min.   :-496.80  
#>  1st Qu.:    2676   1st Qu.:  1924   1st Qu.: 1391   1st Qu.:  30.41  
#>  Median :   40071   Median :  3394   Median : 1760   Median :  40.41  
#>  Mean   :   16867   Mean   :  4598   Mean   : 1933   Mean   :  46.69  
#>  3rd Qu.:   61272   3rd Qu.:  6199   3rd Qu.: 2286   3rd Qu.:  47.94  
#>  Max.   : 2091491   Max.   :110561   Max.   : 7429   Max.   : 401.55  
#> ------------------------------------------------------
#> Diagnostics:
#>    Effective degrees of freedom: 1238.51 
#>    AICc: 36609.91 
#>    Residual sum of squares: 1.352877e+13 
#>    RMSE: 98197.49 
#> ------------------------------------------------------

2. Adaptive Kernel (Number of Neighbors)

Here we search for an optimal number of neighbors using a Gaussian kernel and AICc.

# Bandwidth search: Adaptive (neighbors), Gaussian kernel, AICc

res_AIC_adaptive <- search_bandwidths(formula ='price ~ year + footage + land_area',data = mydata,coords = coords,kernels = c("gauss"),Model = "GWR",control = list( adaptive = TRUE, criterion = 'AICc', verbose = FALSE,NN=500),hs_range = c(5,500))


cat("Optimal Bandwidth (Neighbors):", res_AIC_adaptive$minimum, "\n")
#> Optimal Bandwidth (Neighbors): 12
cat("Computation Time:", res_AIC_adaptive$ctime, "s\n")
#> Computation Time: 1.887 s

model_GWR_adaptive_aic <- res_AIC_adaptive$best_model
summary(model_GWR_adaptive_aic)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = formula, data = data, coords = coords, fixed_vars = fixed_vars, 
#>     kernels = kernels, H = c(res, Ht), Model = Model, control = control_o)
#> 
#> Model: GWR 
#> ------------------------------------------------------
#> Kernels Configuration:
#>    Spatial Kernel  : gauss (Adaptive / Neighbors) 
#> ------------------------------------------------------
#> Bandwidth Configuration:
#>    Spatial Bandwidth (H): 12 
#> ------------------------------------------------------
#> Model Settings:
#>    Computation time: 0.062 sec
#>    Use of Target Points: NO 
#>    Use of rough kernel approximation: YES (500 neighbors) 
#>    Parallel computing: (ncore = 1) 
#>    Number of data points: 1403 
#> ------------------------------------------------------
#> Coefficients Summary:
#>    [Varying Parameters]
#>    Intercept            year           footage          land_area      
#>  Min.   :-534813   Min.   :-10861   Min.   :  20.66   Min.   : -26.13  
#>  1st Qu.: -23587   1st Qu.:  1559   1st Qu.:1216.03   1st Qu.:  21.50  
#>  Median :  41463   Median :  4036   Median :1707.25   Median :  39.23  
#>  Mean   :  20655   Mean   :  4610   Mean   :1854.59   Mean   :  87.32  
#>  3rd Qu.:  80401   3rd Qu.:  6700   3rd Qu.:2340.79   3rd Qu.:  65.92  
#>  Max.   : 335102   Max.   : 28532   Max.   :4797.96   Max.   :1695.35  
#> ------------------------------------------------------
#> Diagnostics:
#>    Effective degrees of freedom: 1154.01 
#>    AICc: 36676.18 
#>    Residual sum of squares: 1.20127e+13 
#>    RMSE: 92531.89 
#> ------------------------------------------------------

3. Adaptive gaussian Kernel with CV

Cross-validation is an alternative to AICc, often preferred when the focus is on predictive performance.

# Bandwidth search: Adaptive (neighbors), Gaussian kernel, CV
res_CV_adaptive <- search_bandwidths(formula ='price ~ year + footage + land_area',data = mydata,coords = coords,kernels = c("gauss"),Model = "GWR",control = list( adaptive = TRUE, criterion = 'CV', verbose = FALSE,NN=500),hs_range = c(5,500))


cat("Optimal Bandwidth (Neighbors via AICc):", res_CV_adaptive$minimum, "\n")
#> Optimal Bandwidth (Neighbors via AICc): 43
cat("Computation Time:", res_CV_adaptive$ctime, "s\n")
#> Computation Time: 2.532 s

model_GWR_final <- res_CV_adaptive$best_model

Visualization of Results

We can visualize the spatially varying coefficients. MGWRSAR provides plotting methods that can be wrapped in plotly for interactivity.


# print crs of mydatasf : "EPSG:2154"
st_crs(mydatasf)$input
#> [1] "EPSG:2154"

# Plot Intercept
p1 <- plot(model_GWR_final, type = 'B_coef', var = 'Intercept',crs = "EPSG:2154")
plotly::partial_bundle(p1)

# Plot Coefficient for 'footage'
p2 <- plot(model_GWR_final, type = 'B_coef', var = 'footage',crs = "EPSG:2154")
plotly::partial_bundle(p2)

Standard Errors Computation

To compute Standard Errors (SE) for the coefficients, we must re-estimate the model setting SE = TRUE in the control list. Note that this is computationally more demanding.

# Re-estimating with SE=TRUE using the previously found bandwidth (H=12 approx from previous steps, strictly one should use the exact H found)
optimal_H <- model_GWR_final@H

model_GWR_SE <- MGWRSAR(
  formula = 'price ~ year + footage + land_area',
  data = mydata,
  coords = coords,
  fixed_vars = NULL,
  kernels = c('gauss'),
  H = optimal_H,
  Model = 'GWR',
  control = list(adaptive = TRUE, verbose = FALSE, SE = TRUE)
)

summary(model_GWR_SE)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = "price ~ year + footage + land_area", data = mydata, 
#>     coords = coords, fixed_vars = NULL, kernels = c("gauss"), 
#>     H = optimal_H, Model = "GWR", control = list(adaptive = TRUE, 
#>         verbose = FALSE, SE = TRUE))
#> 
#> Model: GWR 
#> ------------------------------------------------------
#> Kernels Configuration:
#>    Spatial Kernel  : gauss (Adaptive / Neighbors) 
#> ------------------------------------------------------
#> Bandwidth Configuration:
#>    Spatial Bandwidth (H): 43 
#> ------------------------------------------------------
#> Model Settings:
#>    Computation time: 0.28 sec
#>    Use of Target Points: NO 
#>    Use of rough kernel approximation: NO 
#>    Parallel computing: (ncore = 1) 
#>    Number of data points: 1403 
#> ------------------------------------------------------
#> Coefficients Summary:
#>    [Varying Parameters]
#>    Intercept              year            footage         land_area      
#>  Min.   :-174254.2   Min.   : -155.1   Min.   : 922.3   Min.   :  12.79  
#>  1st Qu.:   -862.6   1st Qu.: 1806.5   1st Qu.:1462.8   1st Qu.:  34.06  
#>  Median :  42295.1   Median : 3198.7   Median :1791.1   Median :  39.14  
#>  Mean   :  24129.1   Mean   : 4044.1   Mean   :1855.4   Mean   :  74.15  
#>  3rd Qu.:  62966.6   3rd Qu.: 5741.2   3rd Qu.:2045.8   3rd Qu.:  51.22  
#>  Max.   : 118898.3   Max.   :14190.8   Max.   :3732.8   Max.   :1012.05  
#> ------------------------------------------------------
#> Diagnostics:
#>    Effective degrees of freedom: 1325.98 
#>    AICc: 36779.28 
#>    Residual sum of squares: 1.772893e+13 
#>    RMSE: 112411.9 
#> ------------------------------------------------------

Plotting t-statistics:


# Plot t-statistic for 'year'
p3 <- plot(model_GWR_SE, type = 't_coef', var = 'year',crs="EPSG:2154")
plotly::partial_bundle(p3)

# Visualize effects: x_i * beta_i
plot_effect(model_GWR_SE, title = 'Variable Effects')

Mixed-GWR: Estimation

In a Mixed-GWR model, some coefficients are held constant (global) while others vary spatially. This is useful when certain variables are known to have a global impact or when testing for stationarity.

Here, we treat year as a fixed variable.

# Optimal bandwidth for Mixed-GWR (fixed_vars = 'year')
res_MGWR_AICc <- search_bandwidths(formula ='price ~ year + footage + land_area',data = mydata,coords = coords,kernels = c("gauss"),fixed_vars = 'year',Model = "MGWR",control = list( adaptive = TRUE, criterion = 'AICc', verbose = FALSE,NN=500),hs_range = c(5,500))

model_MGWR <- res_MGWR_AICc$best_model
summary(model_MGWR)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = formula, data = data, coords = coords, fixed_vars = fixed_vars, 
#>     kernels = kernels, H = c(res, Ht), Model = Model, control = control_o)
#> 
#> Model: MGWR 
#> ------------------------------------------------------
#> Kernels Configuration:
#>    Spatial Kernel  : gauss (Adaptive / Neighbors) 
#> ------------------------------------------------------
#> Bandwidth Configuration:
#>    Spatial Bandwidth (H): 8 
#> ------------------------------------------------------
#> Model Settings:
#>    Computation time: 0.053 sec
#>    Use of Target Points: NO 
#>    Use of rough kernel approximation: YES (500 neighbors) 
#>    Parallel computing: (ncore = 1) 
#>    Number of data points: 1403 
#> ------------------------------------------------------
#> Coefficients Summary:
#>    [Constant Parameters]
#>    year 
#> 4771.84 
#> 
#>    [Varying Parameters]
#>    Intercept          footage        land_area      
#>  Min.   :-698313   Min.   :-3035   Min.   :-952.39  
#>  1st Qu.: -32442   1st Qu.: 1167   1st Qu.:  14.84  
#>  Median :  23952   Median : 1754   Median :  37.01  
#>  Mean   :  16788   Mean   : 1890   Mean   :  95.88  
#>  3rd Qu.:  72818   3rd Qu.: 2479   3rd Qu.:  72.63  
#>  Max.   : 454408   Max.   : 5852   Max.   :1962.05  
#> ------------------------------------------------------
#> Diagnostics:
#>    Effective degrees of freedom: 1118.62 
#>    AICc: 36611.34 
#>    Residual sum of squares: 1.061961e+13 
#>    RMSE: 87001.24 
#> ------------------------------------------------------

Bootstrap Tests

The mgwrsar_bootstrap_test function allows testing for spatial non-stationarity of coefficients. This is computationally intensive as it involves multiple re-estimations.

Note: The code below is set to eval=FALSE to save compilation time.

# 1. Stationarity Test for 'year'
# Compare GWR (year varies) vs MGWR (year fixed)
# Assuming model_GWR and model_MGWR are estimated with their respective optimal bandwidths

test_stationarity <- mgwrsar_bootstrap_test(
  model_MGWR, # H0 model (constrained)
  model_GWR_SE,  # H1 model (unconstrained)
  B = 30,     # Number of bootstrap replications (increase for real analysis)
  type = 'spatial',
  ncore = 1,
  D=as.matrix(dist(coords))
)
print(test_stationarity)
#> $p_star
#> [1] 1
#> 
#> $T
#> [1] 3.788559
# If p_star > 0.05, we cannot reject H0: 'year' is globally stationary.

# 2. Significance Test
# Compare model with variable vs model without variable
# Example: Is 'year' significant?
res_GWR_reduced <- golden_search_bandwidth(
  formula = 'price ~ footage + land_area', 
  data = mydata, coords = coords, fixed_vars = NULL, kernels = c('gauss'), 
  lower.bound = 2, 
  upper.bound = 500, 
  Model = 'GWR', control = list(adaptive = T, criterion = 'AICc'),show_progress = F
)
model_GWR_reduced <- res_GWR_reduced$best_model

test_significance <- mgwrsar_bootstrap_test(
  model_GWR_reduced, 
  model_GWR_SE, 
  B = 30, 
  type = 'spatial',
  ncore = 1,
  D=as.matrix(dist(coords))
)
print(test_significance)
#> $p_star
#> [1] 0
#> 
#> $T
#> [1] 1.912824
# If p_star < 0.05, we reject H0: The variable 'year' adds significant information.

Prediction

Out-of-sample prediction in GWR requires calculating coefficients at new locations. mgwrsar offers several methods:

  1. Re-estimation (method_pred = 'tWtp_model'): Calculates weights between target points (out-sample) and calibration points (in-sample) using the kernel/bandwidth of the model. This is the robust approach described in Geniaux (2024).
  2. Shepard’s Interpolation (method_pred = 'shepard'): Inverse distance weighting of nearby existing coefficients.
  3. Model re-fitting: If the full data is available, one can simply re-run the estimation targeting the new coordinates.
set.seed(1)

# Split data into In-Sample (Calibration) and Out-Sample (Validation)
n <- nrow(mydata)
index_in <- sample(1:n, 800)
index_out <- setdiff(1:n, index_in)

data_in <- mydata[index_in, ]
coords_in <- coords[index_in, ]

data_out <- mydata[index_out, ]
coords_out <- coords[index_out, ]

# 1. Train GWR model on In-Sample data
res_pred_train <- golden_search_bandwidth(
  formula = 'price ~ year + footage + land_area',
  data = data_in,
  coords = coords_in,
  kernels = c('gauss'),
  Model = 'GWR',
  control = list(adaptive = TRUE, criterion = 'AICc', verbose = FALSE),
  lower.bound = 5, upper.bound = 200,show_progress = F
)
model_train <- res_pred_train$best_model

Comparison of Prediction Methods

# Method A: Standard Prediction 
# This computes the local coefficients at new locations based on the training data
Y_pred <- predict(model_train, newdata = data_out, newdata_coords = coords_out)
rmse_default <- rmse(data_out$price, Y_pred)

# Method B: Shepard Interpolation of Coefficients
# Faster but potentially less accurate as it interpolates betas, not the local regression
Y_pred_shepard <- predict(model_train, newdata = data_out, newdata_coords = coords_out, method_pred = 'shepard')
rmse_shepard <- rmse(data_out$price, Y_pred_shepard)

# Method C: using tWtp_model by default for GWR objects in newer versions
Y_pred_tWtp_model <- predict(model_train, newdata = data_out, newdata_coords = coords_out, method_pred = 'tWtp_model')
rmse_tWtp_model <- rmse(data_out$price, Y_pred_tWtp_model)

cat("RMSE default:", rmse_default, "\n")
#> RMSE default: 905087.2
cat("RMSE (Shepard):   ", rmse_shepard, "\n")
#> RMSE (Shepard):    112780.9
cat("RMSE (tWtp_model):     ", rmse_tWtp_model, "\n")
#> RMSE (tWtp_model):      112512.1

References

  • Brunsdon, C., Fotheringham, A., Charlton, M. (1996). Geographically weighted regression: a method for exploring spatial nonstationarity. Geographical Analysis, 28(4), 281–298.
  • Geniaux, G. and Martinetti, D. (2018). A new method for dealing simultaneously with spatial autocorrelation and spatial heterogeneity in regression models. Regional Science and Urban Economics, 72, 74-85. https://doi.org/10.1016/j.regsciurbeco.2017.04.001
  • Geniaux, G. (2024). Speeding up estimation of spatially varying coefficients models. Journal of Geographical Systems, 26, 293–327. https://doi.org/10.1007/s10109-024-00442-3
  • Goulard, M., Laurent, T., & Thomas-Agnan, C. (2017). About predictions in spatial autoregressive models: Optimal and almost optimal strategies. Spatial Economic Analysis, 12(2-3), 304-325.