The mgwrsar package is designed to estimate linear and
local linear models, with or without spatial autocorrelation. While the
package handles complex Spatial Autoregressive (SAR) specifications,
this vignette focuses on canonical Geographically Weighted
Regression (GWR) and Mixed-GWR (where some
coefficients are constant over space) without a spatial lag term.
The general formulations are:
GWR Model: \[ y_i = \beta_0(u_i,v_i) + \sum_{k=1}^p \beta_k(u_i,v_i) X_{ik} + \epsilon_i \]
Mixed-GWR Model: \[ y_i = \sum_{j=1}^q \gamma_j Z_{ij} + \beta_0(u_i,v_i) + \sum_{k=1}^p \beta_k(u_i,v_i) X_{ik} + \epsilon_i \]
Where: * \((u_i, v_i)\) are the coordinates of observation \(i\). * \(\beta_k(u_i,v_i)\) are spatially varying coefficients. * \(\gamma_j\) are constant (fixed) coefficients for variables \(Z\).
For a deeper understanding of the estimation methods (including SAR specifications), please refer to Geniaux, G. and Martinetti, D. (2018).
MGWRSAR functionThe estimation is performed using the MGWRSAR wrapper
function. Key capabilities include:
data.frame with a
coordinates matrix, or spatial objects like sf,
SpatialPointsDataFrame, etc.golden_search_bandwidth function identifies the optimal
bandwidth using AICc (corrected Akaike Information
Criterion) or CV (Cross-Validation).Type='GD') and
spatio-temporal (Type='GDT') data.Rcpp and RcppEigen for speed, supports
parallel computing via doParallel, and offers “Target
Points” estimation strategies (see vignette
‘Speeding_up_GWR_like_models’ and Geniaux
(2024)).We use the mydatasf dataset included in the package. It
contains real estate data (single houses) from the South of France with
the following variables:
price: Price in euros (Dependent Variable).year: Year of sale.footage: Total living area.land_area: Area of the land plot.pbb45: Percentage of houses built before 1945 in the
surrounding km².## Loading data example
data(mydatasf)
# Quick visualization of the dependent variable
plot(mydatasf["price"], main = "House Prices Distribution", pch = 19, cex = 0.6)
# Prepare data for usage
mydata <- st_drop_geometry(mydatasf)
coords <- st_coordinates(mydatasf)
Estimating a GWR model requires selecting an optimal bandwidth. We
use golden_search_bandwidth to find this optimum based on a
specific criterion (AICc or CV).
Here we search for an optimal distance (in map units, e.g., meters) using a Bisquare kernel and AICc.
# Bandwidth search: Non-adaptive (fixed distance), Bisquare kernel, AICc
res_AIC_fixed <- search_bandwidths(formula ='price ~ year + footage + land_area',data = mydata,coords = coords,kernels = c("bisq"),Model = "GWR",control = list( adaptive = FALSE, criterion = 'AICc', verbose = FALSE),hs_range = c(100,50000))
cat("Optimal Bandwidth (Distance):", res_AIC_fixed$best_model@H, "\n")
#> Optimal Bandwidth (Distance): 735.4214
cat("Computation Time:", res_AIC_fixed$ctime, "s\n")
#> Computation Time: 3.946 s
# The function returns the best model directly
model_GWR_fixed <- res_AIC_fixed$best_model
summary(model_GWR_fixed)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = formula, data = data, coords = coords, fixed_vars = fixed_vars,
#> kernels = kernels, H = c(res, Ht), Model = Model, control = control_o)
#>
#> Model: GWR
#> ------------------------------------------------------
#> Kernels Configuration:
#> Spatial Kernel : bisq (Fixed / Distance)
#> ------------------------------------------------------
#> Bandwidth Configuration:
#> Spatial Bandwidth (H): 735.42
#> ------------------------------------------------------
#> Model Settings:
#> Computation time: 0.076 sec
#> Use of Target Points: NO
#> Use of rough kernel approximation: NO
#> Parallel computing: (ncore = 1)
#> Number of data points: 1403
#> ------------------------------------------------------
#> Coefficients Summary:
#> [Varying Parameters]
#> Intercept year footage land_area
#> Min. :-2303764 Min. :-53026 Min. :-5253 Min. :-496.80
#> 1st Qu.: 2676 1st Qu.: 1924 1st Qu.: 1391 1st Qu.: 30.41
#> Median : 40071 Median : 3394 Median : 1760 Median : 40.41
#> Mean : 16867 Mean : 4598 Mean : 1933 Mean : 46.69
#> 3rd Qu.: 61272 3rd Qu.: 6199 3rd Qu.: 2286 3rd Qu.: 47.94
#> Max. : 2091491 Max. :110561 Max. : 7429 Max. : 401.55
#> ------------------------------------------------------
#> Diagnostics:
#> Effective degrees of freedom: 1238.51
#> AICc: 36609.91
#> Residual sum of squares: 1.352877e+13
#> RMSE: 98197.49
#> ------------------------------------------------------
Here we search for an optimal number of neighbors using a Gaussian kernel and AICc.
# Bandwidth search: Adaptive (neighbors), Gaussian kernel, AICc
res_AIC_adaptive <- search_bandwidths(formula ='price ~ year + footage + land_area',data = mydata,coords = coords,kernels = c("gauss"),Model = "GWR",control = list( adaptive = TRUE, criterion = 'AICc', verbose = FALSE,NN=500),hs_range = c(5,500))
cat("Optimal Bandwidth (Neighbors):", res_AIC_adaptive$minimum, "\n")
#> Optimal Bandwidth (Neighbors): 12
cat("Computation Time:", res_AIC_adaptive$ctime, "s\n")
#> Computation Time: 1.887 s
model_GWR_adaptive_aic <- res_AIC_adaptive$best_model
summary(model_GWR_adaptive_aic)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = formula, data = data, coords = coords, fixed_vars = fixed_vars,
#> kernels = kernels, H = c(res, Ht), Model = Model, control = control_o)
#>
#> Model: GWR
#> ------------------------------------------------------
#> Kernels Configuration:
#> Spatial Kernel : gauss (Adaptive / Neighbors)
#> ------------------------------------------------------
#> Bandwidth Configuration:
#> Spatial Bandwidth (H): 12
#> ------------------------------------------------------
#> Model Settings:
#> Computation time: 0.062 sec
#> Use of Target Points: NO
#> Use of rough kernel approximation: YES (500 neighbors)
#> Parallel computing: (ncore = 1)
#> Number of data points: 1403
#> ------------------------------------------------------
#> Coefficients Summary:
#> [Varying Parameters]
#> Intercept year footage land_area
#> Min. :-534813 Min. :-10861 Min. : 20.66 Min. : -26.13
#> 1st Qu.: -23587 1st Qu.: 1559 1st Qu.:1216.03 1st Qu.: 21.50
#> Median : 41463 Median : 4036 Median :1707.25 Median : 39.23
#> Mean : 20655 Mean : 4610 Mean :1854.59 Mean : 87.32
#> 3rd Qu.: 80401 3rd Qu.: 6700 3rd Qu.:2340.79 3rd Qu.: 65.92
#> Max. : 335102 Max. : 28532 Max. :4797.96 Max. :1695.35
#> ------------------------------------------------------
#> Diagnostics:
#> Effective degrees of freedom: 1154.01
#> AICc: 36676.18
#> Residual sum of squares: 1.20127e+13
#> RMSE: 92531.89
#> ------------------------------------------------------
Cross-validation is an alternative to AICc, often preferred when the focus is on predictive performance.
# Bandwidth search: Adaptive (neighbors), Gaussian kernel, CV
res_CV_adaptive <- search_bandwidths(formula ='price ~ year + footage + land_area',data = mydata,coords = coords,kernels = c("gauss"),Model = "GWR",control = list( adaptive = TRUE, criterion = 'CV', verbose = FALSE,NN=500),hs_range = c(5,500))
cat("Optimal Bandwidth (Neighbors via AICc):", res_CV_adaptive$minimum, "\n")
#> Optimal Bandwidth (Neighbors via AICc): 43
cat("Computation Time:", res_CV_adaptive$ctime, "s\n")
#> Computation Time: 2.532 s
model_GWR_final <- res_CV_adaptive$best_model
We can visualize the spatially varying coefficients.
MGWRSAR provides plotting methods that can be wrapped in
plotly for interactivity.
# print crs of mydatasf : "EPSG:2154"
st_crs(mydatasf)$input
#> [1] "EPSG:2154"
# Plot Intercept
p1 <- plot(model_GWR_final, type = 'B_coef', var = 'Intercept',crs = "EPSG:2154")
plotly::partial_bundle(p1)
# Plot Coefficient for 'footage'
p2 <- plot(model_GWR_final, type = 'B_coef', var = 'footage',crs = "EPSG:2154")
plotly::partial_bundle(p2)
To compute Standard Errors (SE) for the coefficients, we must
re-estimate the model setting SE = TRUE in the control
list. Note that this is computationally more demanding.
# Re-estimating with SE=TRUE using the previously found bandwidth (H=12 approx from previous steps, strictly one should use the exact H found)
optimal_H <- model_GWR_final@H
model_GWR_SE <- MGWRSAR(
formula = 'price ~ year + footage + land_area',
data = mydata,
coords = coords,
fixed_vars = NULL,
kernels = c('gauss'),
H = optimal_H,
Model = 'GWR',
control = list(adaptive = TRUE, verbose = FALSE, SE = TRUE)
)
summary(model_GWR_SE)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = "price ~ year + footage + land_area", data = mydata,
#> coords = coords, fixed_vars = NULL, kernels = c("gauss"),
#> H = optimal_H, Model = "GWR", control = list(adaptive = TRUE,
#> verbose = FALSE, SE = TRUE))
#>
#> Model: GWR
#> ------------------------------------------------------
#> Kernels Configuration:
#> Spatial Kernel : gauss (Adaptive / Neighbors)
#> ------------------------------------------------------
#> Bandwidth Configuration:
#> Spatial Bandwidth (H): 43
#> ------------------------------------------------------
#> Model Settings:
#> Computation time: 0.28 sec
#> Use of Target Points: NO
#> Use of rough kernel approximation: NO
#> Parallel computing: (ncore = 1)
#> Number of data points: 1403
#> ------------------------------------------------------
#> Coefficients Summary:
#> [Varying Parameters]
#> Intercept year footage land_area
#> Min. :-174254.2 Min. : -155.1 Min. : 922.3 Min. : 12.79
#> 1st Qu.: -862.6 1st Qu.: 1806.5 1st Qu.:1462.8 1st Qu.: 34.06
#> Median : 42295.1 Median : 3198.7 Median :1791.1 Median : 39.14
#> Mean : 24129.1 Mean : 4044.1 Mean :1855.4 Mean : 74.15
#> 3rd Qu.: 62966.6 3rd Qu.: 5741.2 3rd Qu.:2045.8 3rd Qu.: 51.22
#> Max. : 118898.3 Max. :14190.8 Max. :3732.8 Max. :1012.05
#> ------------------------------------------------------
#> Diagnostics:
#> Effective degrees of freedom: 1325.98
#> AICc: 36779.28
#> Residual sum of squares: 1.772893e+13
#> RMSE: 112411.9
#> ------------------------------------------------------
Plotting t-statistics:
# Plot t-statistic for 'year'
p3 <- plot(model_GWR_SE, type = 't_coef', var = 'year',crs="EPSG:2154")
plotly::partial_bundle(p3)
# Visualize effects: x_i * beta_i
plot_effect(model_GWR_SE, title = 'Variable Effects')
In a Mixed-GWR model, some coefficients are held constant (global) while others vary spatially. This is useful when certain variables are known to have a global impact or when testing for stationarity.
Here, we treat year as a fixed
variable.
# Optimal bandwidth for Mixed-GWR (fixed_vars = 'year')
res_MGWR_AICc <- search_bandwidths(formula ='price ~ year + footage + land_area',data = mydata,coords = coords,kernels = c("gauss"),fixed_vars = 'year',Model = "MGWR",control = list( adaptive = TRUE, criterion = 'AICc', verbose = FALSE,NN=500),hs_range = c(5,500))
model_MGWR <- res_MGWR_AICc$best_model
summary(model_MGWR)
#> ------------------------------------------------------
#> Call:
#> MGWRSAR(formula = formula, data = data, coords = coords, fixed_vars = fixed_vars,
#> kernels = kernels, H = c(res, Ht), Model = Model, control = control_o)
#>
#> Model: MGWR
#> ------------------------------------------------------
#> Kernels Configuration:
#> Spatial Kernel : gauss (Adaptive / Neighbors)
#> ------------------------------------------------------
#> Bandwidth Configuration:
#> Spatial Bandwidth (H): 8
#> ------------------------------------------------------
#> Model Settings:
#> Computation time: 0.053 sec
#> Use of Target Points: NO
#> Use of rough kernel approximation: YES (500 neighbors)
#> Parallel computing: (ncore = 1)
#> Number of data points: 1403
#> ------------------------------------------------------
#> Coefficients Summary:
#> [Constant Parameters]
#> year
#> 4771.84
#>
#> [Varying Parameters]
#> Intercept footage land_area
#> Min. :-698313 Min. :-3035 Min. :-952.39
#> 1st Qu.: -32442 1st Qu.: 1167 1st Qu.: 14.84
#> Median : 23952 Median : 1754 Median : 37.01
#> Mean : 16788 Mean : 1890 Mean : 95.88
#> 3rd Qu.: 72818 3rd Qu.: 2479 3rd Qu.: 72.63
#> Max. : 454408 Max. : 5852 Max. :1962.05
#> ------------------------------------------------------
#> Diagnostics:
#> Effective degrees of freedom: 1118.62
#> AICc: 36611.34
#> Residual sum of squares: 1.061961e+13
#> RMSE: 87001.24
#> ------------------------------------------------------
The mgwrsar_bootstrap_test function allows testing for
spatial non-stationarity of coefficients. This is computationally
intensive as it involves multiple re-estimations.
Note: The code below is set to eval=FALSE to save
compilation time.
# 1. Stationarity Test for 'year'
# Compare GWR (year varies) vs MGWR (year fixed)
# Assuming model_GWR and model_MGWR are estimated with their respective optimal bandwidths
test_stationarity <- mgwrsar_bootstrap_test(
model_MGWR, # H0 model (constrained)
model_GWR_SE, # H1 model (unconstrained)
B = 30, # Number of bootstrap replications (increase for real analysis)
type = 'spatial',
ncore = 1,
D=as.matrix(dist(coords))
)
print(test_stationarity)
#> $p_star
#> [1] 1
#>
#> $T
#> [1] 3.788559
# If p_star > 0.05, we cannot reject H0: 'year' is globally stationary.
# 2. Significance Test
# Compare model with variable vs model without variable
# Example: Is 'year' significant?
res_GWR_reduced <- golden_search_bandwidth(
formula = 'price ~ footage + land_area',
data = mydata, coords = coords, fixed_vars = NULL, kernels = c('gauss'),
lower.bound = 2,
upper.bound = 500,
Model = 'GWR', control = list(adaptive = T, criterion = 'AICc'),show_progress = F
)
model_GWR_reduced <- res_GWR_reduced$best_model
test_significance <- mgwrsar_bootstrap_test(
model_GWR_reduced,
model_GWR_SE,
B = 30,
type = 'spatial',
ncore = 1,
D=as.matrix(dist(coords))
)
print(test_significance)
#> $p_star
#> [1] 0
#>
#> $T
#> [1] 1.912824
# If p_star < 0.05, we reject H0: The variable 'year' adds significant information.
Out-of-sample prediction in GWR requires calculating coefficients at
new locations. mgwrsar offers several methods:
method_pred = 'tWtp_model'): Calculates weights
between target points (out-sample) and calibration points (in-sample)
using the kernel/bandwidth of the model. This is the robust approach
described in Geniaux (2024).method_pred = 'shepard'): Inverse distance
weighting of nearby existing coefficients.set.seed(1)
# Split data into In-Sample (Calibration) and Out-Sample (Validation)
n <- nrow(mydata)
index_in <- sample(1:n, 800)
index_out <- setdiff(1:n, index_in)
data_in <- mydata[index_in, ]
coords_in <- coords[index_in, ]
data_out <- mydata[index_out, ]
coords_out <- coords[index_out, ]
# 1. Train GWR model on In-Sample data
res_pred_train <- golden_search_bandwidth(
formula = 'price ~ year + footage + land_area',
data = data_in,
coords = coords_in,
kernels = c('gauss'),
Model = 'GWR',
control = list(adaptive = TRUE, criterion = 'AICc', verbose = FALSE),
lower.bound = 5, upper.bound = 200,show_progress = F
)
model_train <- res_pred_train$best_model
# Method A: Standard Prediction
# This computes the local coefficients at new locations based on the training data
Y_pred <- predict(model_train, newdata = data_out, newdata_coords = coords_out)
rmse_default <- rmse(data_out$price, Y_pred)
# Method B: Shepard Interpolation of Coefficients
# Faster but potentially less accurate as it interpolates betas, not the local regression
Y_pred_shepard <- predict(model_train, newdata = data_out, newdata_coords = coords_out, method_pred = 'shepard')
rmse_shepard <- rmse(data_out$price, Y_pred_shepard)
# Method C: using tWtp_model by default for GWR objects in newer versions
Y_pred_tWtp_model <- predict(model_train, newdata = data_out, newdata_coords = coords_out, method_pred = 'tWtp_model')
rmse_tWtp_model <- rmse(data_out$price, Y_pred_tWtp_model)
cat("RMSE default:", rmse_default, "\n")
#> RMSE default: 905087.2
cat("RMSE (Shepard): ", rmse_shepard, "\n")
#> RMSE (Shepard): 112780.9
cat("RMSE (tWtp_model): ", rmse_tWtp_model, "\n")
#> RMSE (tWtp_model): 112512.1