Soil moisture (SM) is a key variable in hydrological studies, but its strong spatial heterogeneity limits accurate large-scale characterization. Gridded SM products (remotely sensed and reanalysis) provide broad coverage, yet they suffer from retrieval and model biases. Machine-learning (ML) methods are increasingly used for bias correction, but their performance depends strongly on the availability and representativeness of training data. Most existing ML-based SM products are trained using observations from the International Soil Moisture Network (ISMN), which has sparse coverage across Canada. In situ observations outside ISMN, referred to here as CanObs, therefore represent an underutilized resource. We evaluate reanalysis-, satellite-, and ML-based gridded SM products against CanObs observations across Canada. ERA5-Land and SMAP exhibit substantial errors, particularly in forested and shrubland regions, while existing ML-based products systematically underestimate SM at most CanObs sites. To assess the impact of training data, we develop two random forest models, trained separately on ISMN and CanObs observations, and evaluate their spatial and temporal transferability. The CanObs-trained model consistently outperforms the ISMN-trained model at independent Canadian sites and reduces the unbiased RMSE of ERA5-Land SM from 19% to 6.9%. We further show that poor spatial transferability is linked to limited overlap in feature distributions between training and testing datasets. ISMN-trained models rely mainly on large-scale predictors, whereas local factors such as topography and soil texture dominate at CanObs sites. These results highlight the importance of regionally representative training data for improving ML-based SM estimation in data-sparse environments.
Halifax NS
Canada
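For readers unfamiliar with the unbiased RMSE quoted in the abstract, the following is a minimal sketch of how the metric is commonly computed: the mean of each series is removed before taking the root-mean-square difference, so a constant offset between product and observation does not contribute. The function name `ubrmse` and the sample values are hypothetical and are not taken from the study; the result is in the same units as the inputs.

```python
import math


def ubrmse(estimates, observations):
    """Unbiased RMSE: RMSE computed after removing each series' own mean.

    Hypothetical helper for illustration only; not the study's code.
    """
    if len(estimates) != len(observations) or len(estimates) == 0:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(estimates)
    mean_est = sum(estimates) / n
    mean_obs = sum(observations) / n

    # Squared differences between the two anomaly series.
    squared = [
        ((e - mean_est) - (o - mean_obs)) ** 2
        for e, o in zip(estimates, observations)
    ]
    return math.sqrt(sum(squared) / n)


# Illustrative values only: a product that is offset from the
# observations by a constant has (near-)zero unbiased RMSE.
product = [0.30, 0.35, 0.40]
in_situ = [0.20, 0.25, 0.30]
offset_only_error = ubrmse(product, in_situ)
```

Because the mean bias is removed, this metric isolates errors in temporal variability, which is why it is often reported alongside (rather than instead of) the bias itself.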