22-27 September 2019
Trade Fairs and Congress Center (FYCMA)
Europe/Madrid timezone

Feature selection for mapping the probability of groundwater pollution using Random Forest

24 Sep 2019, 11:45
Conference room 1.A

Oral | Topic 8 - Groundwater quality and pollution processes | Parallel


Dr Maria Paula Mendes (CERIS, Civil Engineering Research and Innovation for Sustainability, Instituto Superior Técnico, Universidade de Lisboa)


World agricultural use of chemical or mineral fertilizers totalled 110 Mt of nitrogen (N) in 2016, equivalent to 69 kg N/ha of cropland (arable land and permanent crops) (FAO, 2018). The excessive use of nitrogen-containing fertilizers and manures is one of the main sources of nitrate contamination of groundwater. WHO (2011) and the EU Water Framework Directive (2000) consider groundwater polluted when the nitrate concentration is equal to or above the guideline value of 50 mg/L. Machine learning algorithms (MLAs) have been increasingly used to predict nitrate concentration in groundwater, since they can recognize patterns between nitrate concentrations and different features, learning from data without an imposed physical model. For the induction of an MLA, one can use all available features or select a smaller subset of them, removing redundant or spurious features. Many approaches can be used to evaluate the importance of the features related to groundwater pollution caused by nitrates. Feature selection (FS) is a process that selects a subset of the original features, optimizing the feature space according to a given criterion. FS contributes to a better understanding of nitrate pollution of groundwater by focusing on the relevant data and improving MLA performance. Different approaches to FS exist, such as wrappers and embedded methods. Wrapper-based algorithms select a subset of relevant features based on the performance of a given learning method when the feature space is either increased or reduced. Within wrapper methods, different types of sequential search can be applied to feed the MLA; sequential backward selection (SBS), sequential forward selection (SFS), sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS) were evaluated. Embedded algorithms, on the other hand, perform variable selection using internal measures of performance during the training of the algorithm.
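The floating forward search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `score` callable, the feature names `x1` to `x5` and the toy gains are all hypothetical. In a real wrapper, `score` would be the estimated performance of the learning method (here, a random forest) trained on the candidate subset.

```python
def sffs(features, score, k_max):
    """Sequential forward floating selection, maximising `score`.

    `score` is a hypothetical callable that maps a list of feature
    names to a number (higher is better).
    Returns (best_score, best_subset) over all subset sizes visited.
    """
    k_max = min(k_max, len(features))
    selected = []
    best_by_size = {}  # subset size -> (score, subset) best seen so far

    while len(selected) < k_max:
        # Forward step: add the single feature that helps the most.
        candidates = [f for f in features if f not in selected]
        best_f = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [best_f]
        s = score(selected)
        size = len(selected)
        if size not in best_by_size or s > best_by_size[size][0]:
            best_by_size[size] = (s, list(selected))

        # Floating step: drop features while doing so beats the best
        # subset previously recorded at the smaller size.
        while len(selected) > 2:
            drop = max(selected,
                       key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            s_red = score(reduced)
            if s_red > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (s_red, list(reduced))
            else:
                break

    # On ties, the smaller subset wins (dict keeps insertion order).
    return max(best_by_size.values(), key=lambda t: t[0])


def toy_score(subset):
    """Toy criterion (assumption): additive gains minus a size penalty."""
    gain = {"x1": 0.30, "x2": 0.25, "x3": 0.20, "x4": 0.02, "x5": 0.01}
    return sum(gain[f] for f in subset) - 0.03 * len(subset)


best_score, best_subset = sffs(["x1", "x2", "x3", "x4", "x5"], toy_score, k_max=5)
print(best_subset, round(best_score, 2))
```

With this toy criterion the search keeps the three high-gain features and discards the two weak ones, because each weak feature adds less than the size penalty. SFS corresponds to the same loop without the floating step; SBS and SBFS start from the full feature set and work in the opposite direction.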
Random forest (RF) for classification was used as the learning method, and a bootstrap routine was incorporated into the wrapper and embedded methods to evaluate the generalization of the prediction model. A database of 20 features was used, composed of hydrogeological and hydrological features, driving forces (sectors of activity that may produce a series of pressures, either as point or non-point sources) and remotely sensed variables (Normalized Difference Vegetation Index, NDVI). Nitrate concentrations from 110 wells were used as the target feature. The SFFS RF wrapper outperformed the other methods (mean misclassification error = 0.12, area under the ROC curve = 0.92), selecting only three features: industries and facilities rated according to their production capacity and total nitrogen emissions to water within a 3 km buffer; livestock farms rated by manure production within a 5 km buffer; and cumulated NDVI for the post-maximum month, used as a proxy for vegetation productivity and crop yield.
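The two reported metrics and the bootstrap idea can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the target is assumed to be binarised at the 50 mg/L guideline value, and the bootstrap is assumed to evaluate on out-of-bag wells, a detail the abstract does not specify.

```python
import random

NITRATE_LIMIT_MG_L = 50.0  # guideline value cited in the abstract


def label_polluted(nitrate_mg_l):
    """Binary target: 1 if the concentration is at or above the limit."""
    return [1 if c >= NITRATE_LIMIT_MG_L else 0 for c in nitrate_mg_l]


def misclassification_error(y_true, y_pred):
    """Fraction of samples whose predicted class differs from the truth."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive sample scores higher than a random negative one (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_split(n, rng):
    """Draw n indices with replacement (in-bag); indices never drawn
    form the out-of-bag set used for evaluation (assumed scheme)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(in_bag))
    return in_bag, out_of_bag


# Toy usage with made-up values, not data from the study.
y = label_polluted([12.0, 49.9, 50.0, 72.3])
print(y)
print(misclassification_error(y, [0, 0, 1, 0]))
print(roc_auc(y, [0.1, 0.4, 0.35, 0.8]))
```

Averaging the error over many bootstrap replicates would give a mean misclassification error comparable in kind to the one quoted above, with the classifier trained on the in-bag wells and tested on the out-of-bag wells in each replicate.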

Primary authors

Dr Maria Paula Mendes (CERIS, Civil Engineering Research and Innovation for Sustainability, Instituto Superior Técnico, Universidade de Lisboa)
Prof. Víctor Rodriguez-Galiano (Physical Geography and Regional Geographic Analysis, University of Seville)
Dr Juan Luque-Espinar (Unidad del IGME en Granada)
Prof. Mario Chica-Olmo (Departamento de Geodinámica, Universidad de Granada)
