Machine Learning Predicts Highest-Risk Groundwater Sites to Improve Water Quality Monitoring

NC State

An interdisciplinary team of researchers has developed a machine learning framework that uses limited water quality samples to predict which inorganic pollutants are likely to be present in a groundwater supply. The new tool allows regulators and public health authorities to prioritize specific aquifers for water quality testing.

This proof-of-concept work focused on Arizona and North Carolina but could be applied to fill critical gaps in groundwater quality data in any region.

Groundwater is a source of drinking water for millions and often contains pollutants that pose health risks. However, many regions lack complete groundwater quality datasets.

“Monitoring water quality is time-consuming and expensive, and the more pollutants you test for, the more time-consuming and expensive it is,” says Yaroslava Yingling, co-corresponding author of a paper describing the work and Kobe Steel Distinguished Professor of Materials Science and Engineering at North Carolina State University.

“As a result, there is interest in identifying which groundwater supplies should be prioritized for testing, maximizing limited monitoring resources,” Yingling says. “We know that naturally occurring pollutants, such as arsenic or lead, tend to occur in conjunction with other specific elements due to geological and environmental factors. This posed an important data question: with limited water quality data for a groundwater supply, could we predict the presence and concentrations of other pollutants?”

“Along with identifying elements that pose a risk to human health, we also wanted to see if we could predict the presence of other elements – such as phosphorus – which can be beneficial in agricultural contexts but may pose environmental risks in other settings,” says Alexey Gulyuk, a co-first author of the paper and a teaching professor of materials science and engineering at NC State.

To address this challenge, the researchers drew on a huge data set, encompassing more than 140 years of water quality monitoring data for groundwater in the states of North Carolina and Arizona. Altogether, the data set included more than 20 million data points, covering more than 50 water quality parameters.

“We used this data set to ‘train’ a machine learning model to predict which elements would be present based on the available water quality data,” says Akhlak Ul Mahmood, co-first author of this work and a former Ph.D. student at NC State. “In other words, if we only have data on a handful of parameters, the program could still predict which inorganic pollutants were likely to be in the water, as well as how abundant those pollutants are likely to be.”
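The release does not spell out the algorithms involved, so the following is only a minimal sketch of the general idea of filling in an untested parameter from parameters that were measured. It uses a generic k-nearest-neighbor imputation, which is not necessarily among the methods the researchers used. The sample values, the parameter names and the `knn_impute` helper are all hypothetical, and the sketch assumes every sample has a value for each non-target parameter.

```python
import math

# Hypothetical groundwater samples (illustrative values, not from the study).
# None marks a parameter that was not tested at that site.
samples = [
    {"pH": 7.1, "sulfate": 40.0, "arsenic": 2.0},
    {"pH": 7.3, "sulfate": 45.0, "arsenic": 3.0},
    {"pH": 8.2, "sulfate": 120.0, "arsenic": 14.0},
    {"pH": 8.4, "sulfate": 130.0, "arsenic": 16.0},
    {"pH": 8.3, "sulfate": 125.0, "arsenic": None},
]


def knn_impute(samples, target, k=2):
    """Return copies of the samples with missing `target` values estimated
    as the mean of the k most similar samples where `target` was measured."""
    features = [name for name in samples[0] if name != target]
    known = [s for s in samples if s[target] is not None]

    # Scale each feature by its observed range so no single parameter
    # dominates the distance just because of its units.
    spans = {}
    for name in features:
        values = [s[name] for s in samples]
        spans[name] = (max(values) - min(values)) or 1.0

    def scaled(sample):
        return [sample[name] / spans[name] for name in features]

    filled = []
    for s in samples:
        if s[target] is not None:
            filled.append(dict(s))
            continue
        nearest = sorted(known, key=lambda r: math.dist(scaled(s), scaled(r)))[:k]
        estimate = sum(r[target] for r in nearest) / len(nearest)
        filled.append({**s, target: estimate})
    return filled


filled = knn_impute(samples, "arsenic")
ARSENIC_LIMIT = 10.0  # micrograms per liter, the U.S. EPA drinking water limit for arsenic
for original, estimated in zip(samples, filled):
    if original["arsenic"] is None and estimated["arsenic"] > ARSENIC_LIMIT:
        print("Prioritize for testing:", estimated)
```

In this toy data, the site with no arsenic measurement most closely resembles the two high-arsenic sites, so its estimate lands above the limit and the site is flagged for follow-up sampling. That is the kind of prioritization the researchers describe.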

One key finding of the study is that the model suggests pollutants are exceeding drinking water standards in more groundwater sources than previously documented. While field data indicated that 75% to 80% of sampled locations were within safe limits, the machine learning framework predicts that only 15% to 55% of the sites may truly be risk-free.

“As a result, we’ve identified quite a few groundwater sites that should be prioritized for additional testing,” says Minhazul Islam, co-first author of the paper and a Ph.D. student at Arizona State University. “By identifying potential ‘hot spots,’ state agencies and municipalities can strategically allocate resources to high-risk areas, ensuring more targeted sampling and effective water treatment solutions.”

“It’s extremely promising and we think it works well,” Gulyuk says. “However, the real test will be when we begin using the model in the real world and seeing if the prediction accuracy holds up.”

Moving forward, researchers plan to enhance the model by expanding its training data across diverse U.S. regions; integrating new data sources, such as environmental data layers, to address emerging contaminants; and conducting real-world testing to ensure robust, targeted groundwater safety measures worldwide.

“We see tremendous potential in this approach,” says Paul Westerhoff, co-corresponding author and Regents’ Professor in the School of Sustainable Engineering and the Built Environment at ASU. “By continuously improving its accuracy and expanding its reach, we’re laying the groundwork for proactive water safety measures across the globe.”

“This model also offers a promising tool for tracking phosphorus levels in groundwater, helping us identify and address potential contamination risks more efficiently,” says Jacob Jones, director of the National Science Foundation-funded Science and Technologies for Phosphorus Sustainability (STEPS) Center at NC State, which helped fund this work. “Looking ahead, extending this model to support broader phosphorus sustainability could have a significant impact, enabling us to manage this critical nutrient across various ecosystems and agricultural systems, ultimately fostering more sustainable practices.”

The paper, “Multiple Data Imputation Methods Advance Risk Analysis and Treatability of Co-occurring Inorganic Chemicals in Groundwater,” is published open access in the journal Environmental Science & Technology. The paper was co-authored by Emily Briese and Mohit Malu, both Ph.D. students at Arizona State; Carmen Velasco, a former postdoctoral researcher at Arizona State; Naushita Sharma, a postdoctoral researcher at Oak Ridge National Laboratory; and Andreas Spanias, a professor of digital signal processing at Arizona State.

This work was supported by the NSF STEPS Center and by the Metals and Metal Mixtures: Cognitive Aging, Remediation and Exposure Sources (MEMCARE) Superfund Research Center based at Harvard University, which is supported by the National Institute of Environmental Health Sciences under grant P42ES030990.
