From the vast amount of water monitoring data, you can extract clusters of substances and identify patterns that may tell you something about the source of micropollutants, researchers write in Environments.

When monitoring water quality, you collect piles of data to check that the concentration of certain micropollutants is not rising too high. But this data is not usually analysed further. To get more out of it, researchers at KWR Water Research Institute looked at clusters of substances in the datasets, rather than at individual substances, to see if they could find patterns and say something about where the pollutants came from.

Drinking water

But this is about more than interesting data being left unused. ‘Drinking water sources - and therefore drinking water companies - are under increasing pressure,’ says Tessa Pronk, data scientist at KWR. This is due to dry summers, but also to the increasing use of chemicals that do not belong in water. Together with colleagues Elvio Amato, Stefan Kools and Thomas ter Laak, she set up a project within the drinking water companies’ joint research programme to see if they could discover patterns and eventually explain the fluctuating concentrations of micropollutants.

The team looked at a total of 196 clusters of substances at 19 monitoring sites along the Rhine and Meuse rivers and compared them with data from reference lists. ‘These are reference lists of substances with, for example, similar uses,’ says Thomas ter Laak, an environmental chemist at KWR. From the composition of the substances within these reference lists, you can derive information about the source or the emission route. Close to the source, you would expect the substances from the reference lists to be found together in relatively high concentrations. As you move away from the source, such a group of substances may be chromatographically separated by sorption processes, for example, or partially degraded by biological activity.
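The comparison between a cluster and the reference lists can be pictured as an overlap score. The sketch below is only an illustration under stated assumptions, not the researchers' method: it assumes the Jaccard index as the overlap measure, and the reference lists, the cluster and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index: shared substances divided by all substances in either set."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_matching_list(cluster, reference_lists):
    """Return (name, score) of the reference list that overlaps most with the cluster."""
    scores = {name: jaccard(cluster, members)
              for name, members in reference_lists.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]


# Hypothetical reference lists of substances with a similar use (assumption).
reference_lists = {
    "herbicides": {"glyphosate", "MCPA", "mecoprop", "bentazone"},
    "PCBs": {"PCB-101", "PCB-138", "PCB-153", "PCB-180"},
}

# Hypothetical cluster found at one monitoring site (assumption).
cluster = {"MCPA", "mecoprop", "bentazone", "caffeine"}

name, score = best_matching_list(cluster, reference_lists)
print(name, round(score, 2))
```

A high score suggests that the cluster shares a use, and possibly a source, with that reference list; a cluster that matches no list well may point to an emission route that the lists do not cover.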


Nine of the 196 clusters had a similar composition at several locations. These included a cluster of metals, sometimes accompanied by PAHs; a cluster of salts and reactive (alkali) metals; herbicides; polychlorinated biphenyls (PCBs) and more. Pronk: ‘You will then see substances with the same properties - such as all kinds of variants of hexa- and pentachlorobiphenyl - appearing in higher concentrations in the water under certain conditions, such as a heavy rain shower.’

Ter Laak adds: ‘Sometimes a particular dataset deviates from the normal pattern, so something must have happened. It’s a kind of reverse engineering, where you first see an anomaly and wonder what happened. Then you form a hypothesis and work out what happened, rather than the other way around.’ According to Pronk, you see something similar in genomics research: first you cluster genes and see what information the genes in the cluster have in common, then you deduce what exactly they do, including unknown genes.

‘The existing datasets contain a lot of known substances, but of course there’s a lot more in the water that we can’t detect,’ says Ter Laak. ‘In fact, we are only measuring a small fraction of all the man-made substances in water.’


An important message that Pronk and Ter Laak want to convey with their research is that there are serious problems with water pollution. ‘You may not be able to solve them immediately, but you can find out where and how they happen,’ says Ter Laak. ‘For example, you can see a whole range of substances that we as a society are pouring into the environment in an almost continuous stream. You also see substances that behave much more erratically in time or space: the release of substances from stirred-up sediments, run-off from rainfall, or a particular pesticide at a particular time of year.’

The ultimate goal is to use all this data to create models that can predict such an influx of micropollutants. ‘For example, we are working on regression models where you stack several factors on top of each other to see if you can explain the detected substances with certain conditions or properties of the substances,’ says Pronk.
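To make the idea of stacking several factors concrete, here is a minimal sketch of a multiple linear regression. It rests on assumptions: the article does not say which regression model the team uses, so ordinary least squares is swapped in here, and the two factors (rainfall and water temperature), the synthetic numbers and the function name `fit_linear` are all hypothetical.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    X: list of rows, each a list of factor values; an intercept column is added.
    Returns the coefficients [intercept, b1, b2, ...].
    """
    rows = [[1.0] + list(r) for r in X]
    n = len(rows[0])
    # Build the augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(n):
            if k != col:
                f = A[k][col] / A[col][col]
                A[k] = [a - f * b for a, b in zip(A[k], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]


# Synthetic example (assumption): concentration = 2 + 0.5 * rainfall - 0.1 * temperature
factors = [[0, 10], [5, 12], [20, 8], [10, 15], [2, 20]]  # [rainfall, temperature]
concentration = [1.0, 3.3, 11.2, 5.5, 1.0]

coefficients = fit_linear(factors, concentration)
print([round(c, 3) for c in coefficients])
```

Each coefficient indicates how much the concentration changes per unit of one factor while the others are held constant, which is one way to read "explaining the detected substances with certain conditions".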

‘An advantage of the “hierarchical” clustering used here is that it is unbiased,’ explains Ter Laak. ‘As an environmental chemist, I automatically look for things I know, so you don’t work in a completely neutral way. You have to look at the clustering first, and only then can you discover what is going on.’ All in all, this research method would allow you to better explain why you see a variation in substance concentrations. The researchers hope their work will help improve water quality.
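As an illustration of why hierarchical clustering needs no prior knowledge of the substances, the sketch below groups substances purely by how similarly their concentrations vary over time. It is a sketch under assumptions, not the published method: the article only says the clustering is hierarchical, so the distance measure (1 minus the Pearson correlation), the average linkage, the cut-off value and the measurement series are all hypothetical choices.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equally long series (assumes neither is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def cluster_substances(series, max_distance):
    """Agglomerative clustering with average linkage.

    series: dict mapping substance name -> concentrations on the same sampling dates.
    Merging stops once the two closest clusters are further apart than max_distance.
    """
    dist = {frozenset((a, b)): 1 - pearson(series[a], series[b])
            for a, b in combinations(series, 2)}
    clusters = [[name] for name in series]

    def linkage(c1, c2):
        # Average distance over all pairs of substances from the two clusters.
        return mean(dist[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[i], clusters[j]) > max_distance:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(c) for c in clusters]


# Hypothetical series: the PCBs peak in the third sample, the salts are diluted.
series = {
    "PCB-138": [1.0, 1.2, 3.0, 1.1],
    "PCB-153": [0.9, 1.1, 2.8, 1.0],
    "chloride": [50.0, 48.0, 30.0, 49.0],
    "sodium": [31.0, 30.0, 19.0, 30.5],
}

result = cluster_substances(series, max_distance=0.5)
print(sorted(result))
```

The algorithm never sees what the substances are; only after the groups have formed would you compare them with known uses or properties, which matches the order of work Ter Laak describes.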

Pronk, T.E. et al. (2024) Environments 11(3), 46, DOI: 10.3390/environments11030046