Databases to machine learning: the stage is set for mathematicians

Ed Baker
Published in Integrative
Jan 11, 2022

In my Personal Introduction to this series of essays I quoted the physicist Murray Gell-Mann:

One obstacle to integration that keeps obtruding itself is the line separating those who are comfortable with the use of mathematics from those who are not. I was fortunate enough to be exposed to quantitative thinking from an early age.

Murray Gell-Mann (The Quark and the Jaguar)

Reflecting on my own interdisciplinary experiences, the shared mathematical background in physics, biodiversity informatics, bioacoustics, electronic engineering and even, these days, taxonomy has never been more obvious. A rigorous, mathematical way of thinking can provide links and insights between these fields.

Nature is written in mathematical language.

Galileo Galilei

Relational databases are intrinsically linked to the mathematics of sets. It is in part this well-understood foundation that means we should be very careful when choosing to adopt a different database model, as many of the ‘cool kids’ did at the start of the last decade. And now? A move back to relational.

It’s time for us to admit what we have all known is true for a long time: NoSQL is the wrong tool for many of the modern application use cases, and it’s time that we move on.

NoSQL Databases: Why You Don’t Need Them

This is not to say that NoSQL databases are always the wrong choice, or that some of their proposed advantages did not force improvements in the scalability of relational database software. It is just that they were the next big thing, and the ‘cool kids’ wanted their time to play with the shiny new technology, whether that was appropriate or not.

In some ways the simplicity of the relational model, its relation to the simplicity of mathematical sets and its clearly defined schema are what make it attractive. The process of normalisation makes the data tidy, manageable, in some ways tangible; it is certainly what makes it efficient.
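To make the link between relations and sets concrete, here is a minimal sketch in Python. The recordings data and all the names in it are invented for illustration, not taken from a real database. It treats a table as a set of tuples, normalises it into two smaller relations, and then checks that a join recovers the original.

```python
# Hypothetical, made-up data: (recording_id, species, family).
# The family is repeated for every recording of the same species.
recordings_flat = {
    (1, "Gryllus campestris", "Gryllidae"),
    (2, "Gryllus campestris", "Gryllidae"),
    (3, "Tettigonia viridissima", "Tettigoniidae"),
}

# Normalise by projecting onto two smaller relations.
# Because these are sets, duplicate rows collapse automatically.
recordings = {(rid, species) for rid, species, _ in recordings_flat}
species_family = {(species, family) for _, species, family in recordings_flat}

# A natural join on species, written as a set comprehension.
rejoined = {
    (rid, species, family)
    for rid, species in recordings
    for joined_species, family in species_family
    if joined_species == species
}

# Nothing was lost: the join reproduces the original relation.
assert rejoined == recordings_flat
```

Each species-to-family fact is now stored once rather than once per recording, which is the tidiness that normalisation buys. A real database adds keys, constraints and indexes on top, but the underlying operations are still projections and joins on sets.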

Schemas are ways of arranging data into knowledge. Often the more compact we can make the schema, the closer we are to understanding the problem, the easier the data become to understand, to compute with. Schemas are inherently mathematical.

Grappling with how to create appropriate schemas is an important part of science: just think how the periodic table and the standard model have influenced chemistry and particle physics. Or consider the unhelpful mess that is the Dewey decimal system. Workable schemas in biology are harder: we are often trying to extract insights from less complete, yet more complex, data. Sometimes it is hard to know when to even begin (although I have made an initial effort in a small area: Standardisation of bioacoustic terminology for insects).

Two dangers face the student seeking to rationalize and codify a terminology that has grown up empirically and that is beginning to differentiate regionally or according to faculty or in other ways — as must always tend to happen. One danger is that of legislating prematurely and clumsily for hypothetical future requirements; the other is a too easy-going and long-sustained attitude of laissez-faire arising from wishing to let the mud settle before trying to penetrate the shadows of often chaotic and obscure usages. If the former danger must always be borne in mind, the latter is more insidious; while we wait for the mud to settle, divergence may be increasing, and we may be faced with the need to cure what we might have prevented.

Broughton (1963) Acoustic Behavior of Animals

This quote reminds me both of the difference between relational and the various relaxed-schema forms of databases, and the inherent complexities of organising information (particularly when it relates to organisms and their long histories of evolution).

The coming age of machine learning is on the one hand truly exciting, yet on the other nothing but statistics performed quickly by machines. These days I sometimes feel we would be better off listening to the statisticians rather than the ‘cool kids’ climbing on the next bandwagon. Machine learning is already abundant. Doing machine learning well, at scale, requires some serious thought when the systems are as schematically challenging as organisms, ecosystems and the biosphere.

As a side note, there are some truly wonderful and amazing machine learning research and projects. There is also a lot of hype, which, if NoSQL is a precedent, might lead to a lot of backtracking.

The stage is set for mathematical thinking. Almost everywhere.


Ed Baker
Integrative

Bioacoustics, biology, technology, biodiversity informatics http://linktr.ee/edwbaker