I read Samuel Arbesman’s “The Half-Life of Facts” yesterday.
Good book, great even. I’ll be reading more by Arbesman.
One section got me thinking about LLMs: chapter 6, titled “Hidden Knowledge”.
It’s about breakthroughs sitting in plain sight, hidden in public knowledge.
For example, disparate facts spread across fields that need to be unified.
“Hidden knowledge takes many forms. At its most basic level hidden knowledge can consist of pieces of information that are unknown, or are known only to a few, and, for all practical purposes, still need to be revealed. Other times hidden knowledge includes facts that are part of undiscovered public knowledge, when bits of knowledge need to be connected to other pieces of information in order to yield new facts. Knowledge can be hidden in all sorts of ways, and new facts can only be created if this knowledge is recognized and exploited.”
He refers to the process of finding these things as “fact excavation”. Cute!
Everything is digital now, so there’s a great opportunity to automate this process.
“DIGGING up hidden knowledge is now far from an impossibility, or even from being solely the domain of the specialist; it has become eminently possible and easy. Knowledge doesn’t get lost or destroyed any longer, and that seems to have happened even less often than we used to believe. Facts are now commonly digitized, and are ripe for being combined and turned into new facts. We are in a golden age of revealing hidden knowledge.”
He then touches on many ways to attack hidden knowledge, e.g. search engines, incentive prizes, meta-analysis, genetic programming, etc.
Surely this is a dream use case for LLMs, and people who have been thinking along these lines are all over it?
The book was published in 2012 (and written before that), so he says:
“WE are not yet at the stage where we can loose computers upon the stores of human knowledge only to return a week later with discoveries that would supplant those of Einstein or Newton in our scientific pantheon.”
We might be now… We have LLMs, scaffolding around LLMs, and almost-working agents.
The challenge is knowing what to ask, and to a lesser degree, how to ask.
It seems the field is called “literature-based discovery”, or LBD, and the father of the field is Don Swanson.
Literature-based discovery (LBD), also called literature-related discovery (LRD) is a form of knowledge extraction and automated hypothesis generation that uses papers and other academic publications (the “literature”) to find new relationships between existing knowledge (the “discovery”). Literature-based discovery aims to discover new knowledge by connecting information which have been explicitly stated in literature to deduce connections which have not been explicitly stated.
– Literature-based discovery, Wikipedia.
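Swanson’s signature move is usually described as the ABC model: one literature says A affects B, a second literature says B affects C, and no paper connects A and C, so the A-to-C link becomes a candidate hypothesis. His famous 1986 result had exactly that shape: fish oil → blood viscosity → Raynaud’s syndrome. Here’s a minimal co-occurrence sketch of the idea; the toy corpus and function names are mine, not from any real LBD toolkit:

```python
from collections import defaultdict

# Toy corpus: each "paper" reduced to the set of concepts it mentions.
# Shaped after Swanson's fish-oil example: one literature ties fish oil
# to blood viscosity / platelet aggregation, another ties those to
# Raynaud's syndrome, and no single paper ties fish oil to Raynaud's.
papers = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynauds syndrome"},
    {"platelet aggregation", "raynauds syndrome"},
    {"magnesium", "migraine"},  # unrelated literature, should not surface
]

def abc_candidates(papers, a):
    """Open-discovery ABC linking: collect B terms co-occurring with A,
    then C terms co-occurring with those Bs, keeping only Cs that never
    co-occur with A itself. Returns {C: bridging B terms}."""
    cooc = defaultdict(set)
    for concepts in papers:
        for term in concepts:
            cooc[term] |= concepts - {term}
    bs = cooc[a]
    hypotheses = defaultdict(set)
    for b in bs:
        for c in cooc[b] - bs - {a}:
            if a not in cooc[c]:  # A and C share no paper: candidate link
                hypotheses[c].add(b)
    return dict(hypotheses)

print(abc_candidates(papers, "fish oil"))
# -> {'raynauds syndrome': {'blood viscosity', 'platelet aggregation'}}
```

Real systems (Swanson’s own Arrowsmith, and later pipelines built on semantic predications) swap the toy sets for MeSH terms and add ranking, but the skeleton is this graph walk. The LLM twist would be letting a model propose and filter the B and C terms instead of relying on raw co-occurrence.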
I know the labs are seeking more than this, e.g. automated scientific discovery, but surely literature-based discovery is simpler and available right now.
Is it happening? Who is gathering examples?
I see LBD mentioned in “How artificial intelligence can revolutionise science” and discussed sceptically here.
Two areas in particular look promising. The first is “literature-based discovery” (LBD), which involves analysing existing scientific literature, using ChatGPT-style language analysis, to look for new hypotheses, connections or ideas that humans may have missed. LBD is showing promise in identifying new experiments to try—and even suggesting potential research collaborators. This could stimulate interdisciplinary work and foster innovation at the boundaries between fields. LBD systems can also identify “blind spots” in a given field, and even predict future discoveries and who will make them.
– How artificial intelligence can revolutionise science
Nice! Good, the kids are on it, right?
I see “Implementing Literature-based Discovery (LBD) with ChatGPT”.
The study specifically examines the effectiveness of these models in autonomously replicating well-established medical correlations and generating potentially novel hypotheses.
meh.
This benchmark suite might help test the capability: “ResearchBench: Benchmarking LLMs in Scientific Discovery via Inspiration-Based Task Decomposition”. They make nice claims:
Our evaluation reveals that LLMs perform well in retrieving inspirations, an out-of-distribution task, suggesting their ability to surface novel knowledge associations. This positions LLMs as “research hypothesis mines”, capable of facilitating automated scientific discovery by generating innovative hypotheses at scale with minimal human intervention.
I see “Leveraging Large Language Models for Enhancing Literature-Based Discovery”.
The exponential growth of biomedical literature necessitates advanced methods for Literature-Based Discovery (LBD) to uncover hidden, meaningful relationships and generate novel hypotheses.
meh.
Go harder!
I want headlines saying “we found something cool and novel-ish and we did it via LLMs”.
Kind of like the recent matrix multiplication result from AlphaEvolve. But that was different: it was not doing LBD, it was LLM-enhanced objective-function optimization.
The examples of LBD in the book are good, e.g. highly specific problems in medicine. But surely, one could create lists of thousands of such problems and throw each at an LLM workflow to “investigate”?
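Concretely, I’m imagining a loop like the sketch below. The call_llm stub, the prompt, and the problem list are all hypothetical placeholders of mine, not any particular product’s API; swap in whatever client and triage step you like:

```python
import json

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real model call (OpenAI, Anthropic,
    # a local model, ...). This echo stub just lets the loop run.
    return f"[model output for: {prompt.splitlines()[1]}]"

PROMPT = """You are doing literature-based discovery.
Problem: {problem}
Find findings from two or more disjoint literatures that, taken
together, suggest a testable hypothesis about this problem. Name the
literatures you are bridging and state the hypothesis in one sentence."""

def investigate(problems: list[str]) -> list[dict]:
    """Throw each problem at the model and collect candidate hypotheses
    for human triage; novelty checks against the literature come after."""
    return [
        {"problem": p, "hypothesis": call_llm(PROMPT.format(problem=p))}
        for p in problems
    ]

problems = [
    "treatment-resistant migraine with no clear trigger",
    "slow wound healing in diabetic patients",
    # ... thousands more, harvested from reviews' "open questions" sections
]
print(json.dumps(investigate(problems), indent=2))
```

The interesting engineering is all in the triage at the end: scoring the outputs for novelty against the existing literature before a human ever reads them.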
Are people doing this? Where? Who’s talking about it?
Maybe I need to do a deep research run on this question :)
Here’s a cool pic from Grok 3, inspired by this piece: