A Spider that Reads the Whole Web

Diffbot, a Stanford startup, is building an AI-based spider that reads as many pages of the public web as possible and extracts as many facts from those pages as it can. “Like GPT-3, Diffbot’s system learns by vacuuming up vast amounts of human-written text found online. But instead of using that data to train a language model, Diffbot turns what it reads into a series of three-part factoids that relate one thing to another: subject, verb, object.” (MIT Technology Review, 4 September 2020)

Knowledge graphs, which is what this is all about, have been around for a long time. However, they have mostly been built by hand or have covered only certain subject areas. Some years ago, Google started using knowledge graphs too. Instead of giving us a list of links to pages about Spider-Man, the service gives us a set of facts about him drawn from its knowledge graph. But it only does this for its most popular search terms. According to MIT Technology Review, the startup wants to do it for everything. “By fully automating the construction process, Diffbot has been able to build what may be the largest knowledge graph ever.” (MIT Technology Review, 4 September 2020) Diffbot’s AI-based spider reads the web as we read it and sees the same facts that we see. Even if it does not really understand what it sees, we will be amazed at the results.
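To make the idea of three-part factoids concrete, here is a minimal Python sketch of a knowledge graph stored as subject, verb, object triples. It is only an illustration: the `Triple` and `KnowledgeGraph` names and the sample facts are assumptions chosen for this example, and nothing here reflects how Diffbot or Google actually store or extract their data.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One factoid: a subject related to an object by a verb."""
    subject: str
    verb: str
    obj: str


class KnowledgeGraph:
    """Hypothetical toy store that indexes triples by their subject."""

    def __init__(self):
        # Maps a subject to the set of triples known about it.
        self._by_subject = defaultdict(set)

    def add(self, subject, verb, obj):
        # Using a set means a repeated fact is stored only once.
        self._by_subject[subject].add(Triple(subject, verb, obj))

    def facts_about(self, subject):
        # Return the facts in a stable (sorted) order; empty list if unknown.
        return sorted(self._by_subject.get(subject, set()))


# Sample facts typed in by hand for illustration, not extracted from the web.
graph = KnowledgeGraph()
graph.add("Spider-Man", "is alter ego of", "Peter Parker")
graph.add("Spider-Man", "was created by", "Stan Lee")
graph.add("Spider-Man", "was created by", "Steve Ditko")
graph.add("Spider-Man", "was created by", "Stan Lee")  # duplicate, ignored

for fact in graph.facts_about("Spider-Man"):
    print(fact.subject, fact.verb, fact.obj)
```

Looking up a subject returns a set of facts rather than a list of links, which is the behaviour described for the Spider-Man search. The hard part that the sketch leaves out entirely is the automated reading of web pages that produces the triples in the first place.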