One of the most enthusiastic discussions in Web 2.0 circles has been about the semantic web: is it possible to create better links to and from web pages, that is, links based on the meaning of the words? The perfect web page would take us to the information we need in a single click. Why are such links so difficult to add from web content today, and how close are we to realising the dream of automatic links to relevant content?
Linking: what HTML can and can’t do
Many people complain that HTML, the language of the web, is inadequate for creating good links. After all, HTML has only a primitive “meta” field for adding metadata, and the content of this field is optional – you don’t have to put anything in it at all. Frequently, it is filled with irrelevant keywords in the hope that search engines will rank the page more prominently. Moreover, HTML links can’t subdivide a page in any meaningful way – you can link to a page, or to an anchor its author has provided, but not to an arbitrary article or picture within the page, using HTML alone. Google is an example of what can be done using HTML linking alone: reasonable for locating websites, but incapable of disambiguating search terms, and no good at all for finding content within a page.
Yet, of course, many present-day news and magazine websites have sophisticated links that could never have been achieved using HTML alone. What’s more, often these links are created entirely automatically. How is this done?
Consider a typical online news site, such as BBC News Online. When the journalist creates an article for online publication, he or she creates links to related articles at the same time, by searching the publication’s archive or databases for related recent stories. Hence, an article about unruly students includes “see also” links to other recent stories on education and classroom behaviour.
To create such links by hand takes time, perhaps three or four minutes per link. This may not seem excessive, but any reasonable news-based site will have several links per article, which can add substantially to the cost of creating a story. An even bigger problem arises when a newspaper or magazine publisher loads archived articles from back issues onto the web: users will expect these articles to include links to other content. It isn’t feasible to create links from legacy content if each article added to the site requires an extra 15 to 30 minutes of work. Plus, of course, journalists cannot be expected to remember, or even be aware of, all the possible related stories in the archive.
In fact it is perfectly possible to create adequate links automatically, by holding content in a digital archive and using the power of full-text searching. After all, this is most probably how the journalist created the links in the first place, so it is logical to set up the archive to carry out this search directly. Many content-based websites generate such links entirely automatically and produce a result that is (if configured properly) as good as, and certainly more thorough than, anything a human journalist can provide.
For example, the KHL website contains ten construction-based magazines. When the site was launched early in 2008, over 15,000 articles from several years’ worth of back issues of the print magazines were captured and loaded onto the site. It would have been a gigantic task to try to add links to all these articles by hand. Instead, the text of the headline (which is tagged separately from the body text of the story) is automatically submitted as a search query against the database whenever an article is retrieved from the archive for display on the website, and the highest-ranked hits are shown in a “related articles” box beside the story.
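A minimal sketch of this kind of headline-driven lookup is shown below. The schema, the sample stories and the word-length filter are all invented for illustration; KHL’s actual system and the VYRE platform behind it will differ.

```python
import sqlite3

# Sketch only: a tiny full-text index over an article archive. SQLite's FTS5
# extension is used here purely for illustration (most Python builds include
# it); a production site would use its own search engine and schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE articles USING fts5(id UNINDEXED, headline, body)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("1", "Terex reports record crane sales", "..."),
        ("2", "Crane safety rules tightened", "..."),
        ("3", "New excavator range launched", "..."),
    ],
)

def related_articles(headline, limit=5):
    """Use the headline of the article being viewed as a full-text query
    and return the best-matching stories from the archive."""
    # Drop very short words and OR the rest together as the search query.
    query = " OR ".join(word for word in headline.split() if len(word) > 3)
    rows = conn.execute(
        "SELECT id, headline FROM articles WHERE articles MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()

# The headline of the story being displayed becomes the search.
print(related_articles("Terex launches new crane"))
```

The same idea works with any full-text engine: the only input needed is text the archive already holds, which is why the links can be generated with no human intervention.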
Shehzada Munir, project manager for the KHL project at VYRE (who developed the KHL website) comments: “All the articles in the system are tagged using taxonomy. Using this method, the system automatically picks up all the articles that are tagged as same categories / keywords, as well as checking they are approved and within active publishing dates. In this way, a list of related articles is accurately generated next to every news article on the site without any user intervention.”
So the limitations of HTML linking are circumvented, because the website can create links to other articles in the system database. If a word appears in the story headline (for example, the company “Terex”), it is reasonable to make it a search term for other articles. Such a system is widely used for online versions of print publications. In a similar initiative, the British Medical Journal takes the title of the specific item of content being read and uses those words as a search term to display other, related BMJ Group content. In effect, the site provides personalised links to further content with every search that is carried out – a very Web 2.0-type initiative – and this, comments BMJ’s head of information services Phil Caisley, has generated a surprising amount of click-through traffic.
Personalised content and advertising
Most advertising is wasted because the people who see it will never be interested in buying that product or service. In most advertising contexts this is unavoidable – a newspaper or magazine advert is seen by the entire readership, not just the targeted subset, whether or not they want to read it. Advertisers would ideally like to personalise advertising so that it is displayed only to likely buyers, and online delivery can make that happen.
The British Medical Journal targets both display advertising and job adverts by customising them to user preferences, using an ontology (classification) file that holds relationships between concepts. When users retrieve a specific piece of content, their selection indicates a preference. Knowing which item the user has chosen, the system can derive its topics and then infer one or more medical specialities from those topics. Explained Phil Caisley: “The inferred medical speciality is passed as a parameter to the display advertising system, within which we've pre-categorised relevant ad campaigns with their target medical specialities. Our ad sales team can therefore sell ad campaign space on this inferred speciality basis. We also do something similar matching our pre-categorised careers content to job ads.”
In a similar initiative, the BMJ displays related stories linked to any story the user clicks on. How is it done? The BMJ uses an ontology. By clicking on a story, users have selected a topic – one of the 30,000+ terms in the BMJ’s ontology. Each term is mapped to a corresponding internal topic list (1,000+ entries), which in turn is mapped to the list of medical specialities (70+), and the resulting terms are used as search terms when the story is clicked. Hence the system generates new search suggestions with each user click.
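A rough sketch of that kind of chained lookup follows, with invented terms, topics and specialities standing in for the BMJ’s real ontology file.

```python
# Hypothetical sketch of a term -> topic -> speciality chain of the kind the
# BMJ describes. Every term, topic and speciality below is invented for
# illustration; the real ontology holds tens of thousands of terms.
ONTOLOGY_TERM_TO_TOPIC = {
    "myocardial infarction": "ischaemic heart disease",
    "angina": "ischaemic heart disease",
    "asthma": "airway disease",
}

TOPIC_TO_SPECIALITY = {
    "ischaemic heart disease": "cardiology",
    "airway disease": "respiratory medicine",
}

def infer_specialities(article_terms):
    """Map the ontology terms attached to an article to internal topics,
    then to the medical specialities used to choose adverts and related
    content."""
    topics = {ONTOLOGY_TERM_TO_TOPIC[t] for t in article_terms
              if t in ONTOLOGY_TERM_TO_TOPIC}
    return {TOPIC_TO_SPECIALITY[t] for t in topics
            if t in TOPIC_TO_SPECIALITY}

# An article tagged with "angina" would attract cardiology ad campaigns
# and cardiology-related stories.
print(infer_specialities(["angina"]))  # {'cardiology'}
```

The point of the chain is that neither the journalist nor the advertiser has to know about the 30,000+ low-level terms: both work at the level of 70-odd specialities, and the mappings do the rest.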
Beyond HTML with RDF
If using HTML alone can achieve such good results, why look further? The tantalising long-term goal is to try to disambiguate terms in an article that have more than one meaning, and then link to the relevant meaning only.
One initiative in this direction is the use of RDF. RDF (Resource Description Framework) has been around for several years, typically in highly complex applications, but is itself a very simple idea – just a machine-readable link between a subject, a predicate (defining the piece of data we are giving a value to), and an object (a value for the predicate). For example, the sentence “John has red hair” can be tagged as an RDF “triple” (so-called because every RDF statement has three parts) as “John [subject] – has hair [predicate] – red [object]”.
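As a minimal illustration, here is that single triple written with the Python rdflib library; the example.org namespace and the hasHairColour property name are made up for the example.

```python
from rdflib import Graph, Literal, Namespace

# An invented namespace for the example; real data would use a published
# vocabulary so that other systems recognise the property.
EX = Namespace("http://example.org/")

g = Graph()
g.add((
    EX.John,            # subject
    EX.hasHairColour,   # predicate
    Literal("red"),     # object
))

# Serialise the one-statement graph as Turtle to show the triple structure.
print(g.serialize(format="turtle"))
```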
One current development beyond HTML is to tag individual words and phrases on a web page using simplified forms of RDF such as microformats. These enable specific kinds of information to be tagged and then extracted from the content; for example, Wikipedia uses a “coord” tag to hold latitude / longitude details. Other tags state what an individual item on a page is about, what data type it is (eg a date) or what kind of thing it is (eg a place or a person). Once the content has been tagged in this way, users can link very precisely to other content items. But microformats have two problems: first, they are not a fully accepted and widely implemented extension to HTML, so there are millions of websites that will not understand the coding. Secondly, all these codes have to be added manually, and that is a time-consuming business.
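A small sketch of how such tagged data can then be extracted mechanically is shown below; the HTML fragment follows the “geo” microformat convention for coordinates and is invented for illustration.

```python
from bs4 import BeautifulSoup

# Invented snippet: the latitude / longitude are wrapped in the class names
# used by the "geo" microformat, so a machine can pick them out of the prose.
html = """
<p>The conference was held in <span class="geo">
  <span class="latitude">51.5074</span>,
  <span class="longitude">-0.1278</span>
</span> (central London).</p>
"""

soup = BeautifulSoup(html, "html.parser")
for geo in soup.find_all(class_="geo"):
    lat = geo.find(class_="latitude").get_text(strip=True)
    lon = geo.find(class_="longitude").get_text(strip=True)
    print(f"Found coordinates: {lat}, {lon}")
```

The extraction is trivial once the tags exist; the expensive part, as the article notes, is adding the tags in the first place.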
Are there ways in which RDF links can be added automatically? Yes – RDF triples are increasingly used to link content in ways that HTML alone could never do. Take Wikipedia as an example of a rich source of content information; it holds all kinds of information that it would be very useful to link to, for example the name and current population of every capital city. This kind of information in Wikipedia is converted entirely automatically, on a regular basis, to a set of RDF triples by a project called DBpedia (dbpedia.org), which describes itself as “a community effort to extract structured information from Wikipedia”. DBpedia now has 274 million triples, that is, 274 million items of machine-readable factual information that websites can link to.
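To give a feel for what those triples make possible, here is a sketch of a SPARQL query against DBpedia using the Python SPARQLWrapper library. The dbo:capital and dbo:populationTotal property names reflect DBpedia’s ontology as generally documented, but the exact query should be treated as illustrative rather than definitive.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# DBpedia exposes its triples through a public SPARQL endpoint.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?country ?capital ?population WHERE {
        ?country dbo:capital ?capital .
        ?capital dbo:populationTotal ?population .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Each binding is one machine-readable fact extracted from Wikipedia.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["capital"]["value"], row["population"]["value"])
```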
To see DBpedia in action, consider a recent research project that added automatic links to BBC news stories. Rob Lee of Rattle Central was commissioned by the BBC to create additional links from news articles on the BBC online site; the resulting 2008 project (called Muddy) automatically generates links from news articles to Wikipedia entries via DBpedia.
The results look very promising, and are a genuine step forward from whole-article-to-article links, since the links here start from specific terms within an article.
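This is not the Muddy project’s actual code, but a hypothetical sketch of the general idea: once specific terms have been matched to DBpedia / Wikipedia resources, occurrences of those terms in the article text can be turned into links automatically.

```python
import re

# Invented mapping for illustration; in practice the term-to-resource
# matches would come from DBpedia rather than a hand-written dictionary.
TERM_TO_WIKIPEDIA = {
    "Terex": "https://en.wikipedia.org/wiki/Terex",
    "London": "https://en.wikipedia.org/wiki/London",
}

def add_term_links(article_html):
    """Wrap the first occurrence of each known term in a link to its
    Wikipedia entry."""
    for term, url in TERM_TO_WIKIPEDIA.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        article_html = re.sub(
            pattern, f'<a href="{url}">{term}</a>', article_html, count=1)
    return article_html

print(add_term_links("Terex has opened a new office in London."))
```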
It’s clear that tools readily available to web developers today can provide a wide range of links between articles, and even from individual words within articles. It may not yet be fully semantic linking, but it is better than anything we have had before – and, what’s more, it is done without human intervention.