
Will the real Semantic Web please stand up

We live in an amazing and unique time. Most of you reading this blog were alive at the birth of the global computer, around 15 years ago. In that time the computer has never been switched off, never been rebooted, and has grown to an almost inconceivable size and complexity. The sheer storage and processing power are almost impossible to calculate. The computer is fed information and programmed by the actions of around a billion users, night and day, evolving at an incredible speed. For example, in the last two years, over 14 million blogs alone have appeared, seemingly with no effort or investment!

But there is something else going on other than computing on a grand scale. A new type of approach to computing is arising, one which fundamentally changes the relationship between the user and the computer. I am talking about an approach based on tapping into the collaborative effort of millions of users, programming software through their everyday actions. The new programs are effectively learning systems that extract training and feedback from users' actions on an unprecedented scale. Fuzziness, statistics and learning over programmatic logic.
The Google spell checker is a great example of this. Google could have sat a bunch of programmers down and coded a spell checker using a dictionary and lots of rules. Doing this in every language under the sun and keeping it current as new words come into being (e.g. blogging) would have been a great effort. Instead, Google uses the actions of its users to programme the spell checking, extracting patterns of behaviour from users retyping misspelled words and feedback from users accepting a suggested spelling correction. Amazon's "people who bought this book also bought these" system is a more limited example.
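To make that concrete, here is a minimal sketch in Python of the general idea of learning corrections from user behaviour. It is not Google's actual algorithm, and the little query log is invented; it just shows how counting the rewrites users actually make can stand in for hand-coded dictionary rules.

    from collections import defaultdict

    # Invented sample log: (query as typed, query the user retyped it as).
    observed_rewrites = [
        ("recieve", "receive"),
        ("recieve", "receive"),
        ("recieve", "recieved"),
        ("bloging", "blogging"),
    ]

    # Count how often each rewrite was observed.
    rewrite_counts = defaultdict(lambda: defaultdict(int))
    for typed, retyped in observed_rewrites:
        rewrite_counts[typed][retyped] += 1

    def suggest(query):
        """Suggest the rewrite users most often made for this query, if any."""
        candidates = rewrite_counts.get(query)
        if not candidates:
            return None
        return max(candidates, key=candidates.get)

    print(suggest("recieve"))  # -> "receive", learned purely from user actions

The more users type and retype, the better the suggestions get, with no dictionary in sight.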
Built on participation between the users and the system, the result is what you might call collaborative intelligence.
It is emergent rather than programmed.
It is interesting to note that this is also the same transition that artificial intelligence went through. It became clear that predicate logic based solutions did not scale well and the field turned to fuzzy logic, statistics and neural networks where systems required training rather than programming.

The other important quality of this approach is scalability. It implicitly scales; in fact, it thrives on scale.
Traditional programmatic approaches, essentially based on logic, have a harder time scaling.

Considering that it really is only in the last few years that hardware costs and online community sizes have enabled experimentation at scale, I am very excited about what the next 10 years will bring in this direction.

So this brings me to the title of this blog. It seems to me that humans are very good at semantics, and that systems based on human-computer collaboration (i.e. the emergent properties of large numbers of users) will be very important in semantics-based systems. You could consider del.icio.us and Flickr and the massive rise of tagging and microformats to be very early examples. If the collaborative approach of del.icio.us could be synthesised with more sophisticated semantic methods such as RDF, then we might really be cooking with gas.
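As a rough illustration of what such a synthesis might look like (the URL, tags and numbers below are entirely made up), you could imagine turning raw del.icio.us-style tag counts into weighted, RDF-ish statements rather than hard facts:

    from collections import Counter

    # Invented example: many users tag the same bookmark, and they don't all agree.
    taggings = ["python", "programming", "python", "scripting", "python", "perl"]

    counts = Counter(taggings)
    total = sum(counts.values())

    # Emit (subject, predicate, object, confidence) instead of bare triples,
    # so downstream consumers can treat popular tags as stronger evidence.
    statements = [
        ("http://example.org/some-page", "dc:subject", tag, counts[tag] / total)
        for tag in counts
    ]

    for statement in sorted(statements, key=lambda s: -s[3]):
        print(statement)  # e.g. ('http://example.org/some-page', 'dc:subject', 'python', 0.5)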

So I conceive of the Semantic Web as including applications built as collaborative, emergent systems.
Herein lies my problem. The Semantic Web as defined by Tim Berners-Lee, and expressed in his paper on the design issues for the Semantic Web, expressly excludes any type of fuzzy system from being a Semantic Web application (see excerpt and comments below). This is because he requires applications to be logically provable and guaranteed, so that first order predicate calculus (FOPC, i.e. predicate logic) is the only logic the Semantic Web admits. The example TBL gives is of a banking application needing to be guaranteed.
I have two main issues with this:

1) Why exclude the Semantic Web from the exciting possibilities of fuzzy and statistical approaches to semantic systems? Can't both be included? A banking application just requires stricter criteria on the statements it can operate on. Applications don't need to be guaranteed to be useful (although I admit banking applications do!!).

2) Will this massively scale? What gives us reason to believe it will? FOPC-based systems have proven difficult to scale in several fields so far. TBL admits that the Semantic Web approach is not very different from previous approaches that failed to scale. The basic point is that FOPC-based systems cannot cope with inconsistency (as TBL points out), and as you scale, keeping consistency in practice becomes harder.

So, what will the Semantic Web be like? I guess in time the real Semantic Web will stand up.

The rest of the blog looks at TBL's Semantic Web design paper in more detail and may not be of great interest to most readers.

First of all, thanks Rick and Ian for persevering with all my questions.

Fuzzy or not has been the main theme behind all my SW blogs to date. Tim Berners-Lee is quite clear: not.
I just don't get why not; certainty is just a special case of fuzziness, so why can't we include both?

We are back again to where I started, "Perfect or Sloppy - RDF, Shirky and Wittgenstein", which was based on the Tim Berners-Lee paper you mentioned, Rick.

This quote has almost the entire point I am trying to make in it. I'll take a few sentences at a time and explain what they mean to me.

"The FOPC inference model is extremely intolerant of inconsistency [i.e. P(x) & NOT (P(X)) -> Q], the semantic web has to tolerate many kinds of inconsistency.

Toleration of inconsistecy can only be done by fuzzy systems. We need a semantic web which will provide guarantees, and about which one can reson with logic. (A fuzzy system might be good for finding a proof -- but then it should be able to go back and justify each deduction logically to produce a proof in the unifying HOL language which anyone can check) Any real SW system will work not by believing anything it reads on the web but by checking the source of any information. (I wish people would learn to do this on the Web as it is!). So in fact, a rule will allow a system to infer things only from statements of a particular form signed by particular keys. Within such a system, an inconsistency is a serious problem, not something to worked around. If my bank says my bank balance is $100 and my computer says it is $200, then we need to figure out the problem. Same with launching missiles, IMHO. The semantic web model is that a URI dereferences to a document which parses to a directed labeled graph of statements. The statements can have URIs as prameters, so they can may statements about documents and about other statements. So you can express trust and reason about it, and limit your information to trusted consistent data."
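For readers who haven't met the bracketed formula before, it is the classical principle of explosion: from a single contradiction, an FOPC reasoner can derive any conclusion Q whatsoever. A sketch of the derivation:

    \begin{align*}
    1.\quad & P(x) \land \lnot P(x)  && \text{(the inconsistent data)} \\
    2.\quad & P(x)                   && \text{(from 1, conjunction elimination)} \\
    3.\quad & \lnot P(x)             && \text{(from 1, conjunction elimination)} \\
    4.\quad & P(x) \lor Q            && \text{(from 2, disjunction introduction)} \\
    5.\quad & Q                      && \text{(from 3 and 4, disjunctive syllogism)}
    \end{align*}

Because Q can be anything at all, one bad statement poisons every conclusion the system draws, which is exactly why TBL treats inconsistency as fatal rather than as noise to be averaged away.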

1) "Toleration of inconsistency can only be done by fuzzy systems. We need a semantic web which will provide guarantees, and about which one can reason with logic."
Here TBL specifically excludes fuzzy approaches from the Semantic Web. By extension, other statistical and learning-based approaches to knowledge systems are also excluded. The reason given is that being guaranteed and provable is an absolute requirement: if your app is not guaranteed, it is not a Semantic Web app. This immediately limits the concept of the Semantic Web to what is computable by logic rather than what is usefully computable by any means.
Sure, banking applications do need to be guaranteed, so they should use rules that only operate on provable, trusted statements. But there are loads of applications of semantics where usefulness, rather than guarantees, is the goal.
I do not see why it need be one or the other; you just have stricter requirements for proof in a banking app than in a fuzzy app. See Semantic SuperPositions for thoughts on a semantic web that includes fuzziness.

Considering FOPC approaches have been largely discredited in the field of AI and replaced by fuzziness, this would seem a risky limitation to impose.

2) "Any real SW system will work not by believing anything it reads on the web but by checking the source of any information. (I wish people would learn to do this on the Web as it is!) So in fact, a rule will allow a system to infer things only from statements of a particular form signed by particular keys. Within such a system, an inconsistency is a serious problem, not something to be worked around."
The necessary consequence of 1) is, as TBL states here, that in any SW system an inconsistency is a serious problem. Because of the guarantee requirement, it isn't even enough that the data is accidentally consistent; it must be logically consistent, i.e. you should only encounter an inconsistency if there is a programming fault or corruption, and standard user action should not be a factor. That is, the statements an SW app is using must be guaranteed consistent.
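Here is a hedged sketch of that "only infer from statements signed by particular keys" idea. The sources, keys and figures are invented, and a real system would check cryptographic signatures rather than a string label:

    # Only statements from trusted, signed sources are admitted to the reasoner.
    TRUSTED_KEYS = {"my-bank-key"}

    statements = [
        {"claim": ("account:123", "balance", 100), "signed_by": "my-bank-key"},
        {"claim": ("account:123", "balance", 200), "signed_by": "random-website"},
    ]

    trusted = [s["claim"] for s in statements if s["signed_by"] in TRUSTED_KEYS]

    # Within the trusted set, any disagreement is a fault to investigate, not noise.
    balances = {value for (subject, prop, value) in trusted if prop == "balance"}
    if len(balances) > 1:
        raise ValueError("Inconsistent trusted statements: %s" % balances)
    print(balances.pop())  # -> 100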

This means Semantic Web applications are quite fragile: the larger the scale, the harder it is to maintain consistency in practice. Statistical approaches work the opposite way: the larger the scale, the better they work.

Any SW application therefore requires there to be only one version of the truth, i.e. it can only work with consistent statements. However, there are many things we wish to describe where there is no one version of the truth.
Here is the rub: this is a result only of the requirement to be logically guaranteed. There are many computational approaches that can operate on inconsistent statements: fuzzy systems, statistical approaches, neural networks. These can mine huge value out of those statements. None of that is possible with Semantic Web applications (as defined above); all those rich patterns must be collapsed into a single consistent version of the truth before the application can operate on them. The Google approach to spell checking is a great example of using such statistical approaches, rather than logic, to programme the spell checker.


The requirement for consistency is very tough in practice because humans are in the loop of data entry. Here we run straight into the fact that RDF is designed to allow multiple agencies to make statements about the same thing. Even if two agencies are using the same URI and the same definition of a particular property, when users come to enter data and have to make classification decisions based on that URI's description, they will not classify the same thing in exactly the same way. The URI is not an authority; it cannot guarantee consistency between agencies. For example, you cannot show two copies of Harry Potter to the Editions URI and ask it whether they are different editions or the same. People make that call according to their own interpretation of the description of the concept.
Reversing that around, if you receive two statements about the number of editions that exist for a Harry Potter book, and one states 1 edition while the other states 2 editions, the only way to arbitrate between them is to get the actual books out and examine them against your own interpretation of the URI definition.
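To put the editions example in concrete terms (the identifiers and counts below are invented), here is the difference between a strict consumer, which must stop and arbitrate, and a fuzzy one, which can simply keep the disagreement as a distribution of belief:

    from collections import Counter

    # Two agencies make conflicting statements about the same book.
    claims = [
        ("library-A", "book:harry-potter-1", "numberOfEditions", 1),
        ("library-B", "book:harry-potter-1", "numberOfEditions", 2),
    ]

    values = [value for (_source, _subject, _prop, value) in claims]

    # Strict (FOPC-style) consumer: any disagreement halts processing.
    if len(set(values)) > 1:
        print("Inconsistent: a guaranteed SW app must stop and arbitrate")

    # Fuzzy consumer: keep all the evidence as a belief distribution.
    counts = Counter(values)
    total = sum(counts.values())
    beliefs = {value: count / total for value, count in counts.items()}
    print(beliefs)  # -> {1: 0.5, 2: 0.5}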
What I have described above is the fact that single authorities only make sense for certain classes of problem, i.e. where there is only one version of the truth. They make perfect sense for bank accounts. In the library domain, however, each library has an equal right to make statements about a book whilst cataloguing it, so there is no concept of one authority. Similarly, who is the authority that decides whether a photo shows a smiling face or a sad face?

The result of all that is that to guarantee consistency, for a particular SW system, there can be only one authority for statements, or else inconsistency will arise from user actions. This allows any conflict to be resolved by asking the authority to decree. Note also that it is not good enough for statements merely to avoid conflicting with published statements from the authority (the authority may not have published all possible statements); statements must actually agree with statements made by the authority.

TBL also says

"

A semantic web is not an exact rerun of a previous failed experiment

Other concerns at this point are raised about the relationship to Knowledge representation systems: has this not been tried before with projects such as KIFand cyc? The answer is yes, it has, more or less, and such systems have been developed a long way. They should feed the semantic Web with design experience and the Semantic Web may provide a source of data for reasoning engines developed in similar projects.

Many KR systems had a problem merging or interrelating two separate knowledge bases, as the model was that any concept had one and only one place in a tree of knowledge. They therefore did not scale, or pass the test of independent invention. [see evolvability]. The RDF world, by contrast is designed for this in mind, and the retrospective documentation of relationships between originally independent concepts."

3) "They therefore did not scale, or pass the test of independent invention."
For any SW app to have guaranteed consistency, independent invention is not possible, because you would need to force all statements from two separate agencies to be the same, and that means they are not independent at all, i.e. one agency is not free to act independently of another, because that would cause inconsistency.
So it rather seems that, for all intents and purposes, independent descriptions are excluded from any particular SW app by the requirement to achieve consistency. Exactly how, then, does a Semantic Web app differ from those failed experiments?

 

To sum up, I can't understand why the Semantic Web (at least as described by TBL) should exclude any approach based on fuzziness, statistics and inconsistency. The requirement of consistency, when taking statements from different systems, cannot be met, because humans cannot be made to agree on classification statements (whatever training or manuals you give them) and will therefore make inconsistent statements through their use of the computer systems. Whilst RDF is free to describe all the variety in the world, a Semantic Web application can only make use of the tiniest portion of it.

From some of the comments I have received, clearly some people agree with the TBL vision and others don't.
In the end I guess it doesn't really matter. People will use RDF to do cool things and call them semantic apps even if they don't accord with TBL's FOPC requirement for proof. I do think it is at the basis of a lot of scepticism from outside the Semantic Web community, though, given the spectacular failure of FOPC to scale in previous attempts by the AI and KR communities. It might be an idea to present this stuff really clearly, to either face up to this criticism or prove it false.

I personally have had enough of this topic now and am going to think about other things for a while :-)

Thanks to all those who have contributed to the discussion. I'm sure there are lots of people out there who will disagree with things I have said above. Just goes to show how hard it is to get people to share the same concept of things; the world is fuzzy after all.

Posted on Wednesday, August 10, 2005 at 06:27AM by Justin Leavesley | 5 Comments

Reader Comments (5)

Hi Justin,

I appreciate the energy that you have put into your posts. You obviously have a lot to offer in advocating fuzzy processing on the web.

With regard to the TBL paper I fear that you have misunderstood the context. All your TBL quotes come from a document linked as "What the semantic Web isn't but can represent" from a section titled "RDF is not an Inference system". It is poorly edited but TBL is in a dialog with an imaginary friend who mistakenly thinks that he's advocating an inferencing system and he tries to explain how the semantic web is different.

"... the semantic web has to tolerate many kinds of inconsistency." His frame for the problem is to track down the inconsistency and resolve it insofar as possible with proof. Understand that when he says proof he's basically thinking of somewhat longer than usual queries. That's the kind of problem and solution that RDF was designed for - little bites of truth linked together. He's not rejecting fuzzy systems doctrinally, it's just not the approach he's talking about.

On the parent page "Semantic Web roadmap" http://www.w3.org/DesignIssues/Semantic.html he concludes with a vision of combining google-type search (fuzzy-ish) with "proofs in a certain number of cases of very real impact". Also, he says of the RDF logic layer: "The applications of RDF at this level are basically limited only by the imagination." I take that quite literally.

Good luck,
Rick
August 11, 2005 | Unregistered Commenter Rick Thomas
Hi Rick,

Thanks. If the vision of Semantic Web applications isn't limited to FOPC then I've no beef.

It would certainly help me, and probably a lot of others (Clay Shirky???), if there were perhaps a clearer presentation of this stuff. Unless I have missed some section of the W3C site (or is that even the correct site!).
August 11, 2005 | Unregistered Commenter Justin Leavesley
I ran across this workshop description: "Probabilistic, Logical and Relational Learning - Towards a Synthesis" http://www.dagstuhl.de/05051/

Intriguing, it sounds like an overview of the current science. I'm guessing that most applications are still experimental; a little googling turned up most hits in bioinformatics. It's probably still early for developer-level overviews, that is, news we can use.
August 11, 2005 | Unregistered Commenter Rick Thomas
Here's an interesting list of papers from the organizer of that workshop
http://www.cs.umd.edu/~getoor/publications.html

The gist is to do statistics on links in order to find entities in a graph. Research but with web applications in mind.
August 11, 2005 | Unregistered Commenter Rick Thomas
Thanks Rick,
I look forward to reading these.
August 11, 2005 | Unregistered Commenter Justin Leavesley
