semantic wikipedia

a wiki is where data goes to die

I know, a phrase usually reserved for email

It seems strange to read that, doesn’t it? But read on for my take.

A wiki is great. People can add information, contribute their know-how, edit it, create a knowledge base, share the information with the rest of the world… full of people. But for computers, it’s a fairly complicated dead end. Most wikis (I’m sure there may be semantic ones out there, but I want to write this first and then see if I’m in alignment or not) are designed, rightfully so, with people in mind. People can look at a page and understand what’s there, scan it, get the information they want, and be on their merry way.

But for a machine, it’s not quite so easy. And by machine, I usually mean a programmer trying to figure out a clean and nice way to leverage all that information stuck in a wiki in a meaningful manner for their own application (and therefore their own users). It may be for a mobile app that wants to show the information and pictures about musical instruments, a web app that shows the government types of various countries, or even a simple inline reference link on a blog to help users get a quick grasp of a topic without leaving the page or having to expound on it right in the post itself. Most wikis (let me pick on Wikipedia) ignore the potential value of having great semantic markup embedded in their articles so that machines can read that information, consume the data, and help spread knowledge.

Right now, all that can be done is to link to the relevant article in Wikipedia. If we wanted to just extract a short primer text about the TV show Futurama, we can’t. We’d need to parse the whole article and then try to guess where that information is located. That’s usually a bit too much for a developer who just wants to leverage the system, not architect a whole search and parsing system for natural language.

semantically dataverting a wiki

dataverting… I just made that up

Perhaps a useful definition of dataverting would be: To convert unstructured data into semantically accessible data.

So what can be done to datavert something like Wikipedia? How about adding some structure in the form of a template? I know, people will start to scream that it’s destroying the whole concept of a wiki and free-form edits and so on. But really, it’s not that bad, and it’s optional, and in the end it’ll be completely transparent to the end user (if you wish).

Each topic may have a basic template. Perhaps for a TV show, it’ll be something like title, broadcast dates, episode lists and so on. I’m sure there’s much more. Just like any other extensible markup language, you should always be able to add more fields if that show needs something unique and specific (rerun count, remakes, same-universe shows, etc). The basic template should be common to the subject, with the extensions drawn from a well-established vocabulary. The editor just needs to fill in the sections as needed. If there’s misc stuff to add, just put it into a misc-tagged area of the template. In the end, it should look no worse than the current system, which is the equivalent of a huge “misc” container.
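To make the template idea a bit more concrete, here’s a rough sketch of what a TV show template could look like as a data structure. Everything in it is my own assumption for illustration: the class name `TVShowTemplate`, the field names, the `EXTENSION_VOCABULARY` set and the `add_extension` helper are all hypothetical, not anything a real wiki provides.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

# Hypothetical shared vocabulary of extension fields for TV shows.
EXTENSION_VOCABULARY = {"rerun_count", "remakes", "same_universe_shows"}


@dataclass
class TVShowTemplate:
    # Base fields every TV show article would share (names are assumptions).
    title: str
    broadcast_dates: str = ""
    episodes: List[str] = field(default_factory=list)
    # Optional show-specific fields, keyed by names from the vocabulary.
    extensions: Dict[str, Any] = field(default_factory=dict)
    # Free-form catch-all for anything that has no field yet.
    misc: str = ""


def add_extension(show: TVShowTemplate, name: str, value: Any) -> None:
    """Attach an extension field, rejecting names outside the vocabulary."""
    if name not in EXTENSION_VOCABULARY:
        raise ValueError(f"unknown extension field: {name}")
    show.extensions[name] = value


# Usage: start from the base template, then extend it only where needed.
show = TVShowTemplate(title="Futurama", misc="Anything that doesn't fit yet.")
add_extension(show, "same_universe_shows", [])
```

The base fields stay the same for every show, and the vocabulary check is there so two editors don’t invent two different names for the same extension.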

but computers and humans read data differently

(for now :P)

So for people, it helps to have a nicely formatted, image-rich list of TV episodes. Computers simply want the raw data. One way is to transform the raw data into something human readable, like we do with HTML and CSS. But this means that the people entering the data need to enter it in raw form… which again destroys the ease of use of a wiki. But this is where the whole crowdsourcing business can get interesting.
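As a small sketch of the “raw data in, human readable out” direction, here’s one way it might look. The record layout (a dict with `title` and `episodes` keys) and the function name `render_show` are assumptions of mine, just to show the shape of the transform.

```python
from html import escape


def render_show(record: dict) -> str:
    """Turn a raw show record into a simple HTML fragment for human readers."""
    lines = [f"<h1>{escape(record['title'])}</h1>"]
    episodes = record.get("episodes", [])
    if episodes:
        # An ordered list keeps the episode order visible to the reader.
        lines.append("<ol>")
        for episode in episodes:
            lines.append(f"  <li>{escape(episode)}</li>")
        lines.append("</ol>")
    return "\n".join(lines)


# Usage: the machine keeps the dict, the reader gets the HTML.
print(render_show({"title": "Example Show", "episodes": ["Pilot", "Episode 2"]}))
```

The raw record stays untouched, so the same data can feed a different presentation (a mobile app, an inline reference on a blog) without anyone re-entering it.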

If, for some reason, people can’t be forced to use a specific generic template, we can just have a human readable section and a machine readable section maintained and synchronized manually by editors. Of course, not all articles may be popular enough to warrant an editor translating human to computer, but the larger articles undoubtedly will have someone (probably a techie) who wants to dedicate some time to keeping the data in sync. Or even vice versa.
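If the two sections are kept in sync by hand, it would help to at least know when they have drifted apart. Here’s a minimal sketch of one way to flag that, assuming the machine readable section stores a fingerprint of the human text it was last synced from. The `synced_from` key and all three function names are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after collapsing whitespace, so reflowing isn't a change."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def mark_synced(machine_record: dict, human_text: str) -> dict:
    """Return a copy of the record stamped with the current human text."""
    updated = dict(machine_record)
    updated["synced_from"] = fingerprint(human_text)
    return updated


def needs_sync(machine_record: dict, human_text: str) -> bool:
    """True when the human section changed since the last manual sync."""
    return machine_record.get("synced_from") != fingerprint(human_text)
```

This doesn’t do the translating, that part is still the editor’s job. It only tells the editor which articles are worth a look.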

so what do you get out of all this?

better access to data for even more people

While the concept is to allow machines to read data, oftentimes those machines are leveraging that data in order to display it to an end user. Being able to properly recognize sections of the article and transmit that data to the end user, instead of redirecting them, is probably a good thing. It can deliver the needed information to the user, allow the user to stay in context, and let them consume it and move along without loss of time and effort. It helps to spread the knowledge in a place like Wikipedia even wider and faster.

If you wanted the two-sentence summary, you could look in that marked section: #summary. Or if you needed something of paragraph length, there may be a tag for that (#detailed). Perhaps there’s a related field listing similar concepts (#related). The point is that developers can now smartly leverage all this information and deliver it to more people. In theory, you can even allow content editing and updating remotely, which can now make more sense since the system will be able to give you discrete chunks of information rather than the all-or-nothing format of having to visit the wiki itself.
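Here’s a quick sketch of how a developer might pull out one of those tagged sections. It assumes a made-up markup where a line containing only `#name` starts a section that runs until the next tag; that convention, and the names `parse_sections` and `get_section`, are mine and not how any existing wiki works.

```python
def parse_sections(article_text: str) -> dict:
    """Split an article into {tag: text} using lines like '#summary' as markers."""
    sections = {}
    current = None
    for line in article_text.splitlines():
        stripped = line.strip()
        is_tag = stripped.startswith("#") and len(stripped) > 1 and " " not in stripped
        if is_tag:
            current = stripped[1:]
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    # Join each section's lines back into a single string.
    return {name: " ".join(body) for name, body in sections.items()}


def get_section(article_text: str, name: str, default: str = "") -> str:
    """Fetch one tagged section, falling back to a default if it's missing."""
    return parse_sections(article_text).get(name, default)


# Usage with placeholder article text.
ARTICLE = """#summary
A short primer on the topic.

#related
Another topic
"""
print(get_section(ARTICLE, "summary"))
```

An app that only needs the short primer asks for `summary` and never has to touch, or guess at, the rest of the article.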

will this happen?

long term answer: yes, short term answer: no

As many know, if it ain’t broke, why bother fixing it, especially if it requires more effort upfront + more effort thereafter in maintenance. It may not happen with this current iteration of Wikipedia, and maybe it won’t have to happen if there are smarter AI/linguistic ways of understanding the text. But until that happens, I would think it’s easier to do a semantic markup of the wiki and start there. It need not happen immediately, and what may be interesting is that it probably only needs to happen on an as-needed basis.

Think in terms of a “static” blog post like this. If I were to write about woodwind instruments, being a static post, I just need that particular page to be cleaned up. As a blog author, for my readers, perhaps I can spend the time needed to template the page to make it semantically accessible, thus ensuring that any inline references I may want to leverage in my post will have a clean backend source to work with.

One step at a time. It’s always daunting to have to think about the millions of articles that could use dataverting, but if we just do it in terms of what’s needed, it may not be as bad.

Anyway, some thoughts. Feel free to disagree or discuss :). I know it’s pretty raw so there may be something that already does this, or it may have been something that’s been proven as not viable/realistic. But it was fun to think about either way. This one’s been on my radar of thought for a while now, but pointers to @davewiner for piquing my interest again with his post on “Wikipedia and Explainers”.

* This time I’m combining some structure with an image. Maybe I should do an image for each section? I think it at least helps the readability, since I’m a bit wordy and need ways to break apart a wall of text.