Semantic Web Archive
Semantic search engine Quintura has announced a new distribution partnership in which they will power search for Maxim Digital’s Web properties, which include Maxim.com and Blender.com. Yakov Sadchikov, President & CEO of Quintura, said, “Our strategic partnership with the online leader in men’s lifestyle, Maxim Digital, demonstrates Quintura’s ability to provide interactive site search solutions to consumer web publishers with millions of monthly users.”
The search functionality is now live on Maxim’s site. Google Trends shows about 30k users a day to Maxim and about 15k to Blender.
Semantic search provider Evri is moving into public beta today. Evri debuted at the D6 conference last month and I’ve embedded the presentation below. Founder Neil Roseman explains the Evri mission as, "(to) help users make more sense of the content on the Web." In the launch blog post tonight, Roseman continues, "Evri is building a data graph of the web, using our technology to understand the People, Places and Things in the web content read every day, and the actions that connect them to each other."
Update: Frederic over at RWW has more details on the semantic side of Evri.
NY-based Hakia is announcing the launch of their white-label semantic Web search (they call it Syndication Web Services) today. The idea is simple – you can now add Hakia’s services to your site, offering more powerful search functionality for your visitors.
The business model is welcoming: Hakia offers 30,000 searches per day free of charge and free of advertising, and past the 30,000 they will discuss a paid relationship with you. What this means is that small social networks may never pay anything for using the Hakia service while still providing a great benefit to their users.
The Syndication Web Services include Web search, News search, Vertical search, Summarizer, Categorizer, Characterizer and Text Meaning Representation. The services provide an XML feed, along with options to customize it. The first company using the new service is Berggi, which has created a worldwide mobile search application.
Over the weekend, the web was abuzz with discussion about Microsoft considering the acquisition of natural language search company Powerset. Some time ago I had heard a rumor that someone was looking at Powerset, but I was relatively uninterested. Hearing that the potential acquirer is Microsoft certainly makes it more interesting, but I have to say the concept leaves me more than a bit incredulous.
From skeptic to user
I became familiar with Powerset’s only competitor, Hakia, initially because they are a New York company. I became intrigued with Hakia because several months ago I tried their search engine, and it worked – really well. This was a surprising result for me since I have always been a skeptic regarding all things relating to artificial intelligence, speech recognition, natural language processing, and other such fuzzy technologies.
At least in the area of natural language processing, Hakia has changed my mind. In fact, it has become common for me to use the Hakia search engine when Google does not deliver sufficient results.
Hakia and Powerset are part of the same general area of natural language search. The idea with both services is that you can actually ask specific questions and get answers. But there are critical differences between Hakia and Powerset. And those differences bring me back to my incredulity at the idea that Microsoft is taking a serious look at Powerset.
Powerset indexes 750 times slower than Hakia!
I have no expertise in natural language processing or semantic search, or any type of full text search for that matter. But as far as I can tell, Hakia’s technology is *far* superior to Powerset’s. Why would I say that?
Well, first, as I have already said, it works. It is a real live search engine. I use it. I can’t say the same for Powerset. Powerset has yet to show anything but a search engine for Wikipedia. A big part of the reason Powerset doesn’t seem able to offer a real search engine is the fact that, according to their own reports, it takes them about 25 seconds to index a page, based on an average of 25 sentences per page. According to Hakia, it takes them 1/30th of a second to index a page. Essentially this means that Powerset cannot scale. It is seven hundred fifty times slower than Hakia!
Now you might assume that Powerset is slower because it’s applying some serious, and superior indexing mojo, and therefore what it is doing is much more valuable than what Hakia is doing. But alas that is also not true.
Hakia really knows how to read
Hakia is doing something called “ontological semantics”. What this means is that over the last four years, Hakia has developed an “ontology” for human expression. In layman’s terms, when Hakia indexes a page it looks at each sentence and figures out what *questions* that sentence answers. Any given sentence usually answers 3 or 4 questions. These questions are coded and go into what Hakia calls their Qdex, or question index.
In order to be able to figure out what the relevant questions are for a given sentence, Hakia’s indexer has to literally read the sentence. By “read” I mean it has to understand the actual meaning of the sentence semantically. This is a big deal.
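To make the idea concrete, here is a minimal Python sketch of what a question-style index could look like. The question templates and data structures are my own illustration, not Hakia's actual Qdex, which derives its questions from a full ontological analysis rather than string patterns.

```python
from collections import defaultdict

# Toy "question index": map a question to the sentences that answer it.
qdex = defaultdict(list)

def questions_for(sentence):
    """Guess a couple of questions a simple subject-verb-object sentence answers."""
    words = sentence.rstrip(".").split()
    subject, verb, obj = words[0], words[1], " ".join(words[2:])
    return [
        f"What {verb} {obj}?",        # e.g. "What reduces fever?"
        f"What does {subject} do?",   # e.g. "What does Aspirin do?"
    ]

def index_sentence(sentence):
    for question in questions_for(sentence):
        qdex[question].append(sentence)

index_sentence("Aspirin reduces fever.")
index_sentence("Ibuprofen reduces inflammation.")

# A query is matched against stored questions instead of raw keywords.
print(qdex["What reduces fever?"])   # ['Aspirin reduces fever.']
```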
Powerset uses statistics + syntax but can’t actually read
So, while Hakia is actually reading, Powerset does not actually attempt to understand what sentences mean. It uses a system that parses the syntax of the sentence and guesses matches based on statistics. But this approach means that for questions that do not match previously encountered syntactical patterns, the system will not be able to find answers, even if there are in fact answers in the database.
Powerset benefits from the Silicon Valley echo chamber
Now, if, for a moment, you presume that it is true, or even *possibly* true that Hakia is the superior service and technology, or if you even assume that Hakia is just equivalent to Powerset, why would Powerset be so continuously celebrated while Hakia is overshadowed?
The only answer I can come up with is that the west coast is such an echo chamber that very little sound gets in or out. And so it must be shocking when a New York company develops a technology that seems to beat the pants off something that should be pure Silicon Valley. Just a thought.
In any case, it seems, for the record, worth noting that we have the clear leader in natural language processing and search technology right here. And, as an admitted New York partisan, after a while it does get a little annoying to hear such continued fawning over a west coast company that is very likely, at the end of the day, just another Silicon Valley also-ran.
This article was authored by Hank Williams, a New York-based entrepreneur who recently launched a new blog, Why Does Everything Suck?, exploring the tech marketplace from 10,000 feet.
Earlier this week, I attended the NY Social Media Club meeting. The topic of discussion was Semantic Web and Web 3.0. There were two panelists, moderated by Howard Greenstein. The first panelist was Tim McGuinness, Vice President of Search from Hakia.com. Hakia is a NY-based startup that has a great meaning-based search engine. They just launched a new beta version with some social networking features this week. They use Natural Language Processing techniques to produce better search results. Nate Westheimer was the other panelist. He is the founder of BricaBox.com, a site that launched its Beta this week. Also, Marco Neumann, the leader of the NY Semantic Web Meetup, contributed a lot to the conversation. This post is not a strict summary, but rather some thoughts related to and inspired by the discussion. I purposely use the term Meaning-Based Web, and stay away from the term Semantic Web, since the latter refers more to a specific set of technologies than to the wider concept.
Meaning-Based Web – Motivation
The Semantic Web is really about improving the connections and the meaning that one can glean from the Internet, so that when you search, the tool only returns results relevant to the meaning of what you are looking for. The goal of meaning-based web technologies is to make the meaning of the pages on the World Wide Web better understood by computers. This will drastically improve our ability to find things, and to ask intelligent questions about the world.
To illustrate the difference: today, when somebody does a search for “George Bush”, the search engines are fundamentally looking for a string of characters in the sequence you typed in. They do not understand that you are talking about a person, and cannot directly relate that George Bush is the president, etc. You want your search to find all the cases when George Bush is referred to through meaning, i.e. The President, the 41st President’s son, “W,” Kerry’s opponent in 2004, etc. To us humans, these are obvious connections to make; to computers – not so.
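A toy Python sketch of the difference; the alias table below is hand-built for illustration and stands in for connections a meaning-based engine would make automatically:

```python
documents = [
    "The 43rd President addressed the nation.",
    "Kerry's opponent in 2004 visited Texas.",
    "George Bush signed the bill.",
]

# Literal keyword search only matches the exact character sequence.
keyword_hits = [d for d in documents if "george bush" in d.lower()]
print(keyword_hits)    # only the third document

# A hand-built alias table stands in for "meaning"; a meaning-based engine
# would derive these connections itself rather than from a fixed list.
aliases = {"george bush", "the 43rd president", "kerry's opponent in 2004"}
semantic_hits = [d for d in documents if any(a in d.lower() for a in aliases)]
print(semantic_hits)   # all three documents
```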
There are two philosophically different ways to approach this goal: Semantic Web techniques and Natural Language Processing. In a way, they are two sides of the same coin.
Approach One – Evolve the Web (Semantic Web, Microformats)
The first approach is to evolve the web by adding more information to it. This means that content producers will add more information to the Web, and thus enhance it to make it more understandable to machines. The Semantic Web is a set of W3C standards that allow content producers to add structured data and applications to query it. Very similar to the W3C approach, Microformats are another, more lightweight, way to do the same. The goal is for content to have semantic information attached to it so that computers can read it and form connections just like humans do.
Using this approach, a page with information about the movie “Magnolia” has hooks in the page (possibly invisible to the user) that mark it as such. A page about the magnolia flower has markings that explain that it is about a flower.
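As a rough sketch of how a machine might consume such hooks (the "topic-" class names are invented for this example, not an official microformat vocabulary):

```python
import re

# Hypothetical pages: the "topic-..." hooks are invented for illustration;
# real pages would use RDFa or an actual microformat vocabulary.
movie_page  = '<div class="topic-film"><span class="title">Magnolia</span> (1999)</div>'
flower_page = '<div class="topic-plant"><span class="title">Magnolia</span> blooms in spring.</div>'

def page_topic(html):
    """Read the embedded hook instead of guessing from the visible text."""
    match = re.search(r'class="topic-(\w+)"', html)
    return match.group(1) if match else "unknown"

print(page_topic(movie_page))   # film
print(page_topic(flower_page))  # plant
```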
Approach Two – Extract Meaning (through Natural Language Analysis)
The second approach is to work harder to extract meaning from the web as it is. It assumes that enhancing the data is very cumbersome, and that while some people will do it, not everybody will. Additionally, adding more data means that there will always be holes and things that cannot be expressed easily. It would be great if computers could get closer to the real meaning of what the web pages are talking about.
This movement espouses Natural Language Processing, a set of techniques that try to extract meaning and relationships from text. The algorithms read the text and cull meaning from it, coupled with an ontology of relationships defined elsewhere. For example, the ontology will know that a car is a vehicle, that a car has certain actions it can perform, and that it is an inanimate object, which means that it cannot speak. As it “reads” the web pages, it applies the ontology to the content and records not where a specific word can be found in a document, but rather where a specific concept is. Additionally, these ontologies are largely language independent, aside from some minor language-specific particularities.
To come back to our example, using this approach the technology will automatically be able to tell that the page is talking about a flower because it sees words like “grows,” “soil”, etc. – the same clues that allow us humans to disambiguate the meaning of words. It is able to figure out meaning from the context.
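A very naive sketch of that kind of context-based disambiguation follows; the tiny hand-written "ontology" of context words stands in for the much richer models a real NLP system would use:

```python
import re

# Hand-written context words per sense, purely for illustration.
senses = {
    "magnolia (flower)": {"grows", "soil", "petals", "bloom", "garden"},
    "magnolia (film)":   {"director", "cast", "scene", "soundtrack", "ensemble"},
}

def disambiguate(text):
    words = set(re.findall(r"[a-z]+", text.lower()))
    # Pick the sense whose context words overlap most with the page text.
    return max(senses, key=lambda sense: len(senses[sense] & words))

print(disambiguate("Magnolia grows best in rich soil and blooms in spring."))
# -> magnolia (flower)
print(disambiguate("Magnolia's director assembled an ensemble cast."))
# -> magnolia (film)
```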
In reality, both approaches have their strengths and weaknesses, although I have to admit to being more partial to the NLP approach over the long term.
The implications here are profound. As this technology improves, searching will become seamless, and Search Engine Optimization will be a thing of the past. Search engines will understand the true meaning of the content, and will be able to direct people towards you. The very cumbersome task of thinking up the various words that your content can be searched on will also be a thing of the past.
The ability to understand text on a higher level (natural language processing) means that ads will be targeted even more precisely. A lot of ambiguities will be resolved easily, just by the engine asking you a few disambiguating questions. As a user, you will be rewarded for entering more search terms, since the engine will be able to find the information you are looking for faster. You will be able to have an interactive conversation with your search engine until you zero in on precisely what you are looking for.
As far as search engines are concerned, I see meaning-based searching as the future. However, that cannot happen in isolation. For example, there is a lot of bad information on the web about child vaccinations, from a vocal minority of sorts, mostly driven by laypeople. There is also a tremendous amount of authoritative research data that shows the benefits of vaccinations. One of the reasons that Google search has been successful is that they have been able to harness the power of authority – their original PageRank algorithm was based on the assumption that a page that has been linked to a lot is more authoritative than others. Since then, their search algorithms have evolved a thousandfold, but the central concept of authoritative sources is still very important on the internet (and in real life).
On top of natural language techniques and the authority-based approach, the next realm in search is personalization and social networking. The next generation of collaborative filtering technologies will be collaboration-based with personalization mixed in. You’ll receive not just the best content, but the best content targeted to your current interests. If the search engine knows that I am currently interested in dancing and I search for salsa, it will automatically return sites related to dancing, as opposed to cooking. Additionally, if it can mark studios or events that my friends have been to, that would be even more valuable.
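A small sketch of what such interest-based re-ranking might look like; the profile, result set, and scoring below are invented for illustration and don't describe any particular engine:

```python
# Hypothetical interest profile and tagged results for the "salsa" query.
user_interests = {"dancing", "music"}

results = [
    {"title": "Salsa recipe with fresh tomatoes", "topics": {"cooking", "food"}},
    {"title": "Beginner salsa dance classes in NYC", "topics": {"dancing", "events"}},
    {"title": "History of salsa music", "topics": {"music", "history"}},
]

def personalized_rank(results, interests):
    # Boost results whose topics overlap with the user's current interests.
    return sorted(results, key=lambda r: len(r["topics"] & interests), reverse=True)

for r in personalized_rank(results, user_interests):
    print(r["title"])
# The dance classes and salsa-music pages now rank above the recipe.
```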
So what is Web 3.0? Nobody knows yet, and neither do I. Right now, I think it’s taking shape as a combination of several emerging technologies – meaning-based web, social networking, greater personalization, and locale-based information. I think that once you are able to create mashups based on meaning-based information, and extract that information easily from existing data sources, then we will have Web 3.0. Lastly, many sites will offer not just access APIs, but a way to really integrate your application into them. Therefore, the Facebook API, OpenSocial, and Ning are early precursors of Web 3.0.
New things will become possible. It will be easy to cross-reference unstructured documents with information stored in relational databases. It will be easy to create a personal profile page based on the information already out there on the internet. It will be easy to create something similar to a tumblelog based on your web activities. We are not quite there yet with mashups, at least not based on what I’ve seen. We are close, though. When everything becomes a data source, then we will have arrived.
Since I work for a company, Alfresco, that is focused on bringing Web 2.0 ideas into the enterprise, I am concerned with how this will affect the people behind the firewall. Just like on the public web, I see a great opportunity to transform existing systems and ways of collaborating.
One of the reasons why many content management solutions exist is to add semantic meaning to data. When you create a taxonomy to classify your documents, you are adding semantic meaning on top of unstructured content. Much of the reason we do this is that computers cannot quite do it themselves. With technology improving, a lot of traditional document management systems will be fundamentally changed. Whole areas of taxonomy analysis and information architecture will be transformed, since semantic web techniques will allow these taxonomies to be extracted automatically from the documents themselves.
I also see some great short-term opportunities in Natural Language Processing technologies and services. If a document can be automatically tagged with metadata instead of humans having to do it, this leads to a much better user experience and thus more useful content management systems. Some of this will require better plumbing, and some of it will require newer interfaces, the kind that Adobe Flex or Microsoft Silverlight are starting to enable. This is why we are firmly committed to Flex as the future evolution of our user interface.
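As a rough illustration of auto-tagging, here is a deliberately naive frequency-based sketch; it is not Alfresco's or anyone else's actual implementation, and a real NLP tagger would map terms to concepts in an ontology instead of counting words:

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "this"}

def auto_tags(text, count=3):
    """Naive auto-tagging: the most frequent non-stopword terms become tags."""
    words = re.findall(r"[a-z]+", text.lower())
    frequencies = Counter(w for w in words if w not in STOPWORDS and len(w) > 3)
    return [word for word, _ in frequencies.most_common(count)]

document = (
    "The quarterly report covers revenue growth, revenue forecasts, "
    "and hiring plans for the engineering team."
)
print(auto_tags(document))  # e.g. ['revenue', 'quarterly', 'report']
```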
Since the next generation of the web will feature much better meaning-based technologies, this will also dramatically improve collaboration and information sharing. Tools will be developed that will become agents, searching in the background for information that’s relevant to your work interests, and will automatically notify you of things you didn’t even know you were looking for. As you are working on solving a problem, an agent will also be searching for a solution to your problems, both inside the intranet and on the public web. The agents will also be able to traverse your social network, and connect you with other people in your social network or company who have expertise in the area.
Auto-tagging, auto-classification, and new ways of collaborating – wiki-based, mashup-based – are all transforming the public web. And these superior ways of collaborating are moving rapidly inside the enterprise. Forrester talks about tech populism – the idea that as the web becomes more user-friendly, enterprise users will demand the same simplicity and interactivity they are becoming used to.
This is the future I am excited to be a part of.
Some more resources on Meaning-Based Web and Web 3.0:
- Great article about the Semantic Web from Scientific American by Tim Berners-Lee, the inventor of the World Wide Web.
- Semantic Wave Report from Project 10X
- LingPipe – an advanced NLP Java library. New York-based and partially open source.
- Twine – from Radar Networks, a semantic web startup that just got some funding.
- Hakia.com – NLP-based search engine
This article was authored by Jean Barmash, a New York-based technologist. Jean is the Director of Technical Services at Alfresco, the Open Source Enterprise Content Management company. He also blogs at NY Web Guy.
Tonight NY-based AdaptiveBlue is announcing the launch of the latest version of their BlueOrganizer Firefox (and Flock) add-on codenamed Indigo. I had a discussion with CEO Alex Iskold and Biz Dev Director Fraser Kelton about the new features and my notes are below.
Iskold loves to talk about "semantic Web" — I think he has used the term more than any other single person online today. In simple terms, BlueOrganizer is a "smarter" way to browse. It takes normal links and enhances them. It senses what a page is about and can switch how it handles the page based on content type. For example, if you are on a book page on Amazon, BlueOrganizer knows and adjusts the links it presents to you. If it’s a movie site, you might see Fandango but you wouldn’t see that as an option on a music page. The little icons on the toolbar change to reflect what type of page it is.
There is a tie-in to many of the major social services, including Twitter, Tumblr, and Lijit, which lets you export your saved items directly. It’s a good way to get your favorites out to your friends quickly.
When you install BlueOrganizer, it filters through your Internet history to determine what initially shows up on your BlueOrganizer profile. Iskold says that none of this data is transmitted to AdaptiveBlue.
One of the interesting bits I noticed when watching the video is that if you install the SmartLinks widget (another AdaptiveBlue product), it automatically adds BlueLinks to the site. This is a very smart distribution move. Once you install the BlueOrganizer add-on, it scans Web pages and injects the SmartLinks into the page as it finds them. It’s a good idea, but at the same time, could it take away my chance to earn affiliate revenue? If I have a link to a book on Amazon using my affiliate code, and then the person goes to Amazon through the SmartLink, I lose that sale. Perhaps there is a way to engineer it so I still receive credit.
The add-on also makes the most of microformats, so if you click on an address that is marked up with microformats, it presents the address with links including Google Maps and other location-based information. The system also recognizes 500 common names, and by clicking on a name, it provides a menu of options including Wikipedia and Google Search.
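To sketch how an address marked up with the adr microformat could be turned into a Google Maps link (the sample markup and extraction code below are my own illustration, not AdaptiveBlue's implementation):

```python
import re
from urllib.parse import quote_plus

# Sample markup using the "adr" microformat class names.
html = (
    '<div class="adr">'
    '<span class="street-address">155 W 23rd St</span>, '
    '<span class="locality">New York</span>, '
    '<span class="region">NY</span> '
    '<span class="postal-code">10011</span>'
    '</div>'
)

def address_parts(markup):
    """Pull the text of each adr sub-property out of the markup."""
    return re.findall(r'<span class="[^"]+">([^<]+)</span>', markup)

def maps_link(markup):
    """Build a Google Maps search link from the recognized address."""
    return "http://maps.google.com/maps?q=" + quote_plus(" ".join(address_parts(markup)))

print(maps_link(html))
# http://maps.google.com/maps?q=155+W+23rd+St+New+York+NY+10011
```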
Some stats that Iskold shared include 1.3 million downloads of the add-on, 5,000 blogs with SmartLinks installed, and hundreds of thousands of active BlueOrganizer users.
The company continues with the same revenue model we have written about previously — affiliate sales. When you click to purchase a book or movie through BlueOrganizer and don’t have the affiliate field set, the commission goes to AdaptiveBlue. Amazon came out last week with a strong notice about people using their own affiliate code for sales. Not sure how that impacts how the affiliate codes work with BlueOrganizer.
Here is a simple animation showing how BlueOrganizer/Smartlinks work on Amazon, with an address and on a Web page: