Friday, July 25, 2008

Search relevance and the data-driven web

My frustration with CNET's informative but misguided article on Microsoft's BrowseRank has only redoubled after reading some of the other coverage, like Andy Ha's breathless and utterly silly piece on VentureBeat -- I refuse to link to it, but you can go look it up if you're curious. Example phrases (good for searching on, too ;-):

"could really shake up the search landscape"
"it would upend the way companies get attention online."
"it might be able to reverse MSN Search’s slide towards irrelevance."
"a new kind of search"


I decided to channel my powers for good instead of evil: rather than trolling in the VentureBeat comments, I trawled through the Onotech archives and pulled together many pieces I've written about search relevance into this one handy post that I can refer to in the future.

Note that this is not just about search. Search is just one corner of the golden rectangle of the data-driven web -- search, ad targeting, discovery, and analytics -- that generates rivers of cash for every modern web company. All of these product mechanisms share underlying concepts and technologies, and I've blogged about them (and the infrastructure that makes them possible) many times over the years. Here, in time order, are some of the better posts on the concepts behind the data-driven web.

Another time, I'll pull together a second summary blog post on infrastructure thoughts - because there are some really cool software technologies that make this all possible.

This is also probably a great time to show off my birthday present from Somer Simpson:
[Image: "machine learning power to the data" poster]
Schweet. I may make T-shirts -- email me if you want one ;-)

OK, here we go:

Click data drives not just search, but all advertising

The impact of monetization on search result ranking

Query type differentiation and machine learning

All search is social search because people determine value

On relevance/selection mechanisms, and types of voting

The four fundamental ways to determine ranking

How to better understand user intent

CNET on BrowseRank: An informative article with a nonsensical premise

It's great to see a well-written, informative article, "Microsoft tries to one-up Google PageRank," about an innovation in search ranking, published at a major tech news outlet like CNET. Stephen Shankland's piece on Microsoft's BrowseRank is definitely all that. It thoughtfully discusses the concept that people's click behavior is a very powerful popularity-voting mechanism for assessing web page search relevance, and a different one from the link graph of PageRank. All good.

However, it's maddening to see yet another naive or deliberately misleading article on an innovation in search ranking that perpetuates the persistent misunderstanding of how search works and what makes search better. Maybe it's just a standard media trope to essentialize any topic to the point of parody, but I'm so tired of seeing pieces that fetishize "The Algorithm" as some singular magical trump card by which search is won and lost. Combining scientific models to produce a ranking function is difficult, obscure, and incredibly important, so to some extent I can understand why the media keeps writing stories focused on this. But it's sort of like watching a Saturn V take off for the moon, and turning confidently to your neighbor and saying, "That thing takes off because NASA figured out The Engine... I hear the Russians are working on something better than The Engine."

Please. Great search is made up of several major areas of competence:
* Scaled aggregation of content. It doesn't matter how good your matching might be if you don't have what the user wants in your cupboard of goodies.
* Scaled user voting behavior to assess value. Be this the ability to crawl and assess hyperlinks, access to and the ability to assess user clicks, or one of (as Udi Manber rightly says) hundreds of other variously valuable and tractable methods, you need access to behavior metadata, crystallized in one form or another.
* A scientific process and platform by which you can run many experiments to fine-tune the value of various voting behavior signals.
* A technological platform to rapidly and cost-effectively perform this mind-boggling level of computation, faster than the answers and the questions are evolving on a global scale.
* A bunch of really great scalability engineers to build that platform.
* A bunch of really great search scientists to conceive, build, and test models on a continuous basis.
* Oh, and a very effective monetization effort to pay for all of this incredibly expensive infrastructure, people, and time.

It sounds a lot more like General Motors circa 1955 than it does "genius in a garage cooking up the next great thing," doesn't it? Perhaps that's why the media always falls into this trap - the brilliant loner or breakthrough insight that changes the world is just such a powerful narrative hook, whereas a discipline, competence, process story is just kind of... boring.

The reality of world-class search today is that it's big, complicated, and multi-faceted. It has emerged into a discipline of technology all its own, and advances will tend to be subtle and hard to explain. Remember that whenever you read the next excited story about the Next Great Algorithm.
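To make the "run many experiments to fine-tune signals" point a little more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Page` fields, the two signals, the weights, and the example URLs are assumptions of mine, not anything Microsoft or Google has described. It blends a link-graph-style score with a click-behavior score and checks which weighting puts the page the user actually wanted nearest the top.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    url: str
    link_score: float   # hypothetical link-graph signal (PageRank-like), 0..1
    click_score: float  # hypothetical click-behavior signal, 0..1


def rank(pages, link_weight, click_weight):
    """Order pages by a weighted blend of the two signals, best first."""
    return sorted(
        pages,
        key=lambda p: link_weight * p.link_score + click_weight * p.click_score,
        reverse=True,
    )


def reciprocal_rank(ranked, wanted_url):
    """Return 1/position of the page the user wanted, or 0.0 if it is absent."""
    for position, page in enumerate(ranked, start=1):
        if page.url == wanted_url:
            return 1.0 / position
    return 0.0


# Invented data: three candidate pages for a single query.
pages = [
    Page("a.example", link_score=0.9, click_score=0.2),
    Page("b.example", link_score=0.4, click_score=0.8),
    Page("c.example", link_score=0.1, click_score=0.5),
]

# One tiny "experiment": try a few weightings and score each one.
candidate_weights = [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]
wanted = "b.example"
results = {w: reciprocal_rank(rank(pages, *w), wanted) for w in candidate_weights}

# max() keeps the first of any tied weightings.
best_weights = max(results, key=results.get)
```

A real system would run this kind of comparison over huge numbers of queries and hundreds of signals, which is exactly why the platform, the people, and the process matter more than any single formula.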

Update: I've written a new post that channels my grumpiness to productive ends -- a few more thoughts on the data-driven web and an index to my various posts on search relevance. Reporters, please read them all... twice!

Saturday, July 19, 2008

The tech bubble's toxic legacy

An article on GigaOm caught my eye: "With Exits Barred, VCs Keep Investments Flat". In the piece, Stacey Higginbotham relates the reassuring news to entrepreneurs that VC investment in early stage companies remains strong, and that later-stage deals are being done at a historic clip -- 318 in the second quarter alone. But then she drops this little tidbit:
If those companies don’t exit within the next two to three years, VCs will have to start selling at a loss or pushing firms into bankruptcy.

Huh?! What's wrong with this picture? Any 'late stage' deal likely means a C and possibly a B round -- so the company involved has been in business for 2 or 3 years already when it raises the round; and the round should last it for at least another 18 months.

The presumption here is that tech startups don't get to profitability.

This is the worst legacy of the tech bubble, when lots of permanently unprofitable companies were started, funded, and even IPOed. It's got nothing to do with the true legacy of Silicon Valley, where superstars of past eras -- Intel, Apple, Sun, PeopleSoft, Oracle -- became insanely profitable companies. And this presumption is blinding us to what's happening in the Valley today. It's the same mistake that made you not buy Google shares at $85 in 2004, and you shouldn't let it blind you again.

I personally know of at least three highly profitable startups that may go public in the next 12 months, and may not. All three of them have been in business for over five years, and they are throwing off not just revenue but cash profit at a very impressive rate. To those three, I can add at least a dozen that I'm pretty sure are highly profitable as well; I just don't know for certain. All of these companies are the types of startups that are raising the "late stage rounds" that Stacey seems to believe must lead to exit or bankruptcy, because there is no third path.

Every startup that I've built has been intended from Day 1 not just to be transformative to its market, but to be a real, profitable business, with real customers. That's the true legacy of Silicon Valley, and the sooner we get our heads past the failed aberration of 1999-2001, the better.