Bayesian Categorisers and Preference Maps

Jon Udell writes: “There’s been some discussion in the blog world about using a Bayesian categorizer to enable a person to discriminate along various interest/non-interest axes. I took a run at this recently and, although my experiments haven’t been wildly successful, I want to report them because I think the idea may have merit…We know that autocategorization succeeds in the narrow domain of spam filtering. Whether it can succeed more generally — for example, by helping blog authors and readers manage flows of items — is yet unclear. The raw tools are available, but until they’re well integrated into authoring and reading software, it will be hard to get a good sense of what’s possible.”

Some additional thoughts from Udell:

First, from the perspective of a blog author who already categorizes content (as many do), the question is: can effort that’s already being invested pay more dividends? An automated review of things that have already been categorized can help you sharpen your sense of the structure you are building. A prediction about how to categorize a newly-written item can be interesting and helpful too. As I worked through the exercise, I could (at times) imagine the software acting like a person you’d bounce an idea off of. “I can see why you chose that category,” we can imagine it saying, “but for what it’s worth, it has a lot in common with these items in this other category.”

The second and even more speculative idea would be to create subscribable filters. Consider the set of items that I write myself, and categorize under, say, web_services. Some other set of items out there in the blogosphere, written by other folks, will tend to cluster with mine. Could we say that those other items have some affinity for “Jon’s take on Web services”? And if so, by subscribing to my text-frequency database for that category could you use it to create one view of your own inbound feeds, or to suggest ones you’re not reading?
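This is easy to make concrete. Below is a minimal sketch of the kind of Bayesian categorizer Udell describes: train per-category word frequencies from items the author has already categorized, then score a new item against each category. The category names and training snippets are hypothetical.

    import math
    from collections import defaultdict

    class BayesCategorizer:
        """Naive Bayes over word frequencies, one table per category."""
        def __init__(self):
            self.word_counts = defaultdict(lambda: defaultdict(int))
            self.category_totals = defaultdict(int)

        def train(self, category, text):
            for word in text.lower().split():
                self.word_counts[category][word] += 1
                self.category_totals[category] += 1

        def score(self, category, text):
            # Log-likelihood with add-one smoothing, so unseen words
            # don't zero out the whole category.
            total = self.category_totals[category]
            vocab = len(self.word_counts[category]) + 1
            return sum(
                math.log((self.word_counts[category][word] + 1) / (total + vocab))
                for word in text.lower().split())

        def classify(self, text):
            return max(self.category_totals, key=lambda c: self.score(c, text))

    categorizer = BayesCategorizer()
    categorizer.train("web_services", "soap wsdl xml-rpc endpoints interop")
    categorizer.train("recipes", "flour sugar bake oven whisk butter")
    print(categorizer.classify("notes on xml-rpc interop"))  # -> web_services

The “bounce an idea off” behaviour Udell imagines would come from showing the runner-up scores as well, not just the winning category.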

Matt Mower follows it up with an interesting thought: “What might be interesting is if people could “share” and “subscribe to” preference maps. As a new user of the system you might not really know who is relevant on any particular topic. But imagine you worked with David Weinberger, Phil Wolff, or Dan Gillmor. If you knew them and trusted their judgement you could pick one of their preference maps as a starting point and immediately gain a useful insight into the data as it is structured by topic. You might even switch between personalities to get more perspective!”
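Mower’s shareable preference maps fall out of the same machinery: the trained word-frequency tables are just data, so they can be published at a URL and loaded by someone else as a starting point. A sketch, assuming the BayesCategorizer above; the URL is invented.

    import json
    import urllib.request

    def export_preference_map(categorizer):
        """Serialize the trained frequencies so others can subscribe."""
        return json.dumps({
            "categories": {c: dict(w) for c, w in categorizer.word_counts.items()},
        })

    def import_preference_map(categorizer, url):
        """Seed a new user's categorizer from a trusted blogger's map."""
        data = json.loads(urllib.request.urlopen(url).read())
        for category, words in data["categories"].items():
            for word, count in words.items():
                categorizer.word_counts[category][word] += count
                categorizer.category_totals[category] += count

    # Hypothetical: bootstrap from a colleague's published map, then keep
    # training on your own items so it drifts toward your own judgement.
    # import_preference_map(categorizer, "http://example.org/maps/web_services.json")

Switching “personalities”, as Mower puts it, would simply mean swapping which imported map is active.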

Recipe Web

Les Orchard has some interesting ideas on building out [1 2] a microcontent client for recipes, based on RecipeML: “The real strength in a recipe web would come from cooking bloggers. Supply them with tools to generate RecipeML, post them on a blog server, and index them in an RSS feed. Then, geeks get to work building the recipe aggregators…Since I’d really like to play with some RDF concepts, maybe I’ll write some adaptors to munge RecipeML and MealMaster into RDF recipe data. Cross that with FOAF and other RDF whackyness, and build an empire of recipe data.”
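The RecipeML-to-RDF adaptor Orchard mentions is mostly tree-walking. A sketch: parse a RecipeML fragment and emit subject/predicate/object triples. The elements follow RecipeML’s general shape, but treat the exact structure and the predicate names here as assumptions.

    import xml.etree.ElementTree as ET

    RECIPEML = """
    <recipeml version="0.5">
      <recipe>
        <head><title>Masala Chai</title></head>
        <ingredients>
          <ing><amt><qty>2</qty><unit>cups</unit></amt><item>water</item></ing>
          <ing><amt><qty>1</qty><unit>tsp</unit></amt><item>black tea</item></ing>
        </ingredients>
      </recipe>
    </recipeml>
    """

    def recipeml_to_triples(xml_text):
        """Walk a RecipeML document and yield RDF-style triples."""
        root = ET.fromstring(xml_text)
        title = root.findtext(".//head/title")
        subject = "recipe:" + title.lower().replace(" ", "_")
        yield (subject, "dc:title", title)
        for ing in root.iter("ing"):
            qty = ing.findtext(".//qty")
            unit = ing.findtext(".//unit")
            item = ing.findtext("item")
            yield (subject, "recipe:ingredient", f"{qty} {unit} {item}")

    for triple in recipeml_to_triples(RECIPEML):
        print(triple)

Crossing this with FOAF, as Orchard suggests, would then just mean adding one more triple linking the recipe’s subject to the author’s FOAF URI.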

A response from Troy Hakala:

We (Recipezaar) wrote a natural language recipe parser to make this possible and it’s a difficult job.

Imagine a world of XML recipes distributed around the web on weblogs. An aggregator would need to aggregate millions of weblogs just to cull together a few hundred or thousand recipes. Now imagine millions of aggregator users doing this daily or hourly the way they do this today for weblogs. And if a weblogger had 1,000 recipes on their weblog archives, they wouldn’t want millions of aggregators eating their bandwidth every day to maintain the database for each individual using an aggregator (webloggers today already complain about aggregators costing them too much money in bandwidth costs). Additionally, 99.999% of people who create recipes are unlikely to have a weblog to post their XML recipes so you’d lose the majority of the potential content.

A centralized repository provides a place for regular users to post their recipes and get them seen by the greatest number of people. And a centralized repository provides an easy way to search for recipes, browse for recipes, review & rate recipes, discuss recipes, etc. And let’s talk numbers… today, Recipezaar has 73,000 recipes in the database and, while it’s the largest database of recipes on the internet, people still can’t find a particular recipe because there is an infinite number of possible recipes that can be created. Having a few hundred or a few thousand recipes is not a useful database to people. More is better. And acquiring more via an aggregator is a big and expensive job.

Distributed databases are useful in some contexts and centralized databases are useful in other contexts. Each has its own advantages and disadvantages, but like auctions, recipes are best stored centrally where everyone has access to them.
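Hakala’s bandwidth argument is easy to check with back-of-envelope numbers. Every figure below is an assumption, and the sketch deliberately models the worst case of naive polling with no conditional GETs.

    # Worst-case polling cost for one recipe weblog, under assumed numbers.
    recipes_per_blog = 1_000
    bytes_per_recipe = 2_000       # a small RecipeML document
    aggregator_users = 1_000_000
    polls_per_day = 24             # hourly, with a full re-fetch each time

    archive_size = recipes_per_blog * bytes_per_recipe            # ~2 MB
    daily_transfer = archive_size * aggregator_users * polls_per_day
    print(f"{daily_transfer / 1e12:.0f} TB/day")                  # ~48 TB/day

Conditional GETs and incremental feeds would cut that enormously, but the shape of the argument survives: polling cost scales with readers times frequency, which is exactly the pressure toward a central repository.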

Adds Orchard:

The plan, if the people behind RecipeZaar like the idea, is to borrow their parser via web service for use in my hypothetical MovableType plugin. This could also be used for any number of other blogging tools. On the upside, we get the benefit of all the work done by Troy and company, and they get to pull in more recipes. On the downside, we’re dependent on a web service not under our control for the basic functionality of this plugin.

I’m excited to see more varieties of micro-content shared between the people of the web, but the thing I see least talked about is how this stuff will be authored. I read about data formats and all that, but in terms of user interface, we haven’t progressed much past the HTML textarea. Also, I often see handwaving and assumptions that the content is really pretty simple — but as Troy Hakala would tell you, not even something as simple as a recipe is a slam dunk in terms of digestion by a machine. There needs to be some happy medium between a natural human expression of information and the rigorous structuring required by a machine, mediated by good user interface.
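What Orchard’s plugin hook might look like, assuming a purely hypothetical parsing endpoint; the URL, parameter name, and response format below are invented for illustration, not a published Recipezaar API.

    import urllib.parse
    import urllib.request

    # Invented endpoint for this sketch; not a real service.
    PARSER_URL = "http://example.com/recipe-parser"

    def parse_recipe_text(raw_text):
        """Send free-form recipe text to a remote parser; get RecipeML back."""
        body = urllib.parse.urlencode({"text": raw_text}).encode()
        with urllib.request.urlopen(PARSER_URL, data=body, timeout=10) as resp:
            return resp.read().decode()

    # A MovableType-style plugin would call this when a post is saved and
    # attach the structured result, accepting the dependency Orchard notes:
    # if the remote parser is down, the plugin degrades to plain text.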

As I read all this, I couldn’t help thinking that what we need is an Information Marketplace. I think I have to speed up the thinking and just get it done. There are many areas I can now think of for applying it: for SMEs to find each other, an IndiaMirror, and now recipes.

Enterprise Blogging and RSS Ideas

From Robert Scoble:

For instance, I have a vision of a day when every single Microsoft employee will have a weblog. Now, what happens when you have 55,000 people weblogging inside of a corporation? Well, for one, I want to see weblogs in different ways. Why shouldn’t it be possible to see results from a search engine in order of where you are on the org chart, for instance? So, how can you match RSS data up with your domain data that’s stored in Exchange and/or other corporate data stores?

How about seeing data from corporate webloggers based on revenues? Or other metrics?
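Scoble’s org-chart view is straightforward once the directory data is available. A sketch of ranking feed items by the author’s distance from the reader on the org chart; the toy manager table stands in for whatever Exchange or another corporate directory would supply, and every name and number is made up.

    # Toy org chart: employee -> manager.
    MANAGER = {"alice": "carol", "bob": "carol", "carol": "dave", "erin": "dave"}

    def management_chain(person):
        chain = [person]
        while person in MANAGER:
            person = MANAGER[person]
            chain.append(person)
        return chain

    def org_distance(a, b):
        """Hops between two people via their nearest common manager."""
        ca, cb = management_chain(a), management_chain(b)
        common = next((p for p in ca if p in cb), None)
        if common is None:
            return len(ca) + len(cb)       # disjoint trees
        return ca.index(common) + cb.index(common)

    def rank_items(items, reader):
        """Sort feed items so nearby colleagues float to the top."""
        return sorted(items, key=lambda item: org_distance(reader, item["author"]))

    items = [{"title": "Q3 numbers", "author": "erin"},
             {"title": "Build broken", "author": "bob"}]
    for item in rank_items(items, "alice"):
        print(item["title"], "-", item["author"])   # bob's item ranks first

Weighting by revenues or other metrics would just change the sort key. Scoble continues: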

Also, one thing I miss is being able to tell readers what I think are my most important items. Look at the function of a newspaper designer. That guy adds a huge amount of value. Look at your average newspaper. You know that the biggest and top-most headline is what the newspaper has decided is the most important story. But, in weblogging we don’t have that ability. You get my 60 posts and you have no idea which of those 60 I think are most important.

In fact, you not only don’t have any idea which ones I find most important, but you have no idea which ones my readers think are most important. The only clue you have is how many comments or how many links a certain article has (and discovering how many links a certain article has is very tough unless I enable trackback, which I haven’t done because it slowed down my page loads and had other problems).

…and Mitch Ratcliffe:

Robert is very articulate — one has to be inside Microsoft, the institutional equivalent of a Darwinian pool — about how the ability to discover what content is new is one of the key features of blogging. It doesn’t exist in other Web page layouts or within corporate applications, where many people may be performing the same queries and need to know about similar interests/concerns visually; this is the heart of all the talk about the semantic Web. It’s simple in blogging to find what is new and, through trackback, what’s capturing attention: either the new content is at the top of the page or it is in the most recent RSS feed. That’s probably the most important benefit of what blogs have done, making it easy to author, share and debate information; it will obviously migrate into other applications, which is where the leading edge will be when everyone “gets” blogging as it is today.

In a page layout, which is how most people and organizations demonstrate what information is most important, there are structural, design and semantic elements we understand: important information is placed at the top of the page, yet a story may stay “important” long after its initial publication, a characteristic lost in blogging, which replaces the last “top story” with another based on chronological posting; the size and word choice in headlines convey a great deal of information, which is lost in an RSS feed.

So, we were speculating about the need for an RSS 3.0 that adds those features, including page placement metadata, so that the simplicity of blogging can be combined with the cues we’re used to in page layout. Imagine a page layout where a new or changed story blinked or glowed momentarily after a page loaded to indicate that it is new, yet the page still looked like a newspaper, report or other standard page.
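What that placement metadata might look like in a feed: a hypothetical namespace with importance and slot elements, attached to an item here via Python’s standard XML library. Nothing about the namespace or element names is standardized; this is one possible shape for the idea.

    import xml.etree.ElementTree as ET

    # Invented namespace for the layout cues discussed above.
    LAYOUT = "http://example.org/rss-layout/1.0"
    ET.register_namespace("layout", LAYOUT)

    def add_placement(item_xml, importance, slot):
        """Attach editor-assigned importance and page position to an item."""
        item = ET.fromstring(item_xml)
        imp = ET.SubElement(item, f"{{{LAYOUT}}}importance")
        imp.text = str(importance)     # e.g. 1 (lead story) .. 5 (brief)
        pos = ET.SubElement(item, f"{{{LAYOUT}}}slot")
        pos.text = slot                # e.g. "top", "sidebar", "below-fold"
        return ET.tostring(item, encoding="unicode")

    print(add_placement("<item><title>Lead story</title></item>", 1, "top"))

Ratcliffe continues: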

RSS 3.0 would need to include an interpreter that processed changes, the way a wiki page does diffs; a page would, essentially, need to read its own RSS feed. The result would be a dramatically richer Web, not just better blogging or a better browser in and of itself. Desktop publishing has gone through this kind of evolution, not to mention the management of versioning in code so that groups can share information in context, so this seems like a natural direction to go. The simplicity and discoverability of blogs should migrate into harder-to-use applications.

It could also include trackback analysis to display what is being linked to most. Positive and negative sentiment could be recorded, too.
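The change-processing interpreter Ratcliffe describes could work the way wiki diffs do: keep the last snapshot of each item, compare on every fetch, and surface what changed. A minimal sketch using Python’s difflib; the feed is modeled as a dict keyed by item guid.

    import difflib

    def changed_items(old_feed, new_feed):
        """Return (guid, diff_lines) for each item that is new or altered."""
        changes = []
        for guid, new_text in new_feed.items():
            old_text = old_feed.get(guid, "")
            if old_text != new_text:
                diff = list(difflib.unified_diff(
                    old_text.splitlines(), new_text.splitlines(), lineterm=""))
                changes.append((guid, diff))
        return changes

    old = {"post-1": "RSS 3.0 ideas"}
    new = {"post-1": "RSS 3.0 ideas, with placement metadata",
           "post-2": "Trackback analysis"}
    for guid, diff in changed_items(old, new):
        print(guid, diff)

A page that “reads its own RSS feed” would run exactly this comparison on load, then blink or glow the changed regions as Ratcliffe imagines.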

In fact, I think Traction would suit this well – it has a nice feature that lets you create the equivalent of a Front Page for every user.

UNCTAD E-Commerce and Development Report

[via Smart Mobs] Here. “This new edition analyses, from a development perspective, recent trends and advances in information and communication technologies (ICT), such as e-commerce and e-business, and examines their applications in developing countries. The report proposes strategic options to assist developing countries in designing national policies to take advantage of ICT.”

Modifying Information Offline

Adam Bosworth continues his description of how to build a web services browser in an intermittently connected world:

This new browser I’m imagining doesn’t navigate across pages found on the server addressed by URLs. It navigates across cached data retrieved from Web Services. It separates the presentation – which consists of an XML document made up of a set of XHTML templates and metadata and signed script – from the content, which is XML. You subscribe to a URL which points to the presentation. This causes the XML presentation document to be brought down, the UI to be rendered, and it starts the process of requesting data from the web services. As this data is fetched, it will be cached on the client. This fetching of the data normally will run in the background, just as mail and calendar on the Blackberry fetch the latest changes to my mail and calendar in the background. The data the user initially sees will be the cached data. Other more recent or complete information, as it comes in from the Internet, will dynamically “refresh” the running page or, if the page is no longer visible, will refresh the cache.
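A sketch of the fetch-and-cache behaviour Bosworth describes: always answer from the local cache, refresh in the background, and tolerate a dead connection. The service URL and the threading model here are assumptions.

    import threading
    import urllib.request

    class DataCache:
        """Serve cached web-service data at once; refresh in the background."""
        def __init__(self, url):
            self.url = url
            self.data = None                  # last known good copy
            self.lock = threading.Lock()

        def get(self):
            # Answer from cache immediately; kick off a refresh either way.
            threading.Thread(target=self._refresh, daemon=True).start()
            with self.lock:
                return self.data

        def _refresh(self):
            try:
                fresh = urllib.request.urlopen(self.url, timeout=5).read()
            except OSError:
                return                        # poor connection: keep stale data
            with self.lock:
                self.data = fresh             # a real UI would repaint here

    # cache = DataCache("http://example.org/service")   # hypothetical service
    # cache.get()  # None on first call, cached bytes on later calls

Bosworth continues: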

I recommend a model in which, in general, data isn’t directly modified. Instead, requests to modify it (or requests for a service) are created. For example, if you want to book a restaurant, create a booking request. If you want to remove a patient from a clinical trial, create a request to do so. If you want to approve an expense report, create a request to approve it. Then relate these requests to the item that they would modify (or create) and show, in some iconographical manner, one of four statuses:
1) A request has been made to alter the data but it hasn’t even been sent to the Internet.
2) A request has been sent to the Internet, but no reply has come back yet.
3) The request has been approved.
4) The request has been denied.

The important thing is that it works really well even when the connection is poor, because all changes respond immediately by adding requests, thus letting the user continue working, browsing, or inspecting other related data. By turning all requests to alter data into data packets, the user interface can also decide whether to show these overtly (as special outboxes, for example, or a unified outbox), or just to show them implicitly by showing that the altered data isn’t yet “final”, or even not to alter any local data at all until the requests are approved.
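Bosworth’s four statuses map naturally onto a local outbox of request packets. A minimal sketch; the class names and the shape of the send hook are assumptions.

    from dataclasses import dataclass, field
    from enum import Enum

    class Status(Enum):
        QUEUED = 1       # created locally, not yet sent
        SENT = 2         # sent to the Internet, no reply yet
        APPROVED = 3
        DENIED = 4

    @dataclass
    class Request:
        """A pending modification, related to the item it would change."""
        action: str                          # e.g. "book_restaurant"
        target: str                          # id of the data item affected
        payload: dict = field(default_factory=dict)
        status: Status = Status.QUEUED

    outbox = []

    def request_change(action, target, **payload):
        req = Request(action, target, payload)
        outbox.append(req)                   # UI can mark target "not final"
        return req

    def flush(send):
        """Try to transmit queued requests; safe to call on any reconnect."""
        for req in outbox:
            if req.status is Status.QUEUED and send(req):
                req.status = Status.SENT

    booking = request_change("book_restaurant", "table-42", time="20:00")
    flush(lambda req: True)                  # pretend the network accepted it
    print(booking.status)                    # Status.SENT

Whether the outbox is shown overtly or only as “not final” markers on the data is, as Bosworth says, a user-interface decision layered on the same queue.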

TECH TALK: An Entrepreneur’s Attributes: Experimentation, Trying New Things

An entrepreneur must be an experimenter, constantly trying out different things and exploring alternate avenues. Many of the experiments may fail, but out of these will arise learnings. Experimentation is what leads to innovation.

Inc has a review of a new book by Stefan Thomke on this very topic: Experimentation Matters. Inc summarises the six principles outlined by Thomke on managing the experimentation process:

1. Anticipate and exploit early information through “front-loaded” innovation processes. Thomke explains how there is much value in finding potential failures as early as possible. Considering the vast expense of late-stage failures, whether they are in drug experiments, software development, automobile crash simulations, or aircraft development, using new technologies early in R&D projects helps teams avoid potential problems downstream. Examples from Microsoft, Boeing and Toyota show how millions of dollars can be saved through early experimentation.

2. Experiment frequently but do not overload your organization. Although many early tests can minimize problem-solving delays and costs of redesign, organizations must be ready to handle the increasing amount of information that the experimentation will bring. Thomke uses an extensive and detailed case study from BMW to highlight this principle.

3. Integrate new and traditional technologies to unlock performance. New technologies can create impressive results, but they are not perfect and are not stand-alone techniques. Thomke writes, “To unlock their potential, a company must understand not only how new and traditional technologies can coexist within such a process but also how they enhance and complement each other.”

4. Organize for rapid experimentation. The ability to experiment quickly is an important component to effective learning. Since virtual experimentation brings organizations information earlier, managers are able to use results to guide their decisions about the use of major resources and avoid reworking bad designs after a company has committed itself to them. Thomke shows how rapid experimentation helped BMW learn how to make cars safer.

5. Fail early and often but avoid “mistakes.” New ideas are bound to fail, so early failures help to eliminate unfavorable options quickly and facilitate learning. Failures can produce new and useful information.

6. Manage projects as experiments. Leaders should have a portfolio of experimental projects from which they can learn, managed with the same seriousness that is applied to other business processes. Using a project as a learning experiment and an agent of change can help a company investigate diverse concepts.

For me, experimentation is another word for entrepreneurship. Let me give a personal example. During my IndiaWorld days, we created 13 India-centric websites; 9 of these did not work, but 4 of them (Samachar, Khoj, Khel and Bawarchi) did. When we started, little did I know which ones would work and which would not. The approach we took was to try out our new ideas, and keep the cost of experimentation low, till we got preliminary feedback from our readers. We were willing to fail, and that is why we succeeded.

Tomorrow: Value-Added Aggregation, Knowledge
