Tagging Text Automatically
Classifications are important; as users, we do better when we can use a well-structured classification scheme to find the information we’re looking for. In the Web 1.0 world, that played out as a hierarchy imposed by site designers (typified by the eternal sitemap) that the audience was forced to work with. Web 2.0 saw a move away from this predefined architecture and towards audience-defined taxonomies (sometimes called folksonomies) built on tags - end users associate tags with pieces of content, and use various mechanisms to navigate between similarly-tagged items.
The Problem
Tagging is a great strategy in certain circumstances, but it has a few important drawbacks. The one we’ll talk about here is the blank-slate problem: if you’re relying entirely upon the audience to generate your tags, then new content in a system suffers an inherent disadvantage. When a visitor comes to the site, they’ll explore the existing tag architecture, but they won’t find the new content (since it hasn’t been tagged yet). They may still be able to find it via some other mechanism (search, for example), but unless they then tag it, the content will stay buried. It’s a rich-get-richer situation - well-tagged content will be found and tagged more often, while under-tagged content will not be found and will remain under-tagged.
The Solution
The obvious solution to this problem is to seed all new content with some initial tags - but that raises the question of where those tags come from. You could have the content creator enter them, and for small amounts of content with a distinct creator that's fine. But what if you have a system that has massive amounts of content entered at one time? Or a system in which content is generated automatically (through feeds, for instance)? The ideal solution here would be to automatically extract tags from the text itself - and as luck would have it, it is entirely possible to do just that.
It turns out that there are multiple web services out there that will take in your content and spit back keywords. The two best-known of these are Tagthe.net (TTN) and Yahoo's Term Extractor (YTE) - they're both free, and they both work reasonably well. At Viget, we tend to use YTE: we find it produces slightly more relevant results, and it can return multi-word tags. We'll be implementing it in this example, though a TTN example is very similar.
Implementation: The Basics
The first step is to get an application ID from Yahoo! - it's essentially an API key that identifies your application. You may have to log in or register for an account with Yahoo! to get one.
Once you've got your application ID, it's as easy as POSTing your content to the Term Extractor URL (at http://search.yahooapis.com/ContentAnalysisService/V1/termExtraction). You'll get an XML response with your tags in a ResultSet (each tag is a Result).
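To make the response shape concrete, here is a minimal sketch that parses a hand-written sample offline. The sample XML and the parse_tags helper are illustrative assumptions based only on the description above (a ResultSet containing one Result per tag); a real response may carry extra attributes, such as namespaces, on the ResultSet element.

```ruby
require 'rexml/document'

# Hand-written sample in the shape described above (an assumption,
# not a captured response from the service).
SAMPLE_RESPONSE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <ResultSet>
    <Result>italian sculptors</Result>
    <Result>renaissance</Result>
    <Result>michelangelo</Result>
  </ResultSet>
XML

# Hypothetical helper: collect the text of each Result under the root element.
def parse_tags(xml)
  doc = REXML::Document.new(xml)
  doc.root.elements.to_a('Result').map(&:text)
end

p parse_tags(SAMPLE_RESPONSE)
# => ["italian sculptors", "renaissance", "michelangelo"]
```

Note that a single Result can hold a multi-word tag, which is one of the reasons given above for preferring YTE.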
Implementation: The Ruby Way
Here's the full example from a working Rails application. We drop this code into a TagExtractor class and call it with TagExtractor.extract(text).
require 'net/http'
require 'rexml/document'
require 'uri'

class TagExtractor
  APP_KEY = 'key' # replace with your Yahoo! application ID
  # host and path of the Term Extractor URL given above
  API_SITE_URL = 'search.yahooapis.com'
  API_PAGE_URL = '/ContentAnalysisService/V1/termExtraction'

  # public wrapper for the retrieve and parse process
  def self.extract(text)
    tag_xml = retrieve('context' => text)
    parse(tag_xml)
  end

  # pass the content to YTE for term extraction
  def self.retrieve(options)
    options['appid'] = APP_KEY
    res = nil
    Net::HTTP.start(API_SITE_URL) do |http|
      req = Net::HTTP::Post.new(API_PAGE_URL)
      req.form_data = options
      res = http.request(req)
    end
    res.body
  end

  # parse the XML returned from YTE into an array of tags
  def self.parse(xml)
    tags = []
    doc = REXML::Document.new(xml)
    doc.elements.each('*/Result') do |result|
      tags << result.text
    end
    tags
  end

  # a bare `private` does not apply to `def self.` methods,
  # so the helpers are hidden explicitly
  private_class_method :retrieve, :parse
end
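As a rough usage sketch, the extractor can be wrapped so that new content always gets a clean, short list of starting tags, even if the web service call fails. The seed_tags helper, its limit parameter, and the StubExtractor class are hypothetical names introduced for illustration; the stub stands in for TagExtractor so the example runs without network access or an application ID.

```ruby
# Hypothetical helper: returns starting tags for a new piece of content.
# `extractor` is anything that responds to `extract(text)` and returns an
# array of strings, such as the TagExtractor class above.
def seed_tags(text, extractor, limit = 5)
  return [] if text.nil? || text.strip.empty?

  tags = extractor.extract(text)
  # Trim whitespace, drop blanks and duplicates, keep the first few.
  tags.map { |tag| tag.to_s.strip }
      .reject(&:empty?)
      .uniq
      .first(limit)
rescue StandardError
  # Assumption: a failed lookup should not block saving the content.
  []
end

# Stand-in extractor so the sketch runs offline.
class StubExtractor
  def self.extract(_text)
    ['rails', 'tagging', 'rails', '', 'folksonomies']
  end
end

p seed_tags('Some article text', StubExtractor)
# => ["rails", "tagging", "folksonomies"]
```

In a Rails application, a call like this could run when content is created, with the returned tags saved alongside the new record; how the tags are stored depends on your tagging setup.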
