Ever since Twitter disabled access to their firehose last year, many users have been waiting with bated breath for its return. The “firehose” refers to the stream of all public Twitter posts. Currently, it’s only possible to get a small subset of all public posts, and many types of Twitter applications, such as real-time track and trend analysis, aren’t possible without access to the firehose.
For a time, Twitter planned to allow firehose access through a service called Gnip, but in October stated that they would instead work on providing access themselves. Since then, details have been sparse on the timeline and on the method by which access will be given.
Last week, however, some more information was quietly released by means of an FAQ on the Twitter API website. Here’s the question and answer in full (emphasis theirs):
When will the firehose be ready?
By late January, early February 2009. For at least Q1 2009, the “firehose” (the near-realtime stream of all public status updates on Twitter) will only be available to a small group of trusted partners. The firehose is a stream HTTP solution; a client connects to it and the stream begins, ceasing only when the client disconnects. Once we’re confident in the stability of the service, we’ll add partners on a case-by-case basis. We may allow a wider selection of clients to consume subsets of the public stream (that is, updates from a collection of user IDs or matching specific search terms). We do not intend to allow anonymous, unregulated public access to this stream for any number of legal, financial, and technical reasons.
There are a few pieces of new information here. First, some kind of beta group will be given firehose access within a few weeks, using HTTP streams. This sounds similar to the solution provided by Gnip to their users.
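To make the streaming model a bit more concrete, here is a rough sketch of how a client might consume such a stream and cut it down to a subset by user ID or search term. The FAQ doesn't say what the wire format will be, so the newline-delimited JSON framing is purely an assumption, and the names `iter_updates`, `matches`, `user_id`, and `text` are hypothetical placeholders rather than anything Twitter has announced.

```python
import json
from typing import Iterable, Iterator


def iter_updates(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed update per newline-terminated JSON line.

    ASSUMPTION: updates arrive as newline-delimited JSON. `chunks` can be
    any iterable of byte strings, e.g. the pieces read from a long-lived
    HTTP response body; chunk boundaries need not align with updates.
    """
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Emit every complete line currently sitting in the buffer.
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            line = line.strip()
            if line:  # skip blank lines between updates
                yield json.loads(line)


def matches(update: dict, user_ids=frozenset(), terms=()) -> bool:
    """Return True if the update is from a wanted user or mentions a term.

    ASSUMPTION: `user_id` and `text` are hypothetical field names.
    """
    text = update.get("text", "").lower()
    from_user = update.get("user_id") in user_ids
    has_term = any(term.lower() in text for term in terms)
    return from_user or has_term


# Simulated stream: note that chunks break in the middle of an update.
sample_chunks = [
    b'{"user_id": 1, "text": "hello wor',
    b'ld"}\n\n{"user_id": 2, ',
    b'"text": "Firehose news"}\n',
]
subset = [u for u in iter_updates(sample_chunks) if matches(u, terms=("firehose",))]
```

In a real client the chunks would come from an HTTP connection that stays open until the client disconnects, and the loop would simply never finish on its own. The filtering here happens client-side only for illustration; the FAQ suggests any subsets would be selected on Twitter's end.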
Perhaps more important, though, is the news that a full public stream of the type previously provided will not be returning. While providing subsets of the public stream could be useful for things like groups, without the full firehose it’s very difficult (if not impossible) to provide a feature like real-time tracking, which has been eagerly awaited.
The three reasons given for not making the firehose public (“legal, financial, and technical”) each raise additional questions in turn. Legally, is there a difference between providing public tweets in a full stream and providing tweets publicly by user (which requires knowing the username ahead of time)? Do the financial motivations refer to saving money on servers and bandwidth, or to making money by providing access for a fee? And technically, are the existing solutions (such as HTTP streams or XMPP) insufficient for the task, and if so, how?