How much can be learned from 2.5 hours of prototyping?
I have an idea for a Twitter mash-up. It's based on the premise that some tweets are worth more than others, and that maybe there are criteria to tell which.
Last week I had one free evening to work a little on this side project, and since it's mine alone, I'm free to blog about it. :)
First, I had to find out whether the idea could actually be implemented, and because I was not yet familiar with the Twitter API, I threw together a "proof of concept" in PHP.
- 1 hour: reading and understanding the Twitter API
- 1.5 hours: creating the proof of concept in PHP
By the end of this time, I had a PHP webpage with absolutely no design, but it printed out the tweets it considered "high value" from the freshest twenty. While developing this prototype, I already improved the accuracy of my idea by about 50%: I found several "holes" where the algorithm returned false positives, and for the gravest of these I came up with simple workarounds.
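The proof of concept was in PHP, but its core loop can be sketched in Python (the language I'd use going forward). The scoring rule is the heart of the idea, so `score_tweet` below is a hypothetical placeholder, and the tweets are stubbed rather than fetched from the live API:

```python
# Sketch of the proof-of-concept loop: take the freshest tweets and keep
# only the ones the (hypothetical) scoring rule considers "high value".
# Both the scoring rule and the input data are stand-ins.

def score_tweet(tweet):
    """Hypothetical placeholder for the real scoring criteria."""
    score = 0
    if tweet.get("retweet_count", 0) > 10:
        score += 1
    if "http" in tweet.get("text", ""):
        score += 1
    return score

def high_value_tweets(tweets, threshold=2):
    """Filter the latest batch down to the promising ones."""
    return [t for t in tweets if score_tweet(t) >= threshold]

# Stubbed input instead of a live API call:
latest = [
    {"text": "check this out http://example.com", "retweet_count": 25},
    {"text": "just had lunch", "retweet_count": 0},
]
print(high_value_tweets(latest))
```

The real criteria are deliberately not spelled out here; the point is only the shape of the loop: score each tweet, keep the ones above a threshold.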
If this were work for a client, I would need 2-3 more hours to write a summary of my findings. This blog post is not as detailed as that document would be, but here I jot down the same content for myself, for the next time I work on this idea:
Suggestions for tuning the algorithm:
- We have to be wary of infinite recursion
- More sophisticated filtering is needed (retweet bots, etc.)
- Spam filter integration would be nice, but Twitter's spam filtering is not there yet. Maybe in half a year.
- Exact matches are usually useless; a little deviation is nice.
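On that last point, a fuzzy similarity check is one way to drop near-exact copies while keeping tweets that deviate a little. This is a sketch using the standard library's difflib; the 0.9 cutoff is an arbitrary starting value, not a tuned one:

```python
from difflib import SequenceMatcher

def too_similar(a, b, cutoff=0.9):
    """True when two tweets are near-exact copies of each other.
    The cutoff is an arbitrary starting point, to be tuned later."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= cutoff

# A verbatim copy is filtered out, a reworded take is kept:
print(too_similar("Great article on prototyping!",
                  "great article on prototyping!"))
print(too_similar("Great article on prototyping!",
                  "Prototyping pays off, nice read"))
```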
Then there would be a proposal for the architecture:
- The back end should not be web-based. Although the proof of concept is a PHP webpage, a much better solution would be a Python daemon that continuously consumes Twitter's "firehose" of tweets.
- The daemon should do some simple filtering (eliminating about 90% of tweets), and put the promising ones into a queue.
- Since the actual work on these tweets can be done in parallel, but is rather lengthy compared to the speed at which tweets arrive, there should be multiple worker processes taking jobs from this queue.
- Maybe running on cloud servers that can be booted up as the queue fills up, and shut down as it empties.
- Therefore, a map/reduce-style solution (there are Python implementations) would be the best way to achieve this, backed by
- a fast key-value store like Redis, or a document store like MongoDB.
- The user interface should be web-based (of course), and should use a Comet server to get fresh data from the back end.
- I would also draw a suggested UI layout. (It would look like Seesmic Web.)
I would recommend Python over PHP for the following reasons (and over Java, for other reasons not included here):
- Python was designed with "classic" programs in mind, while PHP has always been a "web language". Therefore, I would trust Python's garbage collection more in a long-running process.
- Like PHP, Python has all the necessary functionality for consuming the Twitter "firehose".
- Python is well-suited to run standalone HTTP server(s) for the Comet functionality.
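As a minimal illustration of that last point, the standard library alone can host a small HTTP endpoint. Real Comet would hold the connection open until fresh data arrives; this sketch just serves the current state as JSON:

```python
import json
import threading
from wsgiref.simple_server import make_server

fresh_tweets = ["example high-value tweet"]    # fed by the back end

def app(environ, start_response):
    """Tiny WSGI app serving the latest results as JSON."""
    body = json.dumps(fresh_tweets).encode("utf-8")
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]

server = make_server("127.0.0.1", 0, app)      # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
print("serving on port", server.server_port)
```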
I can already see several further optimizations that are not worth implementing in the first version, but may improve performance greatly should the need arise, like:
- We don't always need to un-shorten the URLs completely.
- After DNS resolution (which happens anyway when accessing a URL), we know which continent the web server is on. We could use cloud servers in Europe, America, and Japan to minimize the time spent in HTTP calls.
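On the un-shortening point, one approach is to chase redirects only while the host still looks like a shortener, and stop early otherwise. The shortener list here is a small illustrative sample, and `follow_one_hop` (which would issue a single HEAD request in a real implementation) is injected so the sketch stays network-free:

```python
from urllib.parse import urlparse

# Illustrative sample, not an exhaustive list of shortener hosts.
KNOWN_SHORTENERS = {"bit.ly", "t.co", "tinyurl.com", "is.gd"}

def needs_unshortening(url):
    """Only chase redirects while the host is a known shortener;
    once we land on a 'real' domain, stop early."""
    return urlparse(url).hostname in KNOWN_SHORTENERS

def unshorten(url, follow_one_hop, max_hops=5):
    """Resolve at most max_hops redirects, stopping as soon as the
    URL no longer looks shortened. follow_one_hop(url) should return
    the redirect target (one HEAD request in real code)."""
    for _ in range(max_hops):
        if not needs_unshortening(url):
            break
        url = follow_one_hop(url)
    return url

# Stub redirect table instead of live HTTP calls:
hops = {"http://bit.ly/abc": "http://example.com/article"}
print(unshorten("http://bit.ly/abc", hops.get))
```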
So, to answer the original question: how is the project's situation different after spending only 2.5 hours throwing together a prototype?
- Before, it was not even certain that the idea could be implemented. Estimating the effort needed would have been closer to black magic than to responsible quoting.
- After, I have Twitter's API in my head, and I can reliably estimate how difficult it is to program against. (It's pretty easy once you get used to it, actually. An estimate based on this knowledge would be much lower than before.)
- By working on actual tweets, certain properties of the idea surfaced early (like its relative slowness) that can easily be addressed by choosing the right architecture (as detailed above).
- The algorithm was already 50% more accurate than before.
Of course, for certain jobs (the "usual" kind) one can give pretty good estimates off the top of one's head. But for experimental work like this, which has never been done before, a little effort (in this case, only 2.5 hours!) spent on starting to build a prototype can prove invaluable, and is well worth it.