A rant about PHP compilers in general and HipHop in particular »
I’ve heard the argument “you don’t need a compiler, since PHP is rarely the bottleneck” for many years. I think it’s complete bollox. But I wrote a compiler for PHP, so I would say that.
Unless your PHP server is sitting there idling (which is probably the case for many PHP servers out there), then you could make use of a PHP compiler. For small timers, all components of your application are going to be sitting on the same box, contending for the same resources. Even if you assume the DB is the bottleneck, the resources the interpreter consumes could be more profitably spent on the DB.
New version of jsoup released »
I’ve just released version 0.2.2 of jsoup. This release adds some new class name and HTML manipulation methods, improved document normalisation, and nicer HTML pretty-printing.
jsoup is now also available on the Maven central repository, so getting started is easier. See the details on the download page.
jsoup HTML parser launches

Today, I am announcing the public beta launch of jsoup, an open source Java HTML parser that I have been working on recently.
jsoup is a Java library for working with real-world HTML:
- parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list
jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag-soup, jsoup will create a sensible parse tree.
jsoup is an open source project distributed under the liberal MIT license. Source code is available at GitHub.
As of this initial launch, jsoup is immediately useful, and it is in use in several internal projects. But of course it can be made more useful, so please send me your suggestions and thoughts, either to the project’s mailing list or to me directly.
If you would like to contribute code, that would also be welcome.
For more information, and to get started using jsoup, visit the project’s website.
API design matters »
Michi Henning writes about the cost of bad APIs, and how to design good interfaces:
A great way to get usable APIs is to let the customer (namely, the caller) write the function signature, and to give that signature to a programmer to implement. This step alone eliminates at least half of poor APIs: too often, the implementers of APIs never use their own creations, with disastrous consequences for usability. Moreover, an API is not about programming, data structures, or algorithms—an API is a user interface, just as much as a GUI. The user at the using end of the API is a programmer—that is, a human being. Even though we tend to think of APIs as machine interfaces, they are not: they are human-machine interfaces.
Event-driven webserver Tornado is now open source »
FriendFeed has released Tornado, a Python non-blocking event-driven webserver and framework, as open source.
The framework is distinct from most mainstream web server frameworks (and certainly most Python frameworks) because it is non-blocking and reasonably fast. Because it is non-blocking and uses epoll, it can handle thousands of simultaneous standing connections, which means it is ideal for real-time web services. We built the web server specifically to handle FriendFeed’s real-time features — every active user maintains an open connection to the FriendFeed servers.
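To make the non-blocking idea concrete, here is a minimal sketch using only Python's standard `selectors` module (which uses epoll where the platform provides it). It is not Tornado's API or FriendFeed's code; the `echo_once` helper and the use of local socket pairs in place of real client connections are assumptions made for illustration. One loop services every connection that has data ready, rather than dedicating a blocked thread to each.

```python
import selectors
import socket


def echo_once(selector, timeout=1.0):
    """Run one pass of the event loop, replying on every socket that is ready.

    Hypothetical helper for illustration: replies with the upper-cased input.
    """
    handled = 0
    for key, _events in selector.select(timeout):
        conn = key.fileobj
        data = conn.recv(1024)
        if data:
            conn.sendall(data.upper())
            handled += 1
        else:
            # An empty read means the peer closed the connection.
            selector.unregister(conn)
            conn.close()
    return handled


# Three local socket pairs stand in for three open client connections.
selector = selectors.DefaultSelector()
pairs = [socket.socketpair() for _ in range(3)]
for server_side, _client_side in pairs:
    server_side.setblocking(False)
    selector.register(server_side, selectors.EVENT_READ)

# Each "client" sends a message before the loop runs.
for i, (_server_side, client_side) in enumerate(pairs):
    client_side.sendall(b"ping %d" % i)

# A single pass of the loop services all ready connections.
handled = echo_once(selector)
replies = [client_side.recv(1024) for _server_side, client_side in pairs]

# Clean up the sockets used by the sketch.
for server_side, client_side in pairs:
    selector.unregister(server_side)
    server_side.close()
    client_side.close()
selector.close()
```

A real server would call the loop repeatedly and accept new connections through the same selector; the point here is only that idle connections cost nothing until they become readable.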
Lucene 2.9 Release Imminent »
Mark Miller reports that:
The third release candidate for Lucene 2.9 is about to hit and the final release is likely to be only days behind. Almost one year in the making, Lucene 2.9 is feature packed and progressively faster. With Solr 1.4 planning to release very shortly after 2.9, things are shaping up very nicely in Lucene land.
In anticipation of the Solr 1.4 release, Eric Pugh has announced that the first book on Solr, Solr 1.4 Enterprise Search Server, has been published and is available for purchase.
Announcing Unicode Lookup
Over the weekend I built Unicode Lookup, a tool that lets you search for any Unicode character by name, or by codepoint number. A table of the characters with their decimal, octal, hex, and HTML entity representations is shown as results.
The core purpose of the tool is to aid web-development by making it easy to find the HTML entity for any character. It’s also useful for finding a character by class (e.g. math symbols) for copy & paste into documents.
As ASCII is a subset of Unicode, the tool also serves as a full ASCII (and Latin-1 etc) character reference.
Unicode Lookup is based on John Walker’s command-line tool unum.
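For a rough idea of what one row of the results table contains, here is a small sketch using Python's standard `unicodedata` module. It is not the code behind Unicode Lookup or unum; the `describe` function and its field names are made up for illustration, and it emits the numeric form of the HTML entity only.

```python
import unicodedata


def describe(char):
    """Return the name and numeric representations of a single character."""
    codepoint = ord(char)
    return {
        "name": unicodedata.name(char, "<unnamed>"),
        "decimal": str(codepoint),
        "octal": format(codepoint, "o"),
        "hex": format(codepoint, "X"),
        # Numeric character reference; named entities are not covered here.
        "html_entity": "&#{};".format(codepoint),
    }


# Search by name, then build the row for the character that was found.
char = unicodedata.lookup("COPYRIGHT SIGN")
row = describe(char)
```

Looking a character up by codepoint number instead is just `describe(chr(169))`.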
IE 6 and 7 to auto-update to IE8 »
Starting on or about the third week of April, users still running IE6 or IE7 on Windows XP, Windows Vista, Windows Server 2003, or Windows Server 2008 will get a notification through Automatic Update about IE8. This rollout will start with a narrow audience and expand over time to the entire user base. On Windows XP and Server 2003, the update will be High-Priority.
Users can decline the update, and Corporate IT groups can block it, but this is a promising move to bring users up to date, and so to increase web-development efficiency.
Amazon announces Elastic MapReduce »
Amazon Web Services have launched Elastic MapReduce, which is a cloud computing service for on-demand data processing. You’ve been able to do this at Amazon before by running Hadoop on EC2 instances, but this looks to wrap it all up in a convenient product, and make the dynamic scaling easier.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
Languages supported: Java, Ruby, Perl, Python, PHP, R, and C++.
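As a reminder of the programming model being hosted, here is a local, single-process sketch of a map/reduce job for the log file analysis case, in Python. It does not use Elastic MapReduce or Hadoop at all; the log format (path followed by status code) and the function names `map_line`, `reduce_counts`, and `run_job` are assumptions for illustration. On a real cluster the grouping step is done by the framework across many machines.

```python
from collections import defaultdict


def map_line(line):
    """Map step: emit (status_code, 1) for one log line."""
    fields = line.split()
    if len(fields) >= 2:
        yield fields[1], 1


def reduce_counts(key, values):
    """Reduce step: total the counts emitted for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, group by key (the 'shuffle'), then reduce, all in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())


# Assumed log format: "<path> <status code>".
sample_log = ["/index.html 200", "/missing 404", "/about.html 200"]
totals = run_job(sample_log)
```

Because each mapper call sees one line and each reducer call sees one key, the work can be split across as many machines as the data needs, which is what the hosted service provisions for you.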
Preventing errors: Looking for ugly »
Kevin Kelly:
Preventing errors within extremely complicated technological systems is often elusive. The more complex the system, the more complex the pattern of error. But a curious thing happens in systems that are kept relatively error free: as major errors are prevented, it gets more difficult to forecast future major errors — because so few happen! In these kind of mission-critical systems the genesis profile of a major failure may be unknown because major failures are so rare.

