Links
jsoup 0.3.1 released »
I’ve just released version 0.3.1 of jsoup, the Java library for working with real-world HTML.
This version adds bulk HTML methods to the Elements collection, supports easy form validation of HTML user input, improves bulk attribute matching, and includes fixes for some minor bugs.
A hearty thanks to everyone who has tried jsoup and written in to me or to the mailing list with their experiences. Your input is directly shaping jsoup for the better.
A rant about PHP compilers in general and HipHop in particular »
I’ve heard the argument “you don’t need a compiler, since PHP is rarely the bottleneck” for many years. I think it’s complete bollox. But I wrote a compiler for PHP, so I would say that.
Unless your PHP server is sitting there idling (which is probably the case for many PHP servers out there), then you could make use of a PHP compiler. For small timers, all components of your application are going to be sitting on the same box, contending for the same resources. Even if you assume the DB is the bottleneck, the resources the interpreter consumes could be more profitably spent on the DB.
New version of jsoup released »
I’ve just released version 0.2.2 of jsoup. This release adds some new class name and HTML manipulation methods, improved document normalisation, and nicer HTML pretty-printing.
jsoup is now also available on the Maven central repository, so getting started is easier. See the details on the download page.
API design matters »
Michi Henning writes about the cost of bad APIs, and how to design good interfaces:
A great way to get usable APIs is to let the customer (namely, the caller) write the function signature, and to give that signature to a programmer to implement. This step alone eliminates at least half of poor APIs: too often, the implementers of APIs never use their own creations, with disastrous consequences for usability. Moreover, an API is not about programming, data structures, or algorithms—an API is a user interface, just as much as a GUI. The user at the using end of the API is a programmer—that is, a human being. Even though we tend to think of APIs as machine interfaces, they are not: they are human-machine interfaces.
Event-driven webserver Tornado is now open source »
FriendFeed has released Tornado, a Python non-blocking event-driven webserver and framework, as open source.
The framework is distinct from most mainstream web server frameworks (and certainly most Python frameworks) because it is non-blocking and reasonably fast. Because it is non-blocking and uses epoll, it can handle thousands of simultaneous standing connections, which means it is ideal for real-time web services. We built the web server specifically to handle FriendFeed’s real-time features — every active user maintains an open connection to the FriendFeed servers.
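To make the non-blocking idea concrete, here is a minimal sketch of a single-threaded event loop that keeps many connections open at once. It is not Tornado's code or API: the `EchoLoop` class and its method names are made up for illustration, and it uses only Python's standard `selectors` module (which picks epoll on Linux when available).

```python
import selectors
import socket


class EchoLoop:
    """Hypothetical single-threaded loop that echoes bytes on every connection."""

    def __init__(self, host="127.0.0.1", port=0):
        # DefaultSelector uses the best mechanism the OS offers (epoll on Linux).
        self.selector = selectors.DefaultSelector()
        self.listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self.listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        self.listener.bind((host, port))
        self.listener.listen()
        self.listener.setblocking(False)
        # The callback to run when the socket is readable is stored as key.data.
        self.selector.register(self.listener, selectors.EVENT_READ, self._accept)
        self.address = self.listener.getsockname()
        self.open_connections = 0

    def _accept(self, sock):
        conn, _ = sock.accept()
        conn.setblocking(False)
        self.selector.register(conn, selectors.EVENT_READ, self._echo)
        self.open_connections += 1

    def _echo(self, conn):
        data = conn.recv(4096)
        if data:
            # Good enough for small messages in a sketch; a real server
            # would buffer partial writes instead of calling sendall here.
            conn.sendall(data)
        else:
            # An empty read means the peer closed the connection.
            self.selector.unregister(conn)
            conn.close()
            self.open_connections -= 1

    def run_once(self, timeout=0.1):
        # Wait for ready sockets, then dispatch each one to its callback.
        for key, _ in self.selector.select(timeout):
            key.data(key.fileobj)

    def close(self):
        for key in list(self.selector.get_map().values()):
            self.selector.unregister(key.fileobj)
            key.fileobj.close()
        self.selector.close()
```

Calling `run_once()` in a loop serves every registered connection from one thread. An idle connection costs only a registered file descriptor, which is why this style suits long-lived, mostly quiet connections.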
Lucene 2.9 Release Imminent »
Mark Miller reports that:
The third release candidate for Lucene 2.9 is about to hit and the final release is likely to be only days behind. Almost one year in the making, Lucene 2.9 is feature packed and progressively faster. With Solr 1.4 planning to release very shortly after 2.9, things are shaping up very nicely in Lucene land.
In anticipation of the Solr 1.4 release, Eric Pugh has announced that the first book on Solr, Solr 1.4 Enterprise Search Server, has been published and is available for purchase.
IE 6 and 7 to auto-update to IE8 »
Starting on or about the third week of April, users still running IE6 or IE7 on Windows XP, Windows Vista, Windows Server 2003, or Windows Server 2008 will get a notification through Automatic Update about IE8. This rollout will start with a narrow audience and expand over time to the entire user base. On Windows XP and Server 2003, the update will be High-Priority.
Users can decline the update, and Corporate IT groups can block it, but this is a promising move to bring users up to date, and so to increase web-development efficiency.
Amazon announces Elastic MapReduce »
Amazon Web Services have launched Elastic MapReduce, which is a cloud computing service for on-demand data processing. You’ve been able to do this at Amazon before by running Hadoop on EC2 instances, but this looks to wrap it all up in a convenient product, and make the dynamic scaling easier.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit.
Languages supported: Java, Ruby, Perl, Python, PHP, R, and C++.
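For readers new to the model, here is a rough in-process sketch of the map, shuffle, and reduce steps that a Hadoop job performs, using word counting as the task. It is not the Elastic MapReduce or Hadoop API: the function names are invented for illustration, and everything runs in one Python process rather than across a cluster.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster the mappers and reducers run on different machines and the framework handles the shuffle, which is the set-up and tuning work the service promises to take off your hands.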
Preventing errors: Looking for ugly »
Kevin Kelly:
Preventing errors within extremely complicated technological systems is often elusive. The more complex the system, the more complex the pattern of error. But a curious thing happens in systems that are kept relatively error free: as major errors are prevented, it gets more difficult to forecast future major errors, because so few happen! In these kinds of mission-critical systems the genesis profile of a major failure may be unknown because major failures are so rare.

