<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Jonathan Hedley &#187; web development</title>
	<atom:link href="http://jonathanhedley.com/tag/web-development/feed" rel="self" type="application/rss+xml" />
	<link>http://jonathanhedley.com</link>
	<description>Winning at everything so that you don&#039;t have to.</description>
	<lastBuildDate>Wed, 18 Aug 2010 10:25:32 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>jsoup HTML parser launches</title>
		<link>http://jonathanhedley.com/articles/2010/01/jsoup-html-parser-launches</link>
		<comments>http://jonathanhedley.com/articles/2010/01/jsoup-html-parser-launches#comments</comments>
		<pubDate>Sun, 31 Jan 2010 06:44:36 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[html]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[jsoup]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/?p=218</guid>
		<description><![CDATA[Today, I am announcing the public beta launch of jsoup, an open source Java HTML parser for dealing with real-world HTML.]]></description>
			<content:encoded><![CDATA[<p><a href="http://jsoup.org/"><img src="http://static.jonathanhedley.com/2010/01/jsoup-html-parser.png" alt="jsoup HTML parser screenshot" width="430" height="240" style="border: 1px solid black; margin-bottom: 5px" class="side-cookbook" /></a><br />
Today, I am <a href="http://jsoup.org/news/jsoup-launches">announcing</a> the public beta launch of <code><strong><a href="http://jsoup.org/" title="jsoup Java HTML parser">jsoup</a></strong></code>, an open source Java HTML parser that I have been working on recently.</p>
<p><code>jsoup</code> is a Java library for working with real-world HTML:</p>
<ul>
<li>parse HTML from a URL, file, or string</li>
<li>find and extract data, using DOM traversal or CSS selectors</li>
<li>manipulate the HTML elements, attributes, and text</li>
<li>clean user-submitted content against a safe white-list</li>
</ul>
<p>jsoup is designed to deal with all varieties of HTML found in the wild: from pristine and validating to invalid tag-soup, jsoup will create a sensible parse tree.</p>
<p>jsoup is an <strong>open source project</strong> distributed under the liberal <a href="http://jsoup.org/license">MIT license</a>. Source code is available at <a href="http://github.com/jhy/jsoup/">GitHub</a>.</p>
<p>jsoup is already useful at this initial launch, and it is in use in several internal projects. But of course it can be made more useful, so please send me your suggestions and thoughts, either via the project&#8217;s <a href="http://jsoup.org/discussion">mailing list</a> or to <a href="http://jonathanhedley.com/contact">me directly</a>.</p>
<p>If you would like to contribute code, that would also be welcome.</p>
<p>For more information, and to get started using jsoup, visit the <a href="http://jsoup.org/">project&#8217;s website</a>.</p>
<div class="rhs">
<div class="side-cookbook">
<p><a href="http://jsoup.org/cookbook/extracting-data/selector-syntax">Use selector-syntax to find elements</a></p>
</div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/articles/2010/01/jsoup-html-parser-launches/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Event-driven webserver Tornado is now open source</title>
		<link>http://jonathanhedley.com/links/2009/09/tornado-webserver</link>
		<comments>http://jonathanhedley.com/links/2009/09/tornado-webserver#comments</comments>
		<pubDate>Sat, 12 Sep 2009 02:37:49 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Links]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[tornado]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/?p=207</guid>
		<description><![CDATA[FriendFeed has released Tornado, a Python non-blocking event-driven webserver and framework, as open source.

The framework is distinct from most mainstream web server frameworks (and certainly most Python frameworks) because it is non-blocking and reasonably fast. Because it is non-blocking and uses epoll, it can handle thousands of simultaneous standing connections, which means it is ideal [...]]]></description>
			<content:encoded><![CDATA[<p>FriendFeed has released <a href="http://www.tornadoweb.org/">Tornado</a>, a Python non-blocking event-driven webserver and framework, as open source.</p>
<blockquote><p>
The framework is distinct from most mainstream web server frameworks (and certainly most Python frameworks) because it is non-blocking and reasonably fast. Because it is <span class="sb-non-blocking">non-blocking</span> and uses epoll, it can handle thousands of simultaneous standing connections, which means it is ideal for real-time web services. We built the web server specifically to handle FriendFeed&#8217;s real-time features — every active user maintains an open connection to the FriendFeed servers.
</p></blockquote>
<div class="sidebar">
<p class="sb-non-blocking"><a href="http://en.wikipedia.org/wiki/Non-blocking_synchronization">Non-blocking synchronization</a> ensures that threads competing for a shared resource do not have their execution indefinitely postponed by mutual exclusion.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/links/2009/09/tornado-webserver/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Lucene 2.9 Release Imminent</title>
		<link>http://jonathanhedley.com/links/2009/09/lucene-2-9-release</link>
		<comments>http://jonathanhedley.com/links/2009/09/lucene-2-9-release#comments</comments>
		<pubDate>Mon, 07 Sep 2009 04:19:26 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Links]]></category>
		<category><![CDATA[enterprise architecture]]></category>
		<category><![CDATA[lucene]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/?p=183</guid>
		<description><![CDATA[Mark Miller reports that:
The third release candidate for Lucene 2.9 is about to hit and the final release is likely to be only days behind. Almost one year in the making, Lucene 2.9 is feature packed and progressively faster. With Solr 1.4 planning to release very shortly after 2.9, things are shaping up very nicely [...]]]></description>
			<content:encoded><![CDATA[<div class="left-pull plain-bg">
<a href="http://www.amazon.com/gp/product/1847195881?ie=UTF8&#038;tag=904351-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=1847195881" class="side-book"><br />
<img src="http://static.jonathanhedley.com/2009/09/solr-enterprise-search.jpg" alt="SOLR Enterprise Search book" width="130" height="160" /><br />
</a><br />
<img src="http://www.assoc-amazon.com/e/ir?t=904351-20&#038;l=as2&#038;o=1&#038;a=1847195881" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />
</div>
<p><cite><a title="Posts by Mark Miller" href="http://www.lucidimagination.com/blog/author/markmiller/">Mark Miller</a></cite> reports that:</p>
<blockquote><p>The third release candidate for <a href="http://lucene.apache.org/">Lucene</a> 2.9 is about to hit and the final release is likely to be only days behind. Almost one year in the making, Lucene 2.9 is feature packed and progressively faster. With <a href="http://lucene.apache.org/solr/">Solr</a> 1.4 planning to release very shortly after 2.9, things are shaping up very nicely in Lucene land.</p></blockquote>
<p>In anticipation of the Solr 1.4 release, <a href="http://www.opensourceconnections.com/2009/08/19/solr-1.4-enterprise-search-server-book-is-released/">Eric Pugh</a> has announced that the first book on Solr, <a href="http://www.amazon.com/gp/product/1847195881?ie=UTF8&#038;tag=904351-20&#038;linkCode=as2&#038;camp=1789&#038;creative=390957&#038;creativeASIN=1847195881">Solr 1.4 Enterprise Search Server</a>, has been published and is available for purchase.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/links/2009/09/lucene-2-9-release/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Announcing Unicode Lookup</title>
		<link>http://jonathanhedley.com/articles/2009/04/announcing-unicode-lookup</link>
		<comments>http://jonathanhedley.com/articles/2009/04/announcing-unicode-lookup#comments</comments>
		<pubDate>Thu, 16 Apr 2009 11:40:46 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[file formats]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[web development]]></category>
		<category><![CDATA[web standards]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/?p=177</guid>
		<description><![CDATA[Over the weekend I built <a href="http://unicodelookup.com/"><strong>Unicode Lookup</strong></a>, a tool that lets you search for any Unicode character by name, or by codepoint number. A table of the characters with their decimal, octal, hex, and HTML entity representations is shown as results.]]></description>
			<content:encoded><![CDATA[<p><a href='http://unicodelookup.com/'><img src="http://static.jonathanhedley.com/2009/04/unicode-lookup-table.png" alt="Unicode character result table" title="Unicode Lookup" width="430" height="240" style="border: 1px solid black" /></a></p>
<p>Over the weekend I built <a href="http://unicodelookup.com/"><strong>Unicode Lookup</strong></a>, a tool that lets you search for any Unicode character by name, or by codepoint number. A table of the characters with their decimal, octal, hex, and HTML entity representations is shown as results.</p>
<p>The core purpose of the tool is to aid web-development by making it easy to find the HTML entity for any character. It&#8217;s also useful for finding a character by class (e.g. math symbols) for copy &#038; paste into documents.</p>
<p>As ASCII is a subset of Unicode, the tool also serves as a full ASCII (and Latin-1 etc) character reference.</p>
<p><a href="http://unicodelookup.com/">Unicode Lookup</a> is based on John Walker&#8217;s command-line tool <a href="http://www.fourmilab.ch/webtools/unum/"><code>unum</code></a>.</p>
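<p>As a rough illustration of the kind of result row described above, here is a minimal Python sketch built on the standard library. It is not the code behind Unicode Lookup or <code>unum</code>, and the function name <code>describe</code> is made up for this example.</p>

```python
import unicodedata
from html.entities import codepoint2name


def describe(char):
    """Return the name and common representations of a single character."""
    code = ord(char)
    # Prefer a named HTML entity when one exists, else a numeric reference.
    entity = codepoint2name.get(code)
    return {
        "name": unicodedata.name(char, "UNKNOWN"),
        "decimal": code,
        "octal": format(code, "o"),
        "hex": format(code, "X"),
        "html": "&" + entity + ";" if entity else "&#" + str(code) + ";",
    }


# Look a character up by its Unicode name, then print its row.
row = describe(unicodedata.lookup("COPYRIGHT SIGN"))
print(row)
```

<p>Searching by partial name would need a scan over all codepoints, which this sketch leaves out.</p>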
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/articles/2009/04/announcing-unicode-lookup/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>IE 6 and 7 to auto-update to IE8</title>
		<link>http://jonathanhedley.com/links/2009/04/ie8-automatic-update</link>
		<comments>http://jonathanhedley.com/links/2009/04/ie8-automatic-update#comments</comments>
		<pubDate>Mon, 13 Apr 2009 00:07:35 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Links]]></category>
		<category><![CDATA[ie8]]></category>
		<category><![CDATA[microsoft]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/?p=176</guid>
		<description><![CDATA[Starting on or about the third week of April, users still running IE6 or IE7 on Windows XP, Windows Vista, Windows Server 2003, or Windows Server 2008 will get a notification through Automatic Update about IE8. This rollout will start with a narrow audience and expand over time to the entire user base. [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>Starting on or about the third week of April, users still running IE6 or IE7 on Windows XP, Windows Vista, Windows Server 2003, or Windows Server 2008 will get a notification through Automatic Update about IE8. This rollout will start with a narrow audience and expand over time to the entire user base. On Windows XP and Server 2003, the update will be High-Priority.</p></blockquote>
<p>Users can decline the update, and Corporate IT groups can block it, but this is a promising move to bring users up to date, and so to increase web-development efficiency.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/links/2009/04/ie8-automatic-update/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Amazon adds sort feature to SimpleDB</title>
		<link>http://jonathanhedley.com/links/2008/08/amazon-adds-sort-feature-to-simpledb</link>
		<comments>http://jonathanhedley.com/links/2008/08/amazon-adds-sort-feature-to-simpledb#comments</comments>
		<pubDate>Fri, 01 Aug 2008 00:34:26 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Links]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[simpledb]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/?p=163</guid>
		<description><![CDATA[Amazon AWS SimpleDB now supports sortable query result sets. Previously query results came back in insertion order only, but now you can sort on (only) one attribute. This makes a lot of standard relational DB use-cases more feasible for implementation in SimpleDB, as it makes for less data post-processing. 
Sorting on only one attribute is [...]]]></description>
			<content:encoded><![CDATA[<p>Amazon AWS SimpleDB now supports <a href="http://docs.amazonwebservices.com/AmazonSimpleDB/2007-11-07/DeveloperGuide/index.html?SortingData.html">sortable query result sets</a>. Previously query results came back in insertion order only, but now you can sort on (only) one attribute. This makes a lot of standard relational DB use-cases more feasible for implementation in SimpleDB, as it makes for less data post-processing. </p>
<p>Sorting on only one attribute is still quite limiting, though, and queries still only return object IDs, which forces many further queries to retrieve the full data-set.</p>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/links/2008/08/amazon-adds-sort-feature-to-simpledb/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Review: Programming Collective Intelligence</title>
		<link>http://jonathanhedley.com/articles/2008/05/programming-collective-intelligence</link>
		<comments>http://jonathanhedley.com/articles/2008/05/programming-collective-intelligence#comments</comments>
		<pubDate>Sun, 04 May 2008 06:44:29 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[reading list]]></category>
		<category><![CDATA[review]]></category>
		<category><![CDATA[semantic analysis]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/?p=132</guid>
		<description><![CDATA[Programming Collective Intelligence is a book about applying data mining techniques to analyse collections of data. There is submerged information in Ebay prices, in Facebook profile networks, in collections of movie reviews, in news sites, in the stockmarket; this book by Toby Segaran shows ways to extract, visualise, understand, and predict that information.]]></description>
			<content:encoded><![CDATA[<div class="left-pull thumb"><a href="http://www.amazon.com/dp/0596529325?tag=904351-20&amp;camp=0&amp;creative=0&amp;linkCode=as1&amp;creativeASIN=0596529325&amp;adid=0D9S71JN6Q4F6ZSA3V5R&amp;"><img src="http://static.jonathanhedley.com/2008/05/programming-collective-intelligence2.jpg" border="0" alt="programming collective intelligence" width="208" height="274" /></a></div>
<p><a href="http://www.amazon.com/dp/0596529325?tag=904351-20&amp;camp=0&amp;creative=0&amp;linkCode=as1&amp;creativeASIN=0596529325&amp;adid=0D9S71JN6Q4F6ZSA3V5R&amp;">Programming Collective Intelligence</a> is a book about applying data mining techniques to analyse collections of data. There is submerged information in Ebay prices, in Facebook profile networks, in collections of movie reviews, in news sites, in the stockmarket; this book by <span class="ts">Toby Segaran</span> shows ways to extract, visualise, understand, and predict that information.</p>
<p>Each chapter explains and explores a different data mining algorithm, and builds up a working example in Python, while presenting different methods and parameters of the implementation. I hadn&#8217;t really worked with Python before, but found the code easy to follow, and picked up some interesting Python idioms that I haven&#8217;t seen in other languages before. Chapters end with a set of exercises to follow that build your understanding.</p>
<p>As you follow the examples you build up a reasonably generic code base that allows you to swap in and out different implementations, and reuse previous code to add to new applications.</p>
<p>The examples use live data from the web, from sites like Ebay, Facebook, and Yahoo Finance. This makes the book more interesting, and the results more visceral, than other books on the subject that rely on contrived or obscure examples. Even though the examples have a strong web (or web 2.0) focus, the methods and the understanding are useful for a whole range of applications.</p>
<p>Some of the topics covered:</p>
<ul>
<li>Bayesian classifiers to detect spam, or to file news articles into site sections</li>
<li>Hierarchical and k-means clustering to discover groups of similar items in massive sets</li>
<li>Euclidean distance, Pearson Correlation Coefficient, Tanimoto Coefficient: ways to measure the distance (or difference) between items</li>
<li>Neural networks to predict user behaviour and improve search result ordering</li>
<li>Optimisation methods like hill climbing, simulated annealing, and genetic algorithms</li>
<li>Non-negative matrix factorization</li>
<li>Support vector machines and kernel methods to go where linear regression can&#8217;t</li>
</ul>
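<p>To give a flavour of the distance measures in that list, here is a minimal Python sketch of the three, written from their standard definitions rather than taken from the book. The function names are invented for this example, and the Tanimoto version shown is the set-based form.</p>

```python
from math import sqrt


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation coefficient: 1.0 means perfectly correlated."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    spread_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    if spread_a == 0 or spread_b == 0:
        return 0.0  # a constant vector has no defined correlation
    return covariance / (spread_a * spread_b)


def tanimoto_coefficient(a, b):
    """Shared items divided by all distinct items across both sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # treat two empty sets as identical
    return len(a.intersection(b)) / len(a.union(b))
```

<p>Euclidean distance is sensitive to scale, while Pearson correlation only looks at whether two sets of values move together, which is why the two can rank the same items quite differently.</p>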
<p>I found it exciting to read &#8212; it&#8217;s one of those books that give you a whole bunch of new ideas for things to build as you read it. The presentation is very good: no background is assumed, and it doesn&#8217;t talk down to those more experienced.</p>
<p>Recommended.</p>
<div class="rhs">
<div class="ts"><a href="http://blog.kiwitobes.com/">The author&#8217;s blog</a></div>
</div>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/articles/2008/05/programming-collective-intelligence/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Amazon adds persistent storage to EC2</title>
		<link>http://jonathanhedley.com/links/2008/04/amazon-adds-persistent-storage-to-ec2</link>
		<comments>http://jonathanhedley.com/links/2008/04/amazon-adds-persistent-storage-to-ec2#comments</comments>
		<pubDate>Mon, 14 Apr 2008 05:42:32 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Links]]></category>
		<category><![CDATA[aws]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/?p=94</guid>
		<description><![CDATA[Amazon is adding persistent storage as an option to EC2 &#8212; currently it&#8217;s in private beta.
Previously, disk storage on an EC2 instance was transient: when the machine was shut down or crashed, it felt like a hard drive crash. (And you&#8217;d lose your IP address as well, but Amazon added static IPs a little while ago.) [...]]]></description>
			<content:encoded><![CDATA[<p>Amazon is <a href="http://aws.typepad.com/aws/2008/04/block-to-the-fu.html">adding persistent storage</a> as an option to EC2 &#8212; currently it&#8217;s in private beta.</p>
<p>Previously, disk storage on an EC2 instance was transient: when the machine was shut down or crashed, it felt like a hard drive crash. (And you&#8217;d lose your IP address as well, but <a href="http://jonathanhedley.com/links/2008/03/ec2-static-ip-addresses">Amazon added static IPs</a> a little while ago.) <span class="davfs">The path to reliability was to use S3, but that can&#8217;t be mounted as a native file system.</span></p>
<p>The persistent storage appears as a raw, mountable filesystem that needs to be formatted. You&#8217;ll be able to make a quick snapshot of the data, for backup. No word on pricing or its performance, but you&#8217;d expect it to be aligned with S3.</p>
<div class="rhs">
<p class="davfs">There&#8217;s been the option of mounting <a href="http://www.cantinaconsulting.com/2007/12/08/amazon-ec2-first-impressions-mounting-s3/">S3 in EC2 using davfs</a>, which mounts with WebDAV, but that&#8217;s a bit of a hack and one wonders what the performance would be like.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/links/2008/04/amazon-adds-persistent-storage-to-ec2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scalr: auto-scaling web app hosting in EC2</title>
		<link>http://jonathanhedley.com/links/2008/04/scalr-auto-scaling-web-app-hosting-in-ec2</link>
		<comments>http://jonathanhedley.com/links/2008/04/scalr-auto-scaling-web-app-hosting-in-ec2#comments</comments>
		<pubDate>Sat, 05 Apr 2008 06:19:46 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Links]]></category>
		<category><![CDATA[cloud computing]]></category>
		<category><![CDATA[ec2]]></category>
		<category><![CDATA[scalr]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/links/2008/04/scalr-auto-scaling-web-app-hosting-in-ec2</guid>
		<description><![CDATA[Scalr is a fully redundant, self-curing and self-scaling hosting environment utilizing Amazon&#8217;s EC2.
It allows you to create server farms through a web-based interface using prebuilt AMI&#8217;s for load balancers (pound or nginx), app servers (apache, others), databases (mysql master-slave, others), and a generic AMI to build on top of.

The project is still very young, but [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p><a href="http://code.google.com/p/scalr/">Scalr</a> is a fully redundant, self-curing and self-scaling hosting environment utilizing <span class="ec2">Amazon&#8217;s EC2.</span></p>
<p>It allows you to create server farms through a web-based interface using prebuilt AMI&#8217;s for load balancers (pound or nginx), app servers (apache, others), databases (mysql master-slave, others), and a generic AMI to build on top of.</p>
</blockquote>
<blockquote class="intridea"><p>The project is still very young, but we&#8217;re hoping that by open sourcing it the AWS development community can turn this into a robust hosting platform and give users an alternative to the current fee based services available.</p></blockquote>
<p>This looks like it could be great when it develops. I kind of think that Amazon themselves should be providing this kind of executive service to auto-scale and -heal an application deployed in their grid (and wouldn&#8217;t be surprised if they add it as their service matures).</p>
<div class="rhs">
<p class="ec2"><a href="http://www.amazon.com/EC2-AWS-Service-Pricing/b/ref=sc_fe_l_2?ie=UTF8&amp;node=201590011&amp;no=3440661">Elastic Compute Cloud</a></p>
<p class="intridea">&#8220;We&#8221; being <a href="http://www.intridea.com/">Intridea</a>, a web dev shop</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/links/2008/04/scalr-auto-scaling-web-app-hosting-in-ec2/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How-to: Optimize your site for speed</title>
		<link>http://jonathanhedley.com/articles/2008/04/guide-to-website-speed-optimization</link>
		<comments>http://jonathanhedley.com/articles/2008/04/guide-to-website-speed-optimization#comments</comments>
		<pubDate>Tue, 01 Apr 2008 10:56:56 +0000</pubDate>
		<dc:creator>Jonathan Hedley</dc:creator>
				<category><![CDATA[Articles]]></category>
		<category><![CDATA[how-to]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[site optimization]]></category>
		<category><![CDATA[tips]]></category>
		<category><![CDATA[web development]]></category>

		<guid isPermaLink="false">http://jonathanhedley.com/articles/2008/04/guide-to-website-speed-optimization</guid>
		<description><![CDATA[Does your website load as quickly as you -- and your users -- would like? If not, here's a detailed set of proven guidelines aimed at improving the speed of your site.]]></description>
			<content:encoded><![CDATA[<p><em>Does your website load as quickly as you &#8212; and your users &#8212; would like? If not, here&#8217;s a detailed set of proven guidelines aimed at improving the speed of your site.</em></p>
<p>The benefits of speed optimized pages:</p>
<ol>
<li>Your visitors will be happier, and will feel much more engaged on a snappy site than a slow one. User interface responsiveness is a very large contributing factor to how users trust your content and company. Users trust and enjoy fast sites, and are quickly frustrated by slow sites.</li>
<li>The faster that you can serve content to your visitors, the faster they&#8217;ll be off your servers, leaving them free to serve the next visitor. That means that you can handle greater traffic loads, with less hardware.</li>
<li>Smaller overall downloads mean lower bandwidth bills at the end of the month.</li>
</ol>
<p class="project">You can often get quite massive improvements with just a few tweaks. A few years ago I worked on a project for the <a href="http://www.smh.com.au" class="smh">Sydney Morning Herald</a> that reduced the time to display the first story on the homepage on a modem from around 17 seconds down to 2 seconds. Broadband connections had a similar relative speed improvement.</p>
<p><span id="more-54"></span></p>
<h2>Summary</h2>
<p>In order, biggest improvements first:</p>
<ol>
<li>Reduce the total number of HTTP requests
<ol>
<li>Build style sheet and JavaScript libraries</li>
<li>Combine images into CSS sprites</li>
<li>Enable intelligent caching: send expiry headers, and support cache validation</li>
</ol>
</li>
<li>Support progressive page rendering
<ol>
<li>CSS at the top of the page, JavaScript at the bottom</li>
<li>Don&#8217;t use document.write, and minimise calls to external JavaScript</li>
<li>Use AJAX to load complex secondary page data out of band</li>
</ol>
</li>
<li>Reduce the overall download size
<ol>
<li>Compress text files (HTML, CSS, and JavaScript) on-the-fly with gzip</li>
<li>Minify JavaScript</li>
</ol>
</li>
<li>Speed up the backend application</li>
</ol>
<p>That looks pretty straightforward (and maybe you&#8217;re thinking that this is all a blinding flash of the obvious): that&#8217;s because it is. But these techniques can bring some great improvements, and you might find some interesting ideas in the details.</p>
<h2>Methodology</h2>
<p>Before you get started: define a goal for the optimization. Having an explicit goal means that you&#8217;ll know when you&#8217;ve finished, and it gives a target to aim towards. An example goal might be that a user, with a cold cache, can read a story on the homepage within 2 seconds, and the whole page is downloaded within 8 seconds, on a connection that is typical for your users.</p>
<p>Whatever your goal is, make it precise and clear. It&#8217;s good to have a stretch goal, but it still needs to be attainable.</p>
<p>As with any optimization, it&#8217;s crucial to measure your initial state, and the results of each modification. Otherwise it&#8217;s impossible to quantify any improvement. Not all of these tweaks will give the same level of improvement, and depending on your environment, some might not be worth implementing. Measure, and you will be able to make a sound decision.</p>
<p>Start by recording the current download timings of the site / page you&#8217;re optimizing: the time to the first main content, the time until the page is interactive, and the time until the download is complete. Repeat each timing run at least 3 times so that a single outlier doesn&#8217;t skew the result. Time both with cold caches (a full reload) and with warm caches.</p>
<p>To give a detailed picture of what the browser is doing, use tools like <a href="http://www.getfirebug.com/" class="firebug">Firebug&#8217;s</a> network tool, <a href="http://www.xk72.com/charles/" class="charles">Charles Proxy</a> or <a href="http://www.wireshark.org" class="wireshark">Wireshark</a>, and review the server logs. It&#8217;s important to be able to watch the browser hitting the server in real-time by tailing the logs: it lets you verify that your test has been implemented correctly.</p>
<p>After each tweak, run a set of timings again, and keep a spreadsheet log of what you did and what the impact was.</p>
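<p>The measure-and-log routine above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: <code>time_runs</code> and <code>log_result</code> are hypothetical helpers, a CSV file stands in for the spreadsheet, and the median is used as the summary figure.</p>

```python
import csv
import statistics
import time


def time_runs(action, runs=3):
    """Call action() several times and return each duration in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        durations.append(time.perf_counter() - start)
    return durations


def log_result(path, tweak, durations):
    """Append one row per tweak: its name, the median, and the raw timings."""
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow([tweak, statistics.median(durations)] + durations)
```

<p>Here <code>action</code> can be anything that loads the page, for example a function that calls <code>urllib.request.urlopen(url).read()</code>. Note that a plain fetch like that only times the base HTML, not the images, CSS and JavaScript a browser would also load, so it complements the browser tools rather than replacing them.</p>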
<h2>Reduce the number of HTTP requests</h2>
<p>On most sites, the major component of download time is not the base HTML file, but the subsequent HTTP requests that load the page&#8217;s supporting files: the CSS, the JavaScript, the site furniture graphics, the pictures, etc. Each of those is an extra HTTP request, and each unique request takes a relatively long time. The fewer requests to the server that the browser has to make, the faster the page will download.</p>
<p>There is an inherent overhead in each HTTP request. It takes substantially less time to serve one 30K file than it does three 10K files. While HTTP keepalives are useful (and you should ensure they are enabled), they don&#8217;t help as much as I had expected.</p>
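<p>A back-of-the-envelope model shows why the request count matters. The sketch below assumes requests are made one after another, and the 0.2 second overhead and 100 KB/s bandwidth are illustrative guesses, not measurements; real browsers fetch a few files in parallel, so treat it as an upper bound on the effect.</p>

```python
def download_seconds(file_sizes_kb, overhead_s=0.2, bandwidth_kb_per_s=100.0):
    """Estimate serial download time: a fixed cost per request plus transfer."""
    transfer = sum(file_sizes_kb) / bandwidth_kb_per_s
    return len(file_sizes_kb) * overhead_s + transfer


one_file = download_seconds([30])             # one request
three_files = download_seconds([10, 10, 10])  # same bytes, three requests
print(one_file, three_files)
```

<p>The transfer time is identical in both cases; only the per-request overhead grows with the number of files.</p>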
<h3>Combine files into libraries</h3>
<p class="left-tick">Most sites will have a few CSS files, a few JavaScript files, and certainly many graphics that make up the site furniture. <strong>Combine the files of each type into a library.</strong></p>
<p>CSS and JavaScript libraries can be created simply by concatenating the files into one combined file (one for CSS and one for JavaScript, obviously). You can quickly go from 10 or more files (that are needed before much can be shown) down to 2.</p>
<p>Some sites run a hierarchical setup where there are specific CSS and JavaScript files for each level in the hierarchy (e.g. you might have: core.css, home.css, technology.css, gadgets.css); and the browser needs to load them all to display a deep page. This might have been set up in an attempt at improving cacheability. It is nearly always better to have the browser download one specific file than it is to hope to have some already in cache (and probably have to load two or three extras anyway).</p>
<p>To keep the modularity that comes with splitting these files out by section (or business unit), keep them split in your development process, and combine them in your build process. A simple Ant task will combine them. Alternatively, use <span class="css-lib">custom code</span> to combine the files on the fly, when presented with a URL like core.css,home.css,technology.css.</p>
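<p>As a rough sketch of both approaches in Python (an Ant task would do the same job at build time), the first function concatenates files into a library and the second parses a comma-separated URL for on-the-fly combining. Both function names are hypothetical, and the whitelist check is an assumption added here so that a request cannot name arbitrary files on the server.</p>

```python
from pathlib import Path


def combine_files(paths, output):
    """Concatenate source files, in order, into one library file."""
    parts = []
    for path in paths:
        # A comment marking each source keeps the combined file debuggable.
        parts.append("/* " + Path(path).name + " */\n" + Path(path).read_text())
    Path(output).write_text("\n".join(parts))


def names_from_combo_url(url_path, allowed):
    """Split 'core.css,home.css' into names, rejecting anything unexpected."""
    names = url_path.lstrip("/").split(",")
    unknown = [name for name in names if name not in allowed]
    if unknown:
        raise ValueError("unknown files requested: " + ", ".join(unknown))
    return names
```

<p>Order matters when concatenating: later CSS rules override earlier ones, and JavaScript files may depend on code defined before them, so pass the files in the same order the page used to load them.</p>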
<p class="left-tick sprite"><strong>For images: use the CSS sprite technique</strong>. Briefly, the images are added to one larger image file, and laid out in a convenient way. A CSS background with a specific top and left offset is then used to show each specific graphic where required. This works best for static page furniture; it is difficult to set this up for more dynamic content like news photos.</p>
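<p>The offset bookkeeping for a sprite is easy to automate. The Python sketch below assumes the simplest layout, with the images stacked vertically in one file, and generates a CSS rule per image; the function name, class names and <code>sprite.png</code> are placeholders, and assembling the image file itself is left to an image tool.</p>

```python
def sprite_css(images, sprite_url="sprite.png"):
    """Stack images vertically and return one CSS rule per image.

    images is a list of (class_name, width, height) tuples in pixels.
    """
    rules = []
    top = 0
    for class_name, width, height in images:
        # A negative offset shifts the sprite up so this image shows through.
        rules.append(
            ".%s { width: %dpx; height: %dpx; "
            "background: url(%s) 0 %dpx no-repeat; }"
            % (class_name, width, height, sprite_url, -top)
        )
        top += height
    return "\n".join(rules)


print(sprite_css([("logo", 120, 40), ("rss-icon", 16, 16)]))
```

<p>Each element then needs only its class to show the right graphic, and the browser makes a single request for the whole set.</p>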
<h3>Make files cacheable</h3>
<p>Once you have reduced the total number of unique files required for the page, make what remains cacheable.</p>
<p>Caches mean that files often don&#8217;t need to be downloaded at all, and the browser can do a quick check to see if a file has changed since the last time it was fetched, and not retrieve it if it hasn&#8217;t changed. And caches aren&#8217;t only in the browser: a caching proxy or CDN that&#8217;s close to the user can give strong speed improvements too, and serve files to more than one user, reducing your overall bandwidth bills.</p>
<p>But be careful not to rely on caching as a crutch: people always have to visit your site for the first time. Have a look at your revisit ratio and you&#8217;ll likely find that most people won&#8217;t have a primed cache. Caches are most useful for subsequent page loads within one user session.</p>
<p class="left-tick cache-tut"><strong>Use the Expires and Cache-Control max-age headers for all pages</strong>, both dynamically and statically created. The TTL (time to live) that you set will depend on how often the page updates, and how quickly after it does update you want those changes reflected. 5 to 20 minutes is often appropriate. Allowing pages to be cached won&#8217;t affect your analytics or ad impressions, as these are best recorded via JavaScript hits that are set to be uncacheable.</p>
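<p>To make the two headers concrete, here is a minimal sketch of computing them for a given TTL. The function name <code>cache_headers</code> is hypothetical; how the headers are attached to the response depends on your framework.</p>

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def cache_headers(ttl_seconds, now=None):
    """Return Expires and Cache-Control headers for a page cacheable for ttl_seconds."""
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(seconds=ttl_seconds)
    return {
        # max-age is relative to the response time; HTTP/1.1 caches prefer it.
        "Cache-Control": f"max-age={ttl_seconds}",
        # Expires is an absolute HTTP date, understood by HTTP/1.0 caches too.
        "Expires": format_datetime(expires, usegmt=True),
    }
```

<p>A ten minute TTL is then <code>cache_headers(600)</code>. Sending both headers keeps older caches covered while letting newer ones use the relative value.</p>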
<p class="left-tick last-modified"><strong>Make dynamic pages support the If-Modified-Since request header</strong>, and send the Last-Modified date header. This enables cache validation: when a browser goes to render a page that it has cached but that has gone past its TTL, it will send a GET request that is conditional on the document&#8217;s modification date. If your application doesn&#8217;t support that conditional request, it is obliged to send the full document, even if it hasn&#8217;t changed.</p>
<p>Practically, the last-modified date can be very efficiently determined for most dynamic pages by running a version of the main content query. For example, on a news site, use the date of the most recent news story in the relevant section as the last-modified date. Even if you have to run through all of the normal business logic required to generate the page, it still makes sense to short-circuit when the content hasn&#8217;t changed: send a 304 Not Modified response instead of the page&#8217;s HTML.</p>
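<p>The validation decision itself is small. Here is a sketch of it, assuming the application can supply the content&#8217;s last-modified time as a timezone-aware UTC datetime; the function name <code>conditional_response</code> is hypothetical and not tied to any framework.</p>

```python
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(if_modified_since, last_modified):
    """Return (status, headers) given the If-Modified-Since value (or None).

    last_modified must be a timezone-aware datetime in UTC.
    """
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # An unparseable date is ignored and the full page is sent.
        if since is not None and since.tzinfo is not None and since >= last_modified:
            return 304, headers  # Not Modified: send headers only, no body.
    return 200, headers
```

<p>On a 304 the body is omitted entirely, which is where the saving comes from.</p>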
<p class="left-tick"><strong>Use far future expiry headers on static resources</strong> (pictures, furniture graphics, CSS, and JavaScript). Setting an expiry date many years into the future means that the browser and proxies can aggressively cache the content, and won&#8217;t need to validate the cache.</p>
<p>Obviously, you will want the flexibility to update the page furniture over time. Do this by creating new versions with new filenames &#8212; build the date or version number into the file name. A beneficial side effect of this strategy is that you know that as soon as the referencing page is published, visitors will access the updated support files.</p>
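<p>One way to build the version into the file name is to insert it just before the extension. This is a sketch only: the function name <code>versioned_name</code> and the label format are assumptions, and any scheme works as long as a changed file gets a new name.</p>

```python
from pathlib import PurePosixPath


def versioned_name(path, version):
    """Insert a version label before the extension, keeping the directory part."""
    p = PurePosixPath(path)
    # css/site.css with version 20100818 becomes css/site.20100818.css
    return str(p.with_name(f"{p.stem}.{version}{p.suffix}"))
```

<p>The referencing page then links to the versioned name, so publishing the page is what switches visitors over to the new file.</p>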
<p class="left-tick cache-engine"><strong>Use the <a href="http://www.ircache.net/cgi-bin/cacheability.py">cacheability engine</a></strong> to test that you have caching and validation set up correctly.</p>
<h2>Allow progressive rendering</h2>
<p>As the browser downloads the page, give your readers something to see as soon as possible. People perceive time oddly: if they see incremental progress as the page downloads, they will often perceive this as loading much faster than a page that doesn&#8217;t show anything until it is 100% complete, even if the total download time is the same.</p>
<p>As browsers download a page, they will do their best to render the content as it comes in. But there are circumstances that make it difficult for the browser to do this.</p>
<p class="left-tick"><strong>Load CSS files at the top of the page</strong> &#8212; from within the head section. Browsers generally won&#8217;t render anything until the style sheet has been loaded, so as not to show a flash of unstyled content. The sooner the CSS is loaded, the better.</p>
<p class="left-tick">Conversely, it&#8217;s best to <strong>load JavaScript files at the bottom of the HTML</strong> &#8212; just before the closing body tag.  When a browser rendering thread comes across a JavaScript source file that has not yet been downloaded and interpreted, it must pause rendering until the JS load is complete. This is because JavaScript files may execute the document.write command, which inserts HTML at the source file&#8217;s position.</p>
<p>It&#8217;s far better to construct user interface JavaScript that can run once the page HTML has been delivered and rendered, and then the JS can make the appropriate changes to the page DOM. This also has a useful benefit in helping to keep the semantic HTML separate from the UI logic. Note that browsers tend to load JS in priority to images, so even if the JS is at the bottom of the page, it will be loaded in priority to images higher up in the source.</p>
<p class="left-tick">As a rule, <strong>don&#8217;t use document.write</strong>: it stalls the browser renderer until all JavaScript has been downloaded and evaluated, and generally benchmarks much lower than DOM HTML manipulation. Particularly don&#8217;t use an inline &lt;script&gt; tag with an external source to fetch ads: rendering will completely stall for each ad until the ad server can return the JavaScript (and this gets even worse when less reliable third party ad servers are used). This makes your page very reliant on the ad server: if it is slow or not available, your page will be collateral damage. Rather, load ads via iframes, and insert the iframe code itself at the end of the page with JavaScript: this gets them loaded asynchronously to the page.</p>
<p class="left-tick"><strong>Use different host names to increase the number of active download threads.</strong> Browsers will generally allocate 2 to 4 download threads per host. Serving static resources on different host names will encourage the browser to download more content at once. This can easily be set up with domain name wildcards and virtual hosts. Another benefit of using multiple hosts is that slow connections downloading large files won&#8217;t tie up your application server.</p>
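<p>If you spread files across several host names, each file should always map to the same host, otherwise the same image is cached once per host. Here is a minimal sketch of a deterministic mapping; the function name <code>asset_url</code> and the host names in the usage note are hypothetical.</p>

```python
import zlib


def asset_url(path, hosts):
    """Pick a host for the path deterministically, so each file has exactly one URL."""
    # crc32 is stable across runs and machines, unlike the built-in hash() on strings.
    index = zlib.crc32(path.encode("utf-8")) % len(hosts)
    return f"http://{hosts[index]}{path}"
```

<p>Called with a list such as <code>["static1.example.com", "static2.example.com"]</code>, it returns the same URL for a given path on every page and every server.</p>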
<p class="left-tick"><strong>Check the basics</strong>: make sure that all images have height and width attributes.</p>
<p class="left-tick amazon">For complex HTML that makes the browser chug, or for secondary data on the page that takes a long time on the backend to generate, consider <strong>using an AJAX method to load and display this content out of band</strong>, after the core page has been downloaded and rendered. If you use this technique, put a sized placeholder in the core HTML so that the page doesn&#8217;t abruptly re-layout when the new content is loaded in.</p>
<h2>Reduce overall download size</h2>
<p>Smaller files and a smaller overall page size mean that people see the content sooner, so they&#8217;re happier; they get off your network sooner, so your infrastructure can serve the next reader; and you get a lower bandwidth bill at the end of the month. Of course, you need to trade those benefits off against the file size needed to reach the required level of utility and aesthetic value for the site.</p>
<p class="left-tick deflate"><strong>Serve compressed HTML, JavaScript, and CSS files using on-the-fly gzip compression.</strong> This incurs a slightly higher server CPU load per page impression, but it gets people off your server and network much sooner. It allows you to serve more page impressions, and the user gets their content much faster.</p>
<p>Nowadays you can safely gzip all of these textual file types in modern browsers, but older browsers will need some handholding: some prefer to only have the HTML compressed. Use the Apache BrowserMatch directives to set this up.</p>
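<p>If you are not behind Apache, the same idea can be sketched in application code. This is a simplified illustration, not a replacement for mod_deflate: the function name <code>maybe_gzip</code> is hypothetical, and it ignores q-values in the Accept-Encoding header.</p>

```python
import gzip


def maybe_gzip(body, accept_encoding):
    """Compress a response body only if the client advertises gzip support."""
    # Simplification: a fuller parser would also honour q-values such as "gzip;q=0".
    encodings = [token.split(";")[0].strip().lower() for token in accept_encoding.split(",")]
    if "gzip" not in encodings:
        return body, {}
    # Vary tells shared caches to keep compressed and plain copies apart.
    return gzip.compress(body), {"Content-Encoding": "gzip", "Vary": "Accept-Encoding"}
```

<p>Repetitive markup compresses very well, so the CPU cost is usually repaid many times over in transfer time.</p>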
<p class="left-tick minify"><strong>Minify your JavaScript</strong> &#8212; but keep the original source around for editing and debugging. Minification, which effectively compresses the file by removing formatting (and potentially by shortening function and variable names), can bring files down to 60% of their original size. Add gzip compression to that as well and you&#8217;re looking at a serious size reduction.</p>
<p class="left-tick google">At the extreme end, <strong>minify HTML and CSS</strong> (remove HTML formatting, trim class names, omit unambiguous quotes around attributes, etc).</p>
<p class="left-tick"><strong>Check the basics</strong>: don&#8217;t put massive pictures inline; use thumbnails that link to the full-size images.</p>
<p class="left-tick">For image intensive sites, <strong>consider not loading images until they are scrolled into view</strong>. This saves bandwidth costs on content that is never seen.</p>
<h2>Speed up your backend application</h2>
<p>I won&#8217;t go into detail here, as the biggest improvements in load time tend to be in the client-side downloads rather than the back-end application. But here are a few ideas for speeding up the backend:</p>
<p class="left-tick cdn"><strong>Use a CDN</strong> if your business model can afford it: particularly for static files and for shareable (public, not personalised) pages.</p>
<p class="left-tick memcached"><strong>Use a distributed application object cache</strong> (like <a href="http://www.danga.com/memcached/">memcached</a>) to cache SQL / CPU intensive results. A distributed cache means that you maximise how much content you can keep in cache, and thus maximise your hit ratio: each server can dedicate its otherwise free RAM to cache, which the whole cluster can share.</p>
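<p>The usual pattern with an object cache is to look in the cache first and only run the expensive query on a miss. The sketch below shows that pattern with a plain in-process dictionary standing in for the shared cache; the class <code>TtlCache</code> and its method are hypothetical, and this is not the memcached client API.</p>

```python
import time


class TtlCache:
    """In-process stand-in for a shared cache; entries expire after ttl seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}

    def get_or_compute(self, key, ttl, compute):
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]  # Cache hit: skip the expensive query.
        value = compute()  # Cache miss: run the query once and remember the result.
        self._entries[key] = (self._clock() + ttl, value)
        return value
```

<p>With a distributed cache the shape of the code stays the same; only the storage behind the lookup and the store changes.</p>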
<p class="left-tick squid"><strong>Use <a href="http://www.squid-cache.org/">Squid</a> or another caching reverse proxy</strong> if you need to quickly mitigate traffic load without baking caching into your application.</p>
<p class="left-tick perlbal"><strong>Use a load balancer</strong> (like <a href="http://www.danga.com/perlbal/">Perlbal</a>) which distributes traffic according to which server has the least number of active connections.</p>
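<p>The selection rule itself is tiny. This sketch shows the least-connections choice in isolation; it is not how Perlbal is implemented, and the function name <code>pick_backend</code> and the server names in the usage note are hypothetical.</p>

```python
def pick_backend(active_connections):
    """Return the backend name with the fewest active connections."""
    # min() keeps the first of any tied backends, so dict order acts as the tie-break.
    return min(active_connections, key=active_connections.get)
```

<p>Given counts such as <code>{"app1": 12, "app2": 3, "app3": 7}</code>, the next request goes to <code>app2</code>. Compared with simple round robin, this keeps a server that is stuck on slow requests from being handed even more work.</p>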
<p>And design your apps not to use sessions, or put the session data into your distributed cache.</p>
<h2>Conclusion</h2>
<p>These guidelines should give you a head start in speed optimising your site. The benefits are many: happier users, increased serving capacity on your network, and lower bandwidth costs.</p>
<p class="contact">Please <a href="/contact">let me know</a> if you have any other optimization suggestions, what successes you&#8217;ve had optimizing your site, or if you have any suggestions for improving this guide.</p>
<div class="rhs">
<p class="smh">And the Melbourne sister site <a href="http://www.theage.com.au/">The Age</a></p>
<p class="prject">That was a great project to work on &#8212; we had a small cross-skill team with people from Development, Graphic Design, Ad Operations, Networking, and QA, and turned out the solution in a few days.</p>
<p class="firebug"><a href="http://www.getfirebug.com">Firebug web development extension for Firefox</a></p>
<p class="charles"><a href="http://www.xk72.com/charles/">Charles web debugging proxy</a></p>
<p class="wireshark"><a href="http://www.wireshark.org">Wireshark network protocol analyzer</a></p>
<p class="css-lib"><a href="http://rakaz.nl/item/make_your_pages_load_faster_by_combining_and_compressing_javascript_and_css_files">PHP script to combine files</a></p>
<p class="sprite"><a href="http://www.alistapart.com/articles/sprites">A list apart: CSS sprites</a><br />
<a href="http://www.smh.com.au/css/img/bg_icons.gif">SMH example sprite library</a></p>
<p class="cache-tut"><a href="http://www.mnot.net/cache_docs/">Caching tutorial</a></p>
<p class="last-modified"><a href="http://www.oreilly.com/catalog/jservlet/chapter/ch03.html#14260">Java servlet programming: last modified</a><br />
<a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html">Header field definitions</a></p>
<p class="cache-engine"><a href="http://www.ircache.net/cgi-bin/cacheability.py">Cacheability engine front-end</a><br />
<a href="http://www.mnot.net/cacheability/">Cacheability overview</a></p>
<p class="amazon"><a href="http://www.amazon.com">See example on Amazon.com homepage</a></p>
<p class="deflate"><a href="http://httpd.apache.org/docs/2.0/mod/mod_deflate.html">Apache module mod_deflate</a></p>
<p class="minify"><a href="http://fmarcia.info/jsmin/test.html">Online JavaScript minifier</a></p>
<p class="google"><a href="http://www.google.com">Check the source of Google&#8217;s homepage as an example</a></p>
<p class="cdn">CDN: <a href="http://en.wikipedia.org/wiki/Content_Delivery_Network">Content Delivery Network</a>, e.g. <a href="http://www.akamai.com/html/solutions/index.html">Akamai</a> or <a href="http://www.limelightnetworks.com/">Limelight</a></p>
<p class="memcached"><a href="http://www.danga.com/memcached/">memcached: a distributed memory object caching system</a></p>
<p class="squid"><a href="http://www.squid-cache.org/">Squid caching proxy</a></p>
<p class="perlbal"><a href="http://www.danga.com/perlbal/">Perlbal software load balancer</a></p>
<p class="contact"><a href="/contact">Contact the author (Jonathan Hedley)</a></p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://jonathanhedley.com/articles/2008/04/guide-to-website-speed-optimization/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
