Jonathan Hedley

Review: Programming Collective Intelligence


Programming Collective Intelligence is a book about applying data mining techniques to analyse collections of data. There is submerged information in eBay prices, in Facebook profile networks, in collections of movie reviews, in news sites, in the stock market; this book by Toby Segaran shows ways to extract, visualise, understand, and predict that information.

Each chapter explains and explores a different data mining algorithm and builds up a working example in Python, while presenting different methods and parameters of the implementation. I hadn’t really worked with Python before, but found the code easy to follow, and picked up some interesting Python idioms that I hadn’t seen in other languages. Chapters end with a set of exercises that build your understanding.

As you follow the examples you build up a reasonably generic code base that lets you swap different implementations in and out, and reuse earlier code in new applications.

The examples draw on live data from the web, from sites like eBay, Facebook, and Yahoo Finance, which makes the book more interesting and the results more visceral than other books on the subject that rely on contrived or obscure examples. Even though the examples have a strong web (or web 2.0) focus, the methods and the understanding are useful for a whole range of applications.

Some of the topics covered:

  • Bayesian classifiers to detect spam, or to file news articles into site sections
  • Hierarchical and k-means clustering to discover groups of similar items in massive sets
  • Euclidean distance, Pearson correlation coefficient, Tanimoto coefficient: ways to measure the distance (or difference) between items
  • Neural networks to predict user behaviour and improve search result ordering
  • Optimisation methods like hill climbing, simulated annealing, and genetic algorithms
  • Non-negative matrix factorization
  • Support vector machines and kernel methods to go where linear regression can’t
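
To make the distance measures in that list a little more concrete, here is a minimal Python sketch of the three of them. It is an illustration only, not code from the book: the function names are assumptions, and the Tanimoto version shown is the simple set-based form (shared items divided by all items).

```python
from math import sqrt


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pearson_correlation(a, b):
    """Pearson correlation of two equal-length numeric vectors.

    Returns a value in [-1, 1]; 0.0 is returned when either vector
    has no variance, since the correlation is undefined there.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    spread_a = sum((x - mean_a) ** 2 for x in a)
    spread_b = sum((y - mean_b) ** 2 for y in b)
    denominator = sqrt(spread_a * spread_b)
    return covariance / denominator if denominator else 0.0


def tanimoto_coefficient(a, b):
    """Set-based Tanimoto coefficient: |A and B| / |A or B|."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(euclidean_distance([0, 0], [3, 4]))
    print(pearson_correlation([1, 2, 3], [2, 4, 6]))
    print(tanimoto_coefficient(["a", "b", "c"], ["b", "c", "d"]))
```

Note that Euclidean distance grows as items get less alike, while the Pearson and Tanimoto scores grow as items get more alike, so they need to be inverted before being used as a distance.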

I found it exciting to read; it’s one of those books that give you a whole bunch of new ideas for things to build as you read it. The presentation is very good: no background is assumed, and it doesn’t talk down to those more experienced.

Recommended.

Copyright © 2009 Jonathan Hedley