
As I’m currently forced to work on a computer running Windows XP at a university, I have started to explore what the commercial programs might offer as replacements for the open source tools I’ve learned to use on Linux over the past years. Last week and today I’ve been studying how to do some cluster analysis with the software available on the university network. I had already installed my database on a PostgreSQL server running on my work computer (I can’t be bothered to even try things out with Access anymore, even though the current versions are probably much better than the horrible Access 97), so an important feature was the ability to access views and tables on the server.

First I tried Statistica, version 7. The data import worked well through the ODBC driver provided by PostgreSQL, and the user interface was surprisingly nice; not immediately accessible, but this seems to be a powerful tool. I never got to the clustering part, though, as this morning Statistica started to complain that it has expired and that I need to fill in a new date code. That’s that, then. I’ll file a report with support unless it works again tomorrow.

So today I turned to SPSS (version 14). Data import was less intuitive, but worked well. The analysis methods are not as easy to use, and I was quite surprised when a simple hierarchical clustering of ca. 500 measurements, each with ca. 30 variables, locked the computer for almost 30 minutes. Seven years ago I wrote a C++ program for my old, old laptop for a similar purpose, with 3000 measurements of 10 variables each, and it took 20 minutes. And that was my first ever real computer program, so I had expected somewhat better performance from SPSS.

I decided to test the good old tools, and installed R on the XP machine. I had some trouble importing the data from PostgreSQL, until I realised that I just have to use the same ODBC interface as with the other programs. After that, everything went quite quickly: a hierarchical, agglomerative clustering with agnes took 4 seconds, but nicer results were produced with diana, a divisive method, in about the same time. Both are from the package cluster. After this, I won’t be going back to SPSS, but I might still give Statistica a try, if it agrees to run one of these days.
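For anyone wondering what "hierarchical, agglomerative" means in practice, here is a minimal sketch in Python. It is not what agnes does internally, and it is certainly not how I ran my analysis; it is just the textbook idea in its most naive form: start with every measurement in its own cluster and keep merging the two closest clusters. The function names, the choice of average linkage and the toy data are all my own assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two measurements (lists of numbers)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(points, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters):
    """Naive agglomerative clustering: merge the closest pair until
    only n_clusters remain. Clusters are lists of row indices."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Compare every pair of clusters and remember the closest one.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical toy data: two clearly separated groups of five measurements.
data = [[0.1 * i, 0.2 * i, 0.0] for i in range(5)]
data += [[10 + 0.1 * i, 10 + 0.2 * i, 5.0] for i in range(5)]

groups = agglomerate(data, 2)
print(sorted(sorted(g) for g in groups))
```

This version recomputes every distance on every round, so it gets slow very quickly as the number of measurements grows. Real implementations cache the distance matrix and update it after each merge.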

Based on these quick tests, R is a much more efficient tool. The learning curve is probably considerably steeper, as using R is like shell programming, but once you learn how to use it, there’s no limit to what you can do. But don’t take my opinions at face value: I don’t really know how to use SPSS, so anyone who really knows his/her way around it is probably a better source. This is just a blog, anyway…