Statistics, SPSS and R

As I’m currently forced to work on a computer running Windows XP at a university, I have started to explore what the commercial programs might offer as replacements for the open source tools I’ve learned to use over the past years on Linux. Last week and today I’ve been studying how to do some cluster analysis with the software available in the university network. I had already installed my database on a PostgreSQL server running on my work computer (I can’t be bothered to even try things out with Access anymore, even though the current versions are probably much better than the horrible Access 97), so an important feature was the ability to access views and tables on the server.

First I tried Statistica, version 7. The data import worked well through the ODBC driver provided by PostgreSQL, and the user interface was surprisingly nice; not immediately accessible, but this seems to be a powerful tool. I never got to the clustering part, though, as this morning Statistica started to complain that it had expired and that I needed to fill in a new date code. That’s that, then; I’ll file a report with support unless it works again tomorrow.

So today I turned to SPSS (version 14). Data import was less intuitive, but worked well. The analysis methods are not as easy to use, and I was quite surprised when a simple hierarchical clustering of ca. 500 measurements, each with ca. 30 variables, locked the computer for almost 30 minutes. Seven years ago I wrote a C++ program for my old, old laptop for a similar purpose, with 3000 measurements of 10 variables each, and it took 20 minutes. And that was my first ever real computer program, so I had expected somewhat better performance from SPSS.

I decided to test the good old tools, and installed R on the XP machine. I had some trouble importing the data from PostgreSQL, until I realised that I just had to use the same ODBC interface as with the other programs. After that, everything went quite quickly: a hierarchical, agglomerative clustering with agnes took 4 seconds, but nicer results were produced with diana, the divisive counterpart, in about the same time. Both are from the package cluster. After this, I won’t be going back to SPSS, but I might still give Statistica a try, if it agrees to run one of these days.
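For readers who have not met agglomerative clustering before, here is a minimal sketch of the idea. It is written in plain Python rather than R, it is not the code of agnes itself, and the function name and the toy data are made up for illustration. It uses average linkage, which is the default method in agnes, and a naive search that is far slower than what the cluster package does.

```python
from itertools import combinations
from math import dist


def average_linkage(points, n_clusters):
    """Naive agglomerative clustering with average linkage (illustrative only)."""
    # Start with every measurement in its own cluster (lists of row indices).
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise Euclidean distance between the members of two clusters.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[x].extend(clusters[y])
        del clusters[y]  # safe: combinations guarantees x < y
    return [sorted(c) for c in clusters]


# Hypothetical toy data: five measurements of two variables, in two obvious groups.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
groups = average_linkage(data, 2)
print(groups)
```

A divisive method such as diana works the other way round: it starts with all measurements in one cluster and splits repeatedly.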

Based on these quick tests, R is a much more efficient tool. The learning curve is probably much steeper, as using R is like shell programming, but once you learn how to use it, there’s no limit to what you can do. But don’t take my opinions at face value: I don’t really know how to use SPSS, so anyone who really knows his/her way around it is probably a better source. This is just a blog, anyway…

2 Responses to “Statistics, SPSS and R”

  1. Ivan Says:

    Google found your blog for me: I was looking for ‘spss r blog’. I have a couple of quick points:

    1. I find this post extremely reassuring.

    I am in the middle of automating some data processing, using SPSS and its Python interface. Although the Python interface is much better than using the GUI, the whole thing is pretty unpleasant (not because of the Python, I love Python).

    I’ve known about R for a long time but never really picked it up. In the last week or so it’s never been far from my thoughts. I don’t think it would be a good idea to switch mid-project, but as soon as I’ve finished this job I’m going to do it all over again with R, just so I’m ready for next time.

    One problem might be that the client wants the result as an SPSS output file (*.spo).

    2. Nice blog. Is it new? Keep it up!

    2a. I notice you have an empty about page.

    2b. You have Ken MacLeod’s blog on your blogroll! I love his books, especially the Fall Revolution and the Engines of Light series.

    Best wishes

    Ivan

  2. Ralph Says:

    Good to see that R can hold its own against some of the commercial software out there. The main reason that I moved from S-plus to R was that, at that time, S-plus hogged memory and would lock the machine, causing a crash if you had the cheek to want to use another application at the same time. So much for multitasking!

    The R system is more flexible and powerful once you get over the initial learning curve and are happy to use a command line system. The main complaint is about memory, as the data is stored in RAM rather than on disk, but sometimes you have to wonder what people are actually going to find in these large datasets and how many spurious relationships will get identified due to the sheer volume of tests.

    SAS and SPSS are probably better systems if you are processing large volumes of data for standard analysis, but the graphics in these systems are nothing to write home about, in particular when compared to R.

    You have some interesting posts on your blog so keep it up.

    Best wishes
    Ralph
