Statistics, SPSS and R

October 29, 2007

As I’m currently forced to work on a computer running Windows XP at a university, I have started to explore what the commercial programs might offer as replacements for the open source tools I’ve learned to use on Linux over the past years. Last week and today I’ve been studying how to do some cluster analysis with the software available in the university network. I had already installed my database on a PostgreSQL server running on my work computer (I can’t be bothered to even try things out with Access anymore, even though the current versions are probably much better than the horrible Access 97), so an important feature was the ability to access views and tables on the server.

First I tried Statistica, version 7. The data import worked well through the ODBC driver provided by PostgreSQL, and the user interface was surprisingly nice; not immediately accessible, but this seems to be a powerful tool. I never got to the clustering part, though, as this morning Statistica started to complain that it had expired and that I needed to fill in a new date code. That’s that then; I’ll file a report with support unless it works again tomorrow.

So today I turned to SPSS (version 14). Data import was less intuitive, but worked well. The analysis methods are not as easy to use, and I was quite surprised when a simple hierarchical clustering of ca. 500 measurements, each with ca. 30 variables, locked the computer for almost 30 minutes. Seven years ago I wrote a C++ program for my old, old laptop for a similar purpose, with 3000 measurements of 10 variables each, and it took 20 minutes. And that was my first ever real computer program, so I had expected somewhat better performance.

I decided to test the good old tools, and installed R on the XP machine. I had some trouble importing the data from PostgreSQL, until I realised that I just had to use the same ODBC interface as with the other programs. After that, everything went quite quickly: a hierarchical, agglomerative clustering with agnes took 4 seconds, but nicer results were produced with diana (in about the same time), both from the package cluster. After this, I won’t be going back to SPSS, but I might still give Statistica a try, if it agrees to run one of these days.
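
For readers who have not met the term, here is a minimal sketch of what agglomerative clustering does. It is written in Python rather than R, and it is not the code behind agnes: the function name, the choice of average linkage and the sample points are all assumptions made for illustration. Every point starts as its own cluster, and the two closest clusters are merged until the wanted number remains.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def agglomerative(points, n_clusters):
    """Naive agglomerative clustering with average linkage (illustrative only).

    Returns a list of clusters, each a list of indices into `points`.
    """
    # Start with every point in a cluster of its own.
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Average pairwise distance between the members of two clusters.
        total = sum(dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest to each other ...
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        # ... and merge them (x < y, so deleting y keeps x valid).
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical sample data: two tight pairs and one outlier.
sample = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(sample, 3))  # → [[0, 1], [2, 3], [4]]
```

This naive version recomputes every distance on every round, so it is far slower than a real implementation; it only shows the idea.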

Based on these quick tests, R is a much more efficient tool. The learning curve is probably quite a bit steeper, as using R is like shell programming, but once you learn how to use it, there’s no limit to what you can do. But don’t take my opinions at face value: I don’t really know how to use SPSS, so anyone who really knows his/her way around it is probably a better source. This is just a blog, anyway…


Annotating pdf

October 24, 2007

For a long time I’ve been looking for a tool to annotate pdf documents. The electronic versions of academic journals usually provide the articles in pdf format, and it would be nice to work with them as with the printed versions: underlining, making notes in the margin etc. So far, this has required the purchase of a full version of Adobe Reader: hardly anyone distributes their pdf files with the “comment” property enabled, if for no other reason than that enabling this property is not trivial.

A wonderful alternative seems to be PDF-XChange Viewer. It is as free as the Adobe product, but seems to be faster and lighter on the computer. The display quality on the XP machine I use at work is as good, and the biggest bonus is that you can add your own annotations to the pdf file. These are saved with the file, and can be seen with any other pdf viewer, it seems.

You never know about the policies of individual companies, but it seems that Tracker Software is trying to do the same thing as Adobe, because the other versions of the program have more capabilities; they just offer more for nothing.

Downsides? Of course, there is no Linux version available, which is a pity. But for the time being, as I have to use XP at work anyway, I’ll be doing my reading (and annotating!) with this program.

Update on 2008-06-20:

I just tested PDF-XChange Viewer on Wine after having updated my Debian testing distribution. It works well now, so even though there is no Linux version yet, the program can be used under Wine. No special setup was needed in my case.


Converting LaTeX to OpenOffice

October 23, 2007

I have been a happy user of LaTeX for a few years, but a recurrent problem has been sharing my documents with other people. In the early days I was a happy latex2rtf user, and I even contributed some minor details to its development. Quite soon it became apparent, however, that the only reasonable solution for exporting my documents is something that acts like a TeX processor.

In the cases where I need to export my texts to formats other than PDF for printing, the layout is of secondary importance. Of major importance, however, is that a certain “academic” structure gets through as well as possible:

  • The footnotes have to be footnotes in the end product as well.
  • The bibliography must come through as produced by jurabib.

Everything else is secondary, as the articles will be laid out by the journals anyway, but they want the reference system to work. Ever heard of a journal in the humanities giving out its LaTeX styles? Me neither. This means that I have to try to reach a citation format which fulfils the requirements of the journals by tweaking the options of jurabib.

So far, the only solution which actually seems to work is TeX4ht. This is a program that works by running a TeX processor, and it has quite a few output formats. The only bad thing is that the documentation is quite lousy, and most of the commands are not described at all.

But I get pretty ok OpenOffice output with oolatex, although ooxelatex is better if you want to use languages other than pretty plain English. Classical Greek works fine, though… I had some trouble getting this to function; for a long while, in fact. It seems that TeX4ht did not like the hyperref package at all; once I dropped that from the preamble, everything went nicely and smoothly. The problem seemed to be related to jurabib, somehow. I should probably file a bug report some day.

A sad thing is that jurabib is unmaintained. Jens Berger, the guy who developed the package, cannot devote any more time to it, so the package is frozen until someone volunteers to take it over. I wish I had the time… A replacement, also pointed to by Jens, is biblatex. It seems to be quite a potent tool for the bibliographic needs of the humanities, but it is still beta-level and not officially released, so you can’t find it in any of the TeX distributions yet. It seems to include many of the good features of jurabib, like fields for gender, original languages and translations, all very necessary for a historian. To the surprise of many, the hegemony of English is far from absolute in, for example, Classical Studies. French, German, Italian, even Spanish are still major languages, and a researcher unable to read any of these is bound to miss major contributions in the field; therefore, support for the original language information of publications is important for people working in these fields.

But none of these really helps in getting over the main problem in humanities word processing with LaTeX: the incredible backwardness of BibTeX. In a world where almost everything is beginning to support Unicode, BibTeX is happy only with 7-bit ASCII. As the only decent BibTeX file editor is Emacs (IMHO), this is a major pain-in-the-ass. Who wants to keep up a bibliography when you cannot write Köln but have to type in K\"oln? Neither handy nor readable.
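
The escaping can at least be automated. Below is a minimal sketch in Python, assuming you keep the bibliography in Unicode and convert it just before running BibTeX. The function name and the accent table are made up for this illustration, the table covers only a handful of accents, and the braces around each escape are a choice of this sketch rather than something BibTeX demands.

```python
import unicodedata

# Combining marks mapped to LaTeX accent commands (small and incomplete).
ACCENTS = {
    "\u0308": '\\"',  # diaeresis:  ö -> \"o
    "\u0301": "\\'",  # acute:      é -> \'e
    "\u0300": "\\`",  # grave:      è -> \`e
    "\u0302": "\\^",  # circumflex: ê -> \^e
    "\u0303": "\\~",  # tilde:      ñ -> \~n
}


def to_bibtex_ascii(text):
    """Replace accented letters with braced LaTeX accent commands.

    Characters that are not in the ACCENTS table are passed through as-is.
    """
    out = []
    # Compose first, so that "o" + combining diaeresis is seen as one letter.
    for ch in unicodedata.normalize("NFC", text):
        parts = unicodedata.normalize("NFD", ch)
        # Handle only the simple case: one ASCII base letter plus one known mark.
        if len(parts) == 2 and parts[0].isascii() and parts[1] in ACCENTS:
            out.append("{" + ACCENTS[parts[1]] + parts[0] + "}")
        else:
            out.append(ch)
    return "".join(out)


print(to_bibtex_ascii("Köln"))  # → K{\"o}ln
```

Anything outside the table, Greek letters for instance, is left untouched, so this is no substitute for real Unicode support.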

I’ve actually been running BibTeX on Unicode files happily for some time now. You just have to be very, very careful with the entry keys; it is better to use plain ASCII in those. This is not supposed to work, but luckily it does. There are rumours (about five years old or something) about a new version of BibTeX, which might address some of the problems. Who knows, perhaps in ten or twenty years we’ll see the next version. I just think that unless it appears soon, there won’t be many who care about it anymore.
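
Checking the entry keys by eye gets tedious, so here is a rough sketch of an automatic check, in Python. The regular expression is a simplification that does not cover the whole BibTeX syntax, and the function name and the sample entries are invented for the example.

```python
import re

# Matches "@type{key," and captures the key. A simplification: it ignores
# @string/@preamble blocks and entries that use parentheses instead of braces.
ENTRY_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def non_ascii_keys(bib_text):
    """Return the entry keys that contain anything other than plain ASCII."""
    return [key for key in ENTRY_KEY.findall(bib_text) if not key.isascii()]


# Hypothetical entries, for illustration only.
sample = """
@book{Müller2001, author = {Müller, Hans}, title = {Köln}}
@article{smith1999, author = {Smith, J.}, title = {Something}}
"""
print(non_ascii_keys(sample))  # → ['Müller2001']
```

Only the keys are checked; the field values may contain whatever the rest of the toolchain tolerates.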