Using XSL to convert docx to LaTeX
March 22nd, 2011
The sudden realization that the new MS Word format, .docx, is called Office Open XML for a reason made me spend a whole day trying to figure out how XSL transformations actually work, and whether they could be used to convert these new .docx files into something more edi(ta)ble.
It turned out that XSL transformations are in principle pretty simple, just as a friend of mine had told me. Here’s an example of how to convert a .docx file to LaTeX, in its crudest form:
First, you need to break open the .docx file. It is basically a simple zip archive, so an ‘unzip testdoc.docx’ should do the trick; you’ll end up with several files and subdirectories, of which only the directory called ‘word’ is needed for this test.
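If you would rather script this step than call unzip by hand, the following is a minimal sketch in Python using only the standard zipfile module. The function name extract_word_parts and the file names are my own illustrative choices, not part of any tool mentioned here.

```python
import zipfile
from pathlib import Path


def extract_word_parts(docx_path, dest_dir):
    """Extract only the members under 'word/' from a .docx (zip) archive.

    Returns the sorted list of extracted member names.
    The function name and arguments are illustrative assumptions.
    """
    dest = Path(dest_dir)
    extracted = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip directory entries and everything outside 'word/'.
            if name.startswith("word/") and not name.endswith("/"):
                archive.extract(name, str(dest))
                extracted.append(name)
    return sorted(extracted)
```

Calling extract_word_parts("testdoc.docx", ".") should leave a ‘word’ directory next to the archive, just like the unzip command, only without the other files.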
Second, here’s the XSL transformation to save in a file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
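xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<!-- Emit plain text, so that no XML declaration ends up in the LaTeX output. -->
<xsl:output method="text"/>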
<xsl:template match="/w:document">
\documentclass{article}
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="w:body">
\begin{document}
<xsl:apply-templates/>
\end{document}
</xsl:template>
<xsl:template match="w:p">
<xsl:apply-templates/><xsl:if test="position()!=last()"><xsl:text>
</xsl:text></xsl:if>
</xsl:template>
<xsl:template match="w:r">
<xsl:if test="w:footnoteReference"><xsl:text>\footnote{</xsl:text>
<xsl:call-template name="footnote">
<xsl:with-param name="fid"><xsl:value-of select="w:footnoteReference/@w:id"/></xsl:with-param>
</xsl:call-template>
<xsl:text>}</xsl:text>
</xsl:if>
<xsl:if test="w:rPr/w:b"><xsl:text>\textbf{</xsl:text></xsl:if>
<xsl:call-template name="pastb"/>
<xsl:if test="w:rPr/w:b"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>
<xsl:template name="pastb">
<xsl:if test="w:rPr/w:i"><xsl:text>\textit{</xsl:text></xsl:if>
<xsl:call-template name="pasti"/>
<xsl:if test="w:rPr/w:i"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>
<xsl:template name="pasti">
<xsl:apply-templates select="w:t"/>
</xsl:template>
<xsl:template name="footnote">
<xsl:param name="fid"/>
<xsl:apply-templates select="document('footnotes.xml')/w:footnotes/w:footnote[@w:id=$fid]"/>
</xsl:template>
<xsl:template match="//w:footnote">
<xsl:apply-templates select="w:p"/>
</xsl:template>
</xsl:stylesheet>
You can save that in a file called docxtolatex.xsl in the ‘word’ directory. Then, in that directory, run ‘xsltproc docxtolatex.xsl document.xml’, and your screen will fill with the document in LaTeX markup.
You’ll notice that this XSLT only converts bold, italics and footnotes. But then again, that is often all I need to convert…
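For readers who find XSLT hard to follow, the bold and italics part of the stylesheet can be restated as a small Python sketch using the standard xml.etree.ElementTree module. This is only an illustration of the same nesting logic (bold outside, italics inside); the function names and the sample paragraph are my own assumptions, and footnotes are left out.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def run_to_latex(run):
    """Render one w:r element, wrapping its text like the stylesheet does."""
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    # Italics are applied first, so that bold ends up as the outer command.
    if run.find("w:rPr/w:i", NS) is not None:
        text = "\\textit{" + text + "}"
    if run.find("w:rPr/w:b", NS) is not None:
        text = "\\textbf{" + text + "}"
    return text


def paragraph_to_latex(paragraph):
    """Concatenate the rendered runs of one w:p element."""
    return "".join(run_to_latex(r) for r in paragraph.findall("w:r", NS))


# A hand-written sample paragraph, not taken from a real document.
SAMPLE = (
    '<w:p xmlns:w="' + W + '">'
    "<w:r><w:t>plain </w:t></w:r>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>"
    "<w:r><w:t> and </w:t></w:r>"
    "<w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>both</w:t></w:r>"
    "</w:p>"
)

print(paragraph_to_latex(ET.fromstring(SAMPLE)))
```

Like the stylesheet, the sketch only checks whether the w:b and w:i elements are present; it does not look at their attributes.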
How to construct a collection of articles with LaTeX
December 20th, 2010
I needed to edit a collection of articles and turn it into a book. This use case is not covered by any of the standard LaTeX classes, so I looked for other options. The combine class seemed to provide what I needed, but in the end turned out to be too limiting. While seemingly offering everything one could ask for, it restricted what could be done with the document to such an extent that it was impractical to keep using it. I finally accepted this when I tried to create two indexes for the whole book. It just did not work.
More or less the same effect is achieved with the code snippet below. It uses ideas from combine.cls, but in a much simplified manner, and it is very simple to use. It creates a new environment, papers, inside which the individual papers are included. I have assumed that each paper is a complete document of its own; this means that each paper can have its own \documentclass and \usepackage commands, and each paper can be edited individually. To include the papers in the main document, use it like this:
\begin{papers}
\include{paper1}
\include{paper2}
\end{papers}
Where each of the papers is of the form:
\documentclass[languages,paper_sizes,etc]{article/scrartcl/whatever}
\usepackage{necessary}
\usepackage{cool}
\begin{document}
\title{A paper of the best practices}
\author{Johnny B. Good}
\maketitle
\section{Introduction}
In this paper \ldots
\end{document}
Note that the preambles of the included documents should be kept as simple as possible, because, for example, redefinitions of commands are not ignored. If the included documents need any special packages, these have to be loaded in the preamble of the main document.
The code below redefines \documentclass, \usepackage and the document environment so that they are ignored, and the \maketitle command so that the title of each paper is printed and added to the table of contents and to the marks for headers. How the marks are used is a matter of your page style.
% This code includes small pieces from combine.cls and some other sources.
% No guarantees of any kind given. Use at your own discretion.
% Feel free to modify and distribute in any way you see fit.
\long\def\symbolfootnote[#1]#2{\begingroup%
\def\thefootnote{\fnsymbol{footnote}}\footnote[#1]{#2}\endgroup}
\makeatletter
\newenvironment{papers}{\renewcommand{\documentclass}[2][]{}%
\newcommand{\@@title}{}%
\newcommand{\@@subtitle}{}%
\renewcommand{\usepackage}[2][]{}%
\renewenvironment{document}{\begingroup}{\endgroup}%
% \title, \author, \thanks and \maketitle already exist in the standard
% classes, so they are redefined here; \newcommand would raise an error.
\renewcommand{\title}[1]{\renewcommand{\@title}{##1}\begingroup%
\renewcommand{\thanks}[1]{}\protected@xdef\@@title{##1}%
\endgroup}%
% \subtitle has to be provided by the main class (the KOMA-Script classes do).
\renewcommand{\subtitle}[1]{\renewcommand{\@@subtitle}{##1}}%
\renewcommand{\author}[1]{\renewcommand{\@author}{##1}}%
\renewcommand{\thanks}[1]{\symbolfootnote[1]{##1}}%
\renewcommand{\maketitle}{%
\thispagestyle{empty}%
\vspace*{5\baselineskip}%
\begin{center}\sffamily
\LARGE\@title%
% \ifthenelse and \equal come from the ifthen package.
\ifthenelse{\equal{\@@subtitle}{}}{%empty subtitle
}{%not empty subtitle
\bigskip
\Large\@@subtitle
}
\bigskip
\large\@author%
\end{center}%
\markboth{\@@title}{\@author}%
\addcontentsline{toc}{chapter}{\MakeUppercase{\@author}\\\@@title}%
\setcounter{footnote}{0}\noindent%
\ignorespacesafterend}
}{
}
\makeatother
Happy LaTeXing!
LaTeX and math symbols in text fonts
August 10th, 2009
While doing the layout of a small magazine in Finnish, I’ve been using Unicode with LaTeX for a long time already. It is just so much easier to write everything in Unicode under Emacs and let LaTeX/TeX take care of the rest. Mostly this works quite well, but occasionally there are some surprising quirks. Like yesterday, when I was doing the layout for the next issue, due to appear in a few weeks. There was an article about archaeological dating methods, and a few paragraphs dealt with radiocarbon dating, which is characterized by the calibration process it requires and by the uncertainty in the results, usually expressed as something like 225 AD ± 75 years. Now, of course LaTeX has a symbol for this; you can easily get it using
$\pm$
The problem with this is that in the middle of text set in Garamond with old style figures, there suddenly appears a mathematical symbol in TeX’s own math font, which is naturally quite beautiful, but does not really look that nice within the surrounding Garamond. Especially since the Garamond I’m using does include a glyph of its own for the plus-minus symbol. Since I’m working with Unicode (or, to be more specific, text encoded as UTF-8), the natural choice was to find out how to enter the corresponding symbol in Emacs (‘_’ + ‘+’ in the input method latin-9-prefix).
Now, to make LaTeX handle UTF-8, I’ve been using the package inputenc like this:
\usepackage[utf8x]{inputenc}
I have read somewhere that one should actually use the option “utf8” instead of “utf8x”, as it is better supported or something, but in practice “utf8” never works well for me and always, even for simple texts, asks me to enter some kind of declarations for special characters and so on, so I’ve been sticking to “utf8x” this far. Now, one would expect that with this setup, when LaTeX encounters the UTF-8 encoded ± in the text, it would find the corresponding glyph in the font and use that. But no, that is not what happens; even with ± in the middle of the text, LaTeX still takes the glyph from the TeX math font. Why?
Well, that’s because of this code in the file uni-0.def:
\uc@dclc{177}{default}{\ensuremath{\pm}}%
As we see, it forces math mode on and thus ensures that this symbol is always taken from the math fonts, no matter whether the text font has it or not!
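In case the number in that declaration looks arbitrary: 177 is simply the decimal Unicode code point of the plus-minus sign (U+00B1). A quick check in Python, used here only for illustration and not part of the LaTeX setup:

```python
import unicodedata

ch = "\u00b1"  # the plus-minus sign

print(ord(ch))               # decimal code point, the number used in uni-0.def
print(unicodedata.name(ch))  # official Unicode character name
print(ch.encode("utf-8"))    # the two bytes that end up in a UTF-8 source file
```

This prints 177, PLUS-MINUS SIGN and b'\xc2\xb1'.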
Now, a remedy: use the package textcomp, which has the command \textpm. That picks the right glyph from the right font! This is quite stupid though, because the whole point of using Unicode is not having to use these LaTeX commands to get special characters.
And it remains to be seen whether the option “utf8” to inputenc would give better results in this case. Perhaps I’ll test that at some point.
Emacs, flyspell-mode and “centralized”
June 12th, 2009
It was a rather irritating fight today. It’s been a while since I last tried Emacs’ flyspell mode, which is supposed to check your writing on the fly, as the name implies. It works quite well, yes, but I soon got irritated by all the suggested corrections for “-ize” endings: “centralized”, for example, should supposedly be replaced by “centralised”, which looks sort of unnatural to me. I had to check the Oxford English Dictionary, which does not even recognize the form “centralise”, and this made me pretty irritated!
Google came to help, or so it seemed. I ended up looking at wordlist packages for Debian etc., but let’s cut the story short. The default spell checking program used by flyspell-mode is “ispell”, a venerable program that is functional but has been superseded by, for example, aspell, which turned out to have dozens of dictionaries available, several of them with -ize suffixes.
Now, the only question that remains is: why does Emacs Ispell-mode default to ispell instead of aspell?
A Very Small Example of Applicative Functors in Haskell
January 29th, 2009
This is to document a small, working example of how applicative functors can be used in Haskell.
import Control.Applicative

f1 :: Int -> Int -> Int
f1 x y = 2*x+y

main = do return $ show $ f1 <$> (Just 1) <*> (Just 2)
A very short explanation follows.
On line 1, the necessary base library module is imported.
On lines 3 and 4, a small function from two integers to one integer is defined.
On line 6, it is shown how applicative functors are used to apply the function to Maybe values.
Archives of European Sociology
May 9th, 2008
This time I’ll be writing something more connected to history than computers, but it still has quite a lot to do with modern technology.
If you do a Google search for the journal title “Archives of European Sociology”, you’ll get a long list of citations to a journal by that name, like “Brubaker, Rogers, Ethnicity without Groups, Archives of European Sociology, 18, 2 2002”. Now, my wife, who happens to be a historian, tried to find this journal, as she wanted to see an article published there. Based on the huge number of citations to the journal found by Google, she of course assumed that the journal was well known, widely distributed and, logically, available at the local university library. To her great surprise, she was not able to find the journal in any library database. The closest match was the Archives européennes de sociologie, an international publication that also had an English title (European Journal of Sociology) and a German title (Europäisches Archiv für Soziologie). That obviously could not be it, as the journal’s home page at the publisher’s site very clearly stated the title in all three languages.
But in the end, a comparison of the references to the mysterious journal with the table of contents data at the publisher’s site did show that this was, after all, the mysterious Archives of European Sociology. Why on earth was it always referred to under this name, when the publisher and the journal itself very clearly used the English title European Journal of Sociology? The thing remained a mystery until today, when she found one potential explanation for this misnomer:
More and more, scientific journals have been adopting the convention that the headers and footers of the pages include, in addition to the page number, the author’s name and a part of the title, a reference to the journal itself: the name of the journal, the year, volume and number of the current issue, and the pages covered by the article in question; most often this information appears in the footer of the first page of each article. This is a wonderful habit, as it saves the nerves of so many academics fervently copying articles and trying to sort the piles of copies later. The Archives européennes de sociologie had adopted this policy already in the early 1980s. And you know what? They did not want to print the whole name of the journal in the footer, as it was rather long and would have forced the footer onto a second line; instead, they used an abbreviation, also a venerable habit. The abbreviation was Arch. europ. sociol.
One can just imagine how an academic sorting his or her piles of photocopies finds this interesting article he or she remembered having somewhere, is convinced of its value, and decides to cite it in his or her next work. But where did the article come from? Luckily the necessary information is included in the photocopy itself: “Arch. europ. sociol.” Now, if your mother tongue is English, and you have this article written in English from a journal whose name is abbreviated thus, what is the logical English name that can be constructed from that? “Archives of European Sociology”.
The wonders of dual-core systems
March 12th, 2008
Just a quick note today. My personal way of working is close to some kind of multitasking. As I personally have only one processor (“the brain”), things do come out sequentially, but I am quite convinced that the brain actually runs many parallel threads at the same time. Of course, most of these are independent, autonomous processes connected with bodily functions etc., but some threads are obviously some kind of subconscious analytical processes, to which the conscious mind passes pressing problems to be processed without any external disturbances. Then at some point, when the conscious level of the brain has reached a point in its progress to which these subconscious processes have some relevance, they send an interrupt and demand attention.
In practice, this means that ideas about what to do with the data often appear in the middle of the writing process itself, and force me to divert my attention from producing as much text as possible to performing some new analysis on the data. With the traditional single-core systems this used to mean a halt in the writing. No matter how nice you tell your GRASS module to be, it still seems to bog down the system while doing something with your ~30M cell raster. With the modern dual-core processors this does not happen any more: the GRASS module takes everything the second core has to offer, while I can happily keep using the first one for all the other things.
This is especially nice on my work computer, as I run GRASS on Debian in a VirtualBox virtual machine. VirtualBox occupies the second core completely, while the rest is left to the actual OS (XP, in my case) to run my Emacs and other stuff. With a quad-core I would probably feel like wasting the cores, but for my purposes a dual-core system is very nice indeed.
pdfLaTeX and a0poster
February 22nd, 2008
Just a quick note to record something I just found out while planning my first poster ever. Of course the first choice, as always when doing layout, is to see whether it could be done with LaTeX. I just hate the idea of using some WYSIWYG horror, where everything will be inconsistent anyway, no matter how hard you try to Set Things Right.
The natural choice for a poster seems to be the a0poster class, which is designed to set LaTeX up properly for making these huge pages, especially regarding the font and paper sizes. To my surprise, the paper size did not work with pdfLaTeX; instead, the result was big text on an A4 page. From some source (lost the site already) I found out that the path LaTeX -> PS -> PDF should supposedly work, but I’d rather not go there, as things are complicated enough when forced to work with XP.
Luckily, the geometry package saved me: just load geometry at the beginning of your document with the option a0paper, and the resulting document is A0. Voilà!
XMonad and Gnome
February 1st, 2008
This has nothing to do with the theme of the blog, but I just happened to find a wonderful window manager for my laptop. I’m generally using Gnome, since I like the looks, but I’ve never really liked Metacity, the default window manager in Gnome. On the other hand, I’ve lately been studying Haskell, a functional programming language. A wonderfully weird experience.
There just happens to be this new window manager written completely in Haskell, XMonad, which is a so-called tiling window manager: no empty spaces on your screen any more. I just had to try it, and believe me, it is very nice to use indeed. I wanted to integrate it with my Gnome system, and the panels especially, but that was not so easy. First I had to get a very recent version of XMonad (happily there are Debian packages, but not in the official repo), and then I had to figure out how to configure XMonad so that it
- ignores the gnome panels on my screen;
- leaves empty space for the panels.
The instructions on the XMonad site were for the old versions and did not completely work. The installation goes according to the instructions on that page, but what the page lacks is a complete, working example of the configuration. Here’s one:
You’ll have to create the file
~/.xmonad/xmonad.hs
to hold your configuration. Sample content for this file:
import XMonad
main = xmonad $ defaultConfig
    { defaultGaps = [(24,24,0,0)],
      manageHook = composeAll
        [ className =? "MPlayer" --> doFloat
        , className =? "Gimp" --> doFloat
        , resource =? "gnome-panel" --> doIgnore ]
    }
That gives you an empty space of 24 pixels at the top and bottom of the screen, and ignores gnome-panel. In addition, it lets Gimp and MPlayer float above other windows. With some fiddling with the session setup (see the link to the instructions) and this file in place, everything works fine, at least with XMonad 0.6. I haven’t tried the 0.5 series, so I cannot say whether this works with it, and the versions before 0.5 use a different kind of configuration scheme, so this does not work for those.
Update on 2008/05/01:
There is a page in the HaskellWiki on XMonad and Gnome, which contains all the information on this page plus much more. Go there!
Word to LaTeX
January 30th, 2008
For years now, I’ve been using a venerable old tool, rtf2latex, to convert documents from MS Word to LaTeX. For my purposes it has served well, the purpose being mostly to transform submitted articles into something that can then be made to conform to my LaTeX style for the journal whose layout I’m doing. In practice the needs are: keep the footnotes intact, and let italics, bold and underline survive the translation. Nothing else is needed, as I trust LaTeX to do the rest.
I’ve felt strangely uneasy about using rtf2latex lately, though. The fact that it is no longer available in Debian has made me look for alternatives. Also, the Word documents I received still had to be translated from .doc to RTF first. wvWare seemed the proper alternative, but it does not work with footnotes at all, and its web page says that its use for this purpose is deprecated in favour of Abiword.
Abiword I use occasionally; it is your typical Gnome program. Not too complicated, works well, and is nice to look at. But I was never able to get this conversion function working until today, when I realised that I also need to install the abiword-plugins… how stupid of me. Now the conversion from MS .doc to LaTeX works well, although the resulting document is slightly too fancy for me. I’d be happy with something that preserves only the logic of the markup and discards all of the funny spaces that are used to make it look like a Word document. (Why on earth would anyone want that?)
But I guess I can finally stop worrying about accidentally removing rtf2latex from my system. A replacement has been found! And although from the web page and release history it might seem that Abiword is a dead project, the traffic on the development mailing list demonstrates that the project is very much alive. I guess we will have version 2.6 someday, not that there’s really anything wrong with 2.4.6.