Using XSL to convert docx to LaTeX
March 22nd, 2011 § 6 Comments
The sudden realization that the new MS Word format, .docx, is called Office Open XML for a reason made me spend a whole day trying to figure out how XSL transformations actually work and whether they could be used to convert these new .docx files into something more edi(ta)ble.
It turned out that XSL transformations are, in principle, pretty simple to write, just as a friend had told me. Here’s an example of how to convert a .docx file to LaTeX, in its crudest form:
First, you need to break open the .docx file. It is basically a plain zip archive, so ‘unzip testdoc.docx’ should do the trick; you’ll end up with several files and subdirectories, of which only the directory called ‘word’ is needed for this test.
Second, here’s the XSL transformation to save in a file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
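xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<!-- Emit plain text, so the output carries no XML declaration and no entity escaping -->
<xsl:output method="text"/>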
<xsl:template match="/w:document">
\documentclass{article}
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="w:body">
\begin{document}
<xsl:apply-templates/>
\end{document}
</xsl:template>
<xsl:template match="w:p">
<xsl:apply-templates/><xsl:if test="position()!=last()"><xsl:text>
</xsl:text></xsl:if>
</xsl:template>
<xsl:template match="w:r">
<xsl:if test="w:footnoteReference"><xsl:text>\footnote{</xsl:text>
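<!-- Pass the id of this run's own footnote reference; //@w:id always picks the first w:id attribute in the whole document -->
<xsl:call-template name="footnote">
<xsl:with-param name="fid" select="w:footnoteReference/@w:id"/>
</xsl:call-template>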
<xsl:text>}</xsl:text>
</xsl:if>
<xsl:if test="w:rPr/w:b"><xsl:text>\textbf{</xsl:text></xsl:if>
<xsl:call-template name="pastb"/>
<xsl:if test="w:rPr/w:b"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>
<xsl:template name="pastb">
<xsl:if test="w:rPr/w:i"><xsl:text>\textit{</xsl:text></xsl:if>
<xsl:call-template name="pasti"/>
<xsl:if test="w:rPr/w:i"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>
<xsl:template name="pasti">
<xsl:apply-templates select="w:t"/>
</xsl:template>
<xsl:template name="footnote">
<xsl:param name="fid"/>
<xsl:apply-templates select="document('footnotes.xml')/w:footnotes/w:footnote[@w:id=$fid]"/>
</xsl:template>
<xsl:template match="//w:footnote">
<xsl:apply-templates select="w:p"/>
</xsl:template>
</xsl:stylesheet>
You can save that in a file called docxtolatex.xsl in the ‘word’ directory; it has to sit next to footnotes.xml, because document('footnotes.xml') is resolved relative to the stylesheet. Then, in that directory, run ‘xsltproc docxtolatex.xsl document.xml’, and the document will be printed to your screen in LaTeX markup.
You’ll notice that this XSLT only converts bold, italics and footnotes. But then again, that’s often all I need to convert…
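If one more kind of formatting is needed, the same nesting trick can be extended by one link. Below is a minimal sketch for underlining, not a tested addition: the template name pastu is made up for this example, and it assumes that underlining appears as a w:u element inside w:rPr, with w:val="none" meaning no underline.

```xml
<!-- Replaces the existing "pasti" template; goes inside the same xsl:stylesheet
     element, where the xsl and w prefixes are already declared. -->
<xsl:template name="pasti">
<!-- Assumption: underline is w:rPr/w:u, and w:val="none" switches it off -->
<xsl:if test="w:rPr/w:u[not(@w:val='none')]"><xsl:text>\underline{</xsl:text></xsl:if>
<xsl:call-template name="pastu"/>
<xsl:if test="w:rPr/w:u[not(@w:val='none')]"><xsl:text>}</xsl:text></xsl:if>
</xsl:template>
<!-- "pastu" is a hypothetical name: it is now the last link in the chain and emits the run's text -->
<xsl:template name="pastu">
<xsl:apply-templates select="w:t"/>
</xsl:template>
```

Since xsl:call-template does not change the current node, every link in the chain still tests the same w:r element; any further property would get its own link in the same way.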
Thanks for posting this great resource! It’s super-helpful and I’ve had a good deal of success transforming .docx files into LaTeX. The only snag I’ve encountered, however, is that this series of XSLT transformations populates the content of the first footnote into all subsequent footnotes. That is, if the content of the first footnote is “Jones, 2008” and the content of the second one is “Smith, 2008”, the content of the second footnote in the LaTeX output is “Jones, 2008”. Have you encountered this problem as well? Any chance you’ve stumbled upon a fix?
Thanks very much again! You’ve got a great blog with lots of super-useful tips.
Vivek
Hi, I encountered the same issue. To solve it, replace this part of the code:
<xsl:if test="w:footnoteReference" ecc…
…
with the following:
\footnote{
}
I.e., the xsl:if test is not necessary. Having solved this issue, I have also posted the solution here: http://forum.html.it/forum/showthread.php?s=&postid=13534274#post13534274 (for Italian users).
Oops… the code didn’t appear correctly; maybe it’s too long. See the link for the solution.
Your XSLT works very well. However, I need to convert more complicated Word documents, containing equations, figures and tables as well. Could you point me to references and/or resources to learn more about how to modify your XSLT? Also, any thoughts you might have on that would be greatly appreciated.
Thanks in advance.
Andrea G.
Thanks for your comment, and I’m happy that you found the code useful. For me, it was just a quick test, and I haven’t really worked on it since. If I ever continue to improve this, I’d start with the docx file format reference that you can get from Microsoft’s website; that and the xsltproc documentation were all I had available.
Since I’m currently quite busy with other things, I’m not really going to work on this right now.