Monday 19 October 2015

Extracting WordPress posts from the XML export file

A variation of the MS Excel spreadsheet, schGoogleToolBox.xls, that I wrote for extracting posts from the Blogger XML export file. This version extracts posts from the WordPress XML export file, which is slightly more complicated because it uses namespaces and CDATA sections.

The objective was to extract the individual posts from the XML file and then import them into Scrivener. Unfortunately, WordPress does strange things with HTML paragraph marks. Also, because I am working on MS Windows, the HTML that I wrap around the exported data to create a valid HTML file resulted in a mix of DOS and Unix end-of-line markers. The WordPress post in the CDATA section contains only line feed (LF) characters, not carriage return plus line feed (CRLF), and no paragraph <p> markers, so the files as imported into Scrivener lost their paragraph spacing.

To resolve this I was going to open and convert each file in UltraEdit Studio, but that seemed time-consuming for several files. I therefore modified the VBA code to replace each double LF with a double CRLF and a double <br> code. This produced acceptable paragraph spacing when the HTML files are imported into Scrivener and converted to plain text.
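
The replacement itself is only a couple of lines. The sketch below is in Python rather than the VBA used in the spreadsheet, and it is my illustration, not the spreadsheet's code: the function name to_html_paragraphs is hypothetical, and the order of the <br> codes and the CRLF pairs in the output is an assumption.

```python
def to_html_paragraphs(body: str) -> str:
    """Turn blank-line paragraph breaks (LF LF) into <br><br> plus CRLF CRLF."""
    # Normalise any existing CRLF to LF first, so carriage returns are not doubled.
    text = body.replace("\r\n", "\n")
    # A double LF marks a paragraph break in the exported post body.
    # Assumption: the <br> codes come first, followed by the two CRLF pairs.
    return text.replace("\n\n", "<br><br>\r\n\r\n")


if __name__ == "__main__":
    # repr() makes the end-of-line characters visible.
    print(repr(to_html_paragraphs("First paragraph\n\nSecond paragraph")))
```

Single LFs inside a paragraph are left alone in this sketch, so only the paragraph breaks are converted.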

The spreadsheet also retrieves categories and tags, classes them as keywords, and counts how often each is assigned.
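
As a rough illustration of the keyword count, here is a Python sketch (again, not the spreadsheet's VBA). It assumes the usual WordPress export layout, where each post <item> carries <category> elements with domain="category" or domain="post_tag"; the sample XML and the function name keyword_counts are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Made-up sample in the shape of a WordPress export, trimmed to the relevant parts.
SAMPLE = """<rss version="2.0" xmlns:wp="http://wordpress.org/export/1.2/">
<channel>
<item>
<title>First</title>
<category domain="category" nicename="vba"><![CDATA[VBA]]></category>
<category domain="post_tag" nicename="excel"><![CDATA[Excel]]></category>
</item>
<item>
<title>Second</title>
<category domain="category" nicename="vba"><![CDATA[VBA]]></category>
</item>
</channel></rss>"""


def keyword_counts(xml_text):
    """Count how often each category or tag name is assigned to an item."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for item in root.iter("item"):
        for cat in item.findall("category"):
            # Categories and tags are both treated as keywords.
            if cat.get("domain") in ("category", "post_tag") and cat.text:
                counts[cat.text.strip()] += 1
    return counts


if __name__ == "__main__":
    for keyword, count in keyword_counts(SAMPLE).most_common():
        print(keyword, count)
```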

It only extracts items with post_type='post'. The VBA code can be modified to get static pages or, if you have a WooCommerce site, to extract products.
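
The post_type filter, together with the namespace and CDATA handling, can be sketched in Python as follows. This is an illustration under assumptions, not the VBA from the spreadsheet: the namespace URIs are the ones commonly found in WordPress exports (the version number in the wp URI can differ between exports), and the sample XML and the function name extract_items are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check the xmlns declarations at the top of your own export.
NS = {
    "wp": "http://wordpress.org/export/1.2/",
    "content": "http://purl.org/rss/1.0/modules/content/",
}

# Made-up sample in the shape of a WordPress export.
SAMPLE = """<rss version="2.0"
 xmlns:content="http://purl.org/rss/1.0/modules/content/"
 xmlns:wp="http://wordpress.org/export/1.2/">
<channel>
<item><title>A post</title>
<content:encoded><![CDATA[Line one

Line two]]></content:encoded>
<wp:post_type>post</wp:post_type></item>
<item><title>About</title>
<content:encoded><![CDATA[Static page]]></content:encoded>
<wp:post_type>page</wp:post_type></item>
</channel></rss>"""


def extract_items(xml_text, post_type="post"):
    """Return (title, body) pairs for items whose wp:post_type matches."""
    root = ET.fromstring(xml_text)
    found = []
    for item in root.iter("item"):
        # Namespaced elements need the prefix map; plain RSS elements do not.
        if item.findtext("wp:post_type", default="", namespaces=NS) != post_type:
            continue
        title = item.findtext("title", default="")
        # The parser returns the CDATA content as ordinary element text.
        body = item.findtext("content:encoded", default="", namespaces=NS)
        found.append((title, body))
    return found


if __name__ == "__main__":
    for title, body in extract_items(SAMPLE):
        print(title, repr(body))
```

Passing post_type="page" would pick up static pages instead, and a WooCommerce export should work the same way with post_type="product".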

The spreadsheet can be downloaded: schWordPressToolBox.xls



Revisions:
[19/10/2015] Original
[23/04/2016] Changed download links to MiScion Pty Ltd Web Store