Parsing Wikipedia

From HemlWiki

Goals

The first goal of this project is to extract a large set of reliable, geolocated historical events from the Wikipedia XML file and to encode them in the CIDOC-CRM RDF variant used by the Heml Fawcett tools.

Approach

  1. Processing in parallel requires that the file be split. Provided that the open and close tags <page> and </page> never appear on the same line as each other (a condition that can be tested with a grep), the following awk one-liner seems to work well: awk 'BEGIN{pageCounter=0}/<page>/ {pageCounter++; print "PAGE"pageCounter"OPEN";} {print > "page_"pageCounter;} /<\/page>/ {print "PAGECLOSE"; close("page_"pageCounter);}' input.xml. It creates a set of serially numbered files, each corresponding to one <page> element within the Wikipedia XML file. The awk that comes with Ubuntu 8.04 is not compiled to handle files over 2 GB in size, so it cannot parse the entire 19 GB Wikipedia file in one pass. Either the text can be subdivided first using the unix 'head' command, or the GNU utilities can be compiled specifically for this task. In total, there will be around 6,700,000 of these files.
  2. XML-ize some of the wiki markup, especially the markup found around Infoboxes, the richest source of structured markup for events. Replace all {{ with <doubleBraces> and all }} with </doubleBraces>. Then record what each opening tag begins with as an attribute, replacing, for instance, <doubleBraces>Infobox Settlement with <doubleBraces context="Infobox_Settlement">. This avoids having to parse the nesting of the {{ markup, but still makes its features available to XPath in XSLT.
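
The brace replacement in step 2 could be sketched roughly as follows. This is a minimal Python illustration, not the project's actual script: the function name xmlize_double_braces is hypothetical, and it assumes that the template name is whatever follows {{ up to the first |, brace, angle bracket, or newline, and that the input is the already XML-escaped text taken from one of the page files.

```python
import re

# Assumption: the template name is the text after "{{" up to the first
# "|", brace, angle bracket, or newline.
OPEN_WITH_NAME = re.compile(r"\{\{([^|{}<>\n]*)")


def _open_tag(match):
    """Build the opening tag, moving the template name into an attribute."""
    name = match.group(1).strip()
    if not name:
        # Nothing usable follows the braces: emit a bare opening tag.
        return "<doubleBraces>"
    # Whitespace becomes underscores, as in the Infobox_Settlement example.
    context = re.sub(r"\s+", "_", name).replace('"', "&quot;")
    return '<doubleBraces context="%s">' % context


def xmlize_double_braces(wikitext):
    """Turn {{ ... }} into <doubleBraces context="..."> ... </doubleBraces>."""
    marked = OPEN_WITH_NAME.sub(_open_tag, wikitext)
    return marked.replace("}}", "</doubleBraces>")


if __name__ == "__main__":
    sample = "{{Infobox Settlement\n|name = Kingston\n}}"
    print(xmlize_double_braces(sample))
```

Because opening and closing braces are replaced independently, nested templates come out as nested elements without any explicit tracking of depth. Triple-brace parameters ({{{ ... }}}) are not handled specially in this sketch and would need separate treatment.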