Rewriting templateProperty

Revision as of 20:27, 25 February 2015

TODO: Make an EN example using Infobox Politician --VladimirAlexiev 12:12, 25 February 2015 (UTC)

Intro

The basic way the extractor works is like this:

data is extracted from template props
these are emitted as language-specific raw props, eg
- http://dbpedia.org/property/parent for EN (usual prefix dbp:)
- http://bg.dbpedia.org/property/родител for BG (usual prefix bgdbp:)

the raw data is passed through mappings templateProperty -> ontologyProperty

~~You'd think that templateProperty is the same as the raw prop name. Yeah but not always.~~

The last part (data is passed through mappings) is wrong. The mapping-based extractor processes the Wikitext source, not the output of the InfoboxExtractor. A pipeline architecture would make a lot of sense, but that's not how DBpedia works. --Chrisahn 17:54, 25 February 2015 (UTC)

Here's what actually happens:

Wikitext is parsed into an AST
The AST is passed to several different extractors according to the configuration
Each extractor consumes the AST and produces triples. The triples are not used as input for any other extractor.
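The three steps above can be sketched as a small dispatcher. This is only an illustration: the flattened `Ast` dict, the two toy extractors and `run_extractors` are hypothetical names, not part of the extraction framework (which is written in Scala).

```python
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]
Ast = Dict[str, str]  # hypothetical, flattened stand-in for a parsed page
Extractor = Callable[[Ast], List[Triple]]


def title_extractor(ast: Ast) -> List[Triple]:
    # Toy extractor: emits the page title as a label.
    return [(ast["page"], "label", ast["title"])]


def template_extractor(ast: Ast) -> List[Triple]:
    # Toy extractor: emits the name of the template used on the page.
    return [(ast["page"], "usesTemplate", ast["template"])]


def run_extractors(ast: Ast, extractors: List[Extractor]) -> List[Triple]:
    triples: List[Triple] = []
    for extract in extractors:
        # Every extractor receives the same AST. The triples collected so far
        # are never handed to the next extractor, so this is not a pipeline.
        triples.extend(extract(ast))
    return triples


ast = {"page": "Example_Page", "title": "Example Page", "template": "Listen"}
triples = run_extractors(ast, [title_extractor, template_extractor])
```

The point of the sketch is the loop body: the only input any extractor sees is `ast`.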

Here's what the [https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/InfoboxExtractor.scala InfoboxExtractor] does:

data is extracted from template props in the AST
these are emitted as language-specific raw props, eg
- http://dbpedia.org/property/parent for EN (usual prefix dbp:)
- http://bg.dbpedia.org/property/родител for BG (usual prefix bgdbp:)

Here's what the [https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/MappingExtractor.scala MappingExtractor] does:

data is extracted from template props in the AST and passed through mappings templateProperty -> ontologyProperty
these are emitted as generic mapping-based props, eg
- http://dbpedia.org/ontology/parent for EN, BG and any other language (usual prefix dbo:)
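To make the difference between the two kinds of output concrete, here is a minimal sketch that builds both from the same template props. The function names and the `ex:subject` placeholder are made up for illustration. The sketch also assumes the raw prop name is used verbatim, which (as noted above) is not always the case in the real extractor.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def raw_property_iri(lang: str, name: str) -> str:
    # EN lives on the bare domain, other languages on a subdomain.
    host = "dbpedia.org" if lang == "en" else f"{lang}.dbpedia.org"
    return f"http://{host}/property/{name}"


def ontology_property_iri(name: str) -> str:
    # The ontology namespace is shared by all languages.
    return f"http://dbpedia.org/ontology/{name}"


def infobox_extract(subject: str, lang: str, props: Dict[str, str]) -> List[Triple]:
    # Assumption: no normalization of the template prop name.
    return [(subject, raw_property_iri(lang, k), v) for k, v in props.items()]


def mapping_extract(subject: str, props: Dict[str, str],
                    mapping: Dict[str, str]) -> List[Triple]:
    # mapping: templateProperty -> ontologyProperty; unmapped props are skipped.
    return [(subject, ontology_property_iri(mapping[k]), v)
            for k, v in props.items() if k in mapping]


props = {"родител": "Someone"}
raw = infobox_extract("ex:subject", "bg", props)
mapped = mapping_extract("ex:subject", props, {"родител": "parent"})
```

Both functions read `props` directly. Neither consumes the other's triples.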

Wikipedia Prop Structures

Many Wikipedia templates allow creating several instances of something. Eg the Listen template allows a Wikipedia editor to attach up to 11 soundRecording values to the subject, using "parallel" arrays of properties:

filename, filename1... filename10
title, title1... title10
description, description1... description10

The parallelism is reflected in a numeric suffix.
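Eg the "parallel" props of Listen can be regrouped into one record per recording by keying on the suffix. This is a sketch only: `group_parallel` is a hypothetical helper, and treating the unsuffixed props as index 0 is an assumption made here for sorting.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Base names taken from the Listen example; the suffix is zero or more digits.
PATTERN = re.compile(r"^(filename|title|description)(\d*)$")


def group_parallel(props: Dict[str, str]) -> List[Dict[str, str]]:
    groups: Dict[int, Dict[str, str]] = defaultdict(dict)
    for name, value in props.items():
        m = PATTERN.match(name)
        if m is None:
            continue  # not one of the parallel props
        base, suffix = m.groups()
        index = int(suffix) if suffix else 0  # assumption: no suffix means first
        groups[index][base] = value
    return [groups[i] for i in sorted(groups)]


props = {
    "filename": "a.ogg", "title": "A",
    "filename1": "b.ogg", "title1": "B", "description1": "second",
}
records = group_parallel(props)
```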

Good mappings take care of this by grouping the "parallel props" in separate IntermediateNodeMappings or a similar structure that can produce an "array". Eg the Listen mapping has a block like this 11 times, once per suffix:

  {{IntermediateNodeMapping | nodeClass = Sound | correspondingProperty = soundRecording | mappings =
    {{ PropertyMapping | templateProperty = type          | ontologyProperty = dc:type }}
    {{ PropertyMapping | templateProperty = filename1     | ontologyProperty = filename }}
    {{ PropertyMapping | templateProperty = title1        | ontologyProperty = title }}
    {{ PropertyMapping | templateProperty = description1  | ontologyProperty = description }}
  }}
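Writing that block 11 times by hand is error-prone, so one could generate the wikitext instead. The sketch below is not how the Listen mapping was actually produced. The helper names are hypothetical, and it assumes the first block uses the unsuffixed props and that `type` is shared by all blocks.

```python
def property_mapping(template_property: str, ontology_property: str) -> str:
    return ("{{ PropertyMapping | templateProperty = " + template_property
            + " | ontologyProperty = " + ontology_property + " }}")


def listen_node_mapping(suffix: str) -> str:
    lines = [
        "{{IntermediateNodeMapping | nodeClass = Sound"
        " | correspondingProperty = soundRecording | mappings =",
        "  " + property_mapping("type", "dc:type"),
    ]
    for base in ("filename", "title", "description"):
        # Only the template prop carries the suffix; the ontology prop does not.
        lines.append("  " + property_mapping(base + suffix, base))
    lines.append("}}")
    return "\n".join(lines)


# "" for the unsuffixed props, then "1" .. "10": 11 blocks in total.
suffixes = [""] + [str(i) for i in range(1, 11)]
blocks = [listen_node_mapping(s) for s in suffixes]
```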

Now consider Politicians. They may hold several Positions, each over several Mandates (they are nasty that way). For each Position/Mandate pair (say 5*3=15 of them), there's a bunch of props such as party, predecessor, successor, colleagues (eg vicePresident, governor...), years the subject came to that position, years the colleagues came to their respective positions, etc.

Eg see the prop names of Държавник_инфо ("Statesman info"), but that's not the complete story: there's also трети_мандат_* ("third mandate" fields) etc.

If the 2D data arrays below the photos of Rosen Plevneliev and Angela Merkel don't strike your fancy, check out one of them Socialists that ruled for 40 years: Тодор_Живков

See a full list of props and an incomplete attempt to group them all at Mapping Държавник_инфо.

Wikipedia editors were at a loss to create meaningful two-dimensional parallel arrays of names, so they created parasitic prefixes & suffixes that are not so easy to match up. Eg there are 10 props "предшестванОт" ("preceded by"), all mapped to "predecessor" but in different groups:

 предшестван от
 предшестван от2
 предшестван от3
 втори_мандат_предшестван от
 втори_мандат_предшестван от2
 втори_мандат_предшестван от3
 трети_мандат_предшестван от
 ...

The prefixes may have any form
The suffixes are digits, optionally followed by letters
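These two rules can be sketched as a regular expression that peels the prefix and suffix off a known base name. `split_prop` is a hypothetical helper, and the assumption that "letters" means any Unicode letter is mine.

```python
import re
from typing import Optional, Tuple


def split_prop(name: str, base: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, suffix) if name is base with a prefix/suffix, else None."""
    pattern = re.compile(
        r"^(?P<prefix>.*?)"            # prefix: any form, possibly empty
        + re.escape(base)              # the known base prop name
        + r"(?P<suffix>\d+[^\W\d_]*)?$"  # suffix: digits, then optional letters
    )
    m = pattern.match(name)
    if m is None:
        return None
    return m.group("prefix"), m.group("suffix") or ""


base = "предшестван от"
parts = [split_prop(n, base) for n in (
    "предшестван от",
    "предшестван от2",
    "втори_мандат_предшестван от3",
)]
```

Props that share the same (prefix, suffix) pair would then belong to the same group.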