Rewriting templateProperty
Latest revision as of 21:44, 25 February 2015
WARNING: Work in progress. The content of this page is largely incorrect.
TODO: Make an EN example using Infobox Politician --VladimirAlexiev 12:12, 25 February 2015 (UTC)
Intro
The basic way the extractor works is like this:
- data is extracted from template props
- these are emitted as language-specific raw props, eg
- http://dbpedia.org/property/parent for EN (usual prefix dbp:)
- http://bg.dbpedia.org/property/родител for BG (usual prefix bgdbp:)
- the raw data is passed through mappings templateProperty -> ontologyProperty
You'd think that templateProperty is the same as the raw prop name. Yeah but not always.
The last part (data is passed through mappings) is wrong. The mapping based extractor processes the Wikitext source, not the output of the InfoboxExtractor. A pipeline architecture would make a lot of sense, but that's not how DBpedia works. Chrisahn 17:54, 25 February 2015 (UTC)
Here's what actually happens:
- Wikitext is parsed into an AST (abstract syntax tree)
- The AST is passed to several different extractors according to the configuration
- Each extractor processes the AST and produces triples
- The triples are not used as input for any other extractors.
Here's what the InfoboxExtractor does:
- data is extracted from template props in the AST
- these are emitted as language-specific raw props, eg
- http://dbpedia.org/property/parent for EN (usual prefix dbp:)
- http://bg.dbpedia.org/property/родител for BG (usual prefix bgdbp:)
Here's what the MappingExtractor does:
- data is extracted from template props in the AST and passed through mappings templateProperty -> ontologyProperty
- these are emitted as generic mapping-based props, eg
- http://dbpedia.org/ontology/parent for EN, BG and any other language (usual prefix dbo:)
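The independence of the two extractors can be illustrated with a minimal Python sketch. It is not the framework's actual Scala code: `TemplateNode`, `infobox_extractor`, `mapping_extractor` and `run_extractors` are hypothetical names, the property value is a placeholder, and property names are used verbatim (any rewriting of names is ignored here). The point is only that both extractors read the same AST node and neither reads the other's triples.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class TemplateNode:
    """Hypothetical stand-in for one template node in the parsed AST."""
    page: str                # subject resource, e.g. "bgdbr:Example"
    props: Dict[str, str]    # template property name -> raw value


def infobox_extractor(node: TemplateNode, lang_prefix: str = "bgdbp:") -> List[Triple]:
    """Emit every template property as a language-specific raw property."""
    return [(node.page, lang_prefix + name, value)
            for name, value in node.props.items()]


def mapping_extractor(node: TemplateNode, mappings: Dict[str, str]) -> List[Triple]:
    """Emit only mapped properties, using the language-independent dbo: namespace."""
    return [(node.page, "dbo:" + mappings[name], value)
            for name, value in node.props.items()
            if name in mappings]


def run_extractors(node: TemplateNode,
                   extractors: List[Callable[[TemplateNode], List[Triple]]]) -> List[Triple]:
    """Pass the same AST node to every extractor; no extractor sees another's output."""
    triples: List[Triple] = []
    for extract in extractors:
        triples.extend(extract(node))
    return triples


# Placeholder page and value, for illustration only.
node = TemplateNode(page="bgdbr:Example", props={"родител": "Some Parent"})
triples = run_extractors(node, [
    infobox_extractor,
    lambda n: mapping_extractor(n, {"родител": "parent"}),
])
```

With this input, `triples` holds one raw triple (`bgdbp:родител`) and one mapping-based triple (`dbo:parent`), both derived directly from the node.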
Wikipedia Prop Structures
Many Wikipedia templates allow creating several instances of something. Eg Listen allows a Wikipedia editor to attach up to 11 soundRecording to the subject, using "parallel" arrays of properties:
- filename, filename1... filename10
- title, title1... title10
- description, description1... description10
The parallelism is reflected in a numeric suffix.
Good maps take care of this, by grouping the "parallel props" in separate IntermediateNodeMappings or a similar structure that can produce an "array". Eg mapping Listen has this 11 times:
{{IntermediateNodeMapping
 | nodeClass = Sound
 | correspondingProperty = soundRecording
 | mappings =
   {{ PropertyMapping | templateProperty = type | ontologyProperty = dc:type }}
   {{ PropertyMapping | templateProperty = filename1 | ontologyProperty = filename }}
   {{ PropertyMapping | templateProperty = title1 | ontologyProperty = title }}
   {{ PropertyMapping | templateProperty = description1 | ontologyProperty = description }}
}}
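Writing the same block 11 times by hand is error-prone, so one could generate the repeated blocks. The following Python sketch assumes the 11 groups are the unsuffixed props plus suffixes 1 to 10, and that `type` is not suffixed (as in the block above). The function names are hypothetical and the output layout is only one possible formatting.

```python
from typing import List


def listen_suffixes(count: int = 11) -> List[str]:
    """Return the suffixes of the parallel groups: '' for the first, then '1'..'10'."""
    return [""] + [str(i) for i in range(1, count)]


def intermediate_node_mapping(suffix: str) -> str:
    """Build one IntermediateNodeMapping block for the given parallel-group suffix."""
    lines = [
        "{{IntermediateNodeMapping",
        " | nodeClass = Sound",
        " | correspondingProperty = soundRecording",
        " | mappings =",
        "   {{ PropertyMapping | templateProperty = type | ontologyProperty = dc:type }}",
    ]
    for base in ("filename", "title", "description"):
        # Doubled braces in an f-string produce literal '{{' and '}}'.
        lines.append(
            f"   {{{{ PropertyMapping | templateProperty = {base}{suffix}"
            f" | ontologyProperty = {base} }}}}"
        )
    lines.append("}}")
    return "\n".join(lines)


blocks = [intermediate_node_mapping(s) for s in listen_suffixes()]
```

Joining `blocks` with blank lines gives wikitext that can be pasted into the mapping page and reviewed before saving.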
Now consider Politicians. They may hold several Positions, each over several Mandates (they are nasty that way). For each Position>Mandate (say 5*3=15), there's a bunch of props such as party, predecessor, successor, colleagues (eg vicePresident, governor...), years the subject came to that position, years the colleagues came to their respective positions, etc.
Eg see prop names of Държавник_инфо, but that's not the complete story: there's also трети_мандат_* ("third mandate" fields) etc.
- If the 2D data arrays below the photos of Rosen Plevneliev and Angela Merkel don't strike your fancy, check out one of them Socialists that ruled for 40 years: Тодор_Живков
See a full list of props and an incomplete attempt to group them all at Mapping Държавник_инфо.
Wikipedia editors were at a loss to create meaningful two-dimensional parallel arrays of names, so they created parasitic prefixes & suffixes that are not so easy to match up. Eg there are 10 props "предшестванОт", all mapped to "predecessor" but in different groups:
предшестван от
предшестван от2
предшестван от3
втори_мандат_предшестван от
втори_мандат_предшестван от2
втори_мандат_предшестван от3
трети_мандат_предшестван от
...
- The prefixes may have any form
- The suffixes are digits, optionally followed by letters
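To make the prefix/suffix structure concrete, here is a small Python sketch that splits a template property name into prefix, base name and suffix. Because prefixes may have any form, it assumes the base name (eg "предшестван от") is known in advance. The function `split_prop` and the type `PropParts` are hypothetical helpers, not part of the extraction framework.

```python
import re
from typing import NamedTuple, Optional


class PropParts(NamedTuple):
    prefix: str   # anything before the base name, e.g. "втори_мандат_"
    base: str     # the known base property name
    suffix: str   # digits, optionally followed by letters; "" if absent


def split_prop(name: str, base: str) -> Optional[PropParts]:
    """Split `name` around a known `base`; return None if `base` does not fit."""
    pattern = re.compile(
        r"^(?P<prefix>.*?)"             # prefix of any form (shortest match)
        + re.escape(base)
        + r"(?P<suffix>\d+[^\W\d_]*)?$"  # digits, then optional Unicode letters
    )
    match = pattern.match(name)
    if match is None:
        return None
    return PropParts(match.group("prefix"), base, match.group("suffix") or "")
```

For example, `split_prop("втори_мандат_предшестван от3", "предшестван от")` separates the "second mandate" prefix from the suffix `3`, which is the grouping information the flat property name otherwise hides.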
Rewriting templateProperty
The parasitic prefixes/suffixes encode important info about the grouping of props, but that info is not transmitted in any clear way.
Assume a mapping fragment like this, extracting data for resource bgdbr:Тодор_Живков
{{IntermediateNodeMapping
 | nodeClass = CareerStation
 | correspondingProperty = careerStation
 | mappings =
   {{ PropertyMapping | templateProperty = втори_мандат_предшестван от3 | ontologyProperty = predecessor }}
}}
What the extractor really does is:
- No it doesn't. See above. Chrisahn 18:06, 25 February 2015 (UTC)
- Takes data from the templateProperty provided (as expected)
- Strips parasitic prefixes & suffixes from the templateProperty (maybe unexpected) and converts to camelCase
- Emits the data using the original subject and this rewritten templateProperty, eg:
bgdbr:Тодор_Живков bgdbp:предшестванОт
- Makes an IntermediateNode and connects it with correspondingProperty (as expected), eg:
bgdbr:Тодор_Живков dbo:careerStation bgdbr:Тодор_Живков__1
- Emits the data using the IntermediateNode and the ontologyProperty as provided (as expected), eg:
bgdbr:Тодор_Живков__1 dbo:predecessor
This achieves several goals:
- the general semantics of the raw property is preserved, but not its grouping
- the grouping is preserved by the creation of IntermediateNodes that use mapped properties (if the mapping is good)
This allows you to make queries such as:
- all predecessors of Тодор_Живков lumped together (regardless of the position). This works even if these raw props are not mapped!
select * {bgdbr:Тодор_Живков bgdbp:предшестванОт ?pred}
- all predecessors of Тодор_Живков, paired with successors, and the corresponding position name (office). (Note: you may want to throw in some OPTIONALs)
select * {bgdbr:Тодор_Живков dbo:careerStation [dbo:predecessor ?pred; dbo:successor ?succ; dbo:office ?office]}
Neat!
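The difference between the two query styles can be shown on a toy in-memory triple list in Python. Everything under `ex:` is a made-up placeholder, not real DBpedia data, and the helper functions are hypothetical. The successor is left out to keep the sketch short.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

PERSON = "bgdbr:Тодор_Живков"

# Placeholder data: the ex: values are invented for illustration only.
triples: List[Triple] = [
    (PERSON, "bgdbp:предшестванОт", "ex:PersonA"),
    (PERSON, "bgdbp:предшестванОт", "ex:PersonB"),
    (PERSON, "dbo:careerStation", PERSON + "__1"),
    (PERSON, "dbo:careerStation", PERSON + "__2"),
    (PERSON + "__1", "dbo:predecessor", "ex:PersonA"),
    (PERSON + "__1", "dbo:office", "ex:OfficeX"),
    (PERSON + "__2", "dbo:predecessor", "ex:PersonB"),
    (PERSON + "__2", "dbo:office", "ex:OfficeY"),
]


def objects(data: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects for a given subject and predicate, in input order."""
    return [o for (s, p, o) in data if s == subject and p == predicate]


def lumped_predecessors(data: List[Triple], person: str) -> List[str]:
    """Raw-property view: all predecessors together, grouping lost."""
    return objects(data, person, "bgdbp:предшестванОт")


def grouped_predecessors(data: List[Triple], person: str) -> List[Tuple[str, str]]:
    """Mapped view: each predecessor paired with the office of its career station."""
    rows = []
    for station in objects(data, person, "dbo:careerStation"):
        for pred in objects(data, station, "dbo:predecessor"):
            for office in objects(data, station, "dbo:office"):
                rows.append((pred, office))
    return rows
```

The first function corresponds to the query on the raw property; the second mirrors the join through the intermediate node and only works if the mapping created those nodes.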
NOTE: Currently only purely numeric parasitic suffixes are stripped. Prefixes and alphanumeric suffixes would be stripped once issue #317 is implemented.
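The name rewriting as described on this page (strip a purely numeric suffix, then convert to camelCase) might look roughly like the Python sketch below. This is an illustration of the described behaviour only, not the framework's code: the function names are hypothetical, and splitting words on spaces and underscores is an assumption.

```python
import re


def strip_numeric_suffix(name: str) -> str:
    """Drop trailing digits only; alphanumeric suffixes such as '2a' are left alone."""
    return re.sub(r"\d+$", "", name)


def to_camel_case(name: str) -> str:
    """Join words separated by spaces or underscores, capitalising all but the first."""
    words = [w for w in re.split(r"[ _]+", name) if w]
    if not words:
        return ""
    head, tail = words[0], words[1:]
    return head + "".join(w[0].upper() + w[1:] for w in tail)


def rewrite_template_property(name: str) -> str:
    """Apply suffix stripping, then camelCase conversion."""
    return to_camel_case(strip_numeric_suffix(name))
```

Under these assumptions `предшестван от3` becomes `предшестванОт`, while a prefixed name such as `втори_мандат_предшестван от3` keeps its prefix and becomes `вториМандатПредшестванОт`, which is the limitation the note refers to.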