Cleaning up Countries
Latest revision as of 11:43, 23 March 2017
Countries on EN DBpedia
select (count(*) as ?n) { ?country a dbo:Country }
returns 1694 countries, far more than the roughly 200 sovereign states that exist. Obviously this won't do.
Let's analyze the reasons for this pollution and try to fix it.
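To make the count reproducible, here is a minimal Python sketch that builds the request URL for the query above. It assumes the public endpoint at https://dbpedia.org/sparql and the Virtuoso-style `format` parameter; the names `ENDPOINT`, `COUNT_QUERY` and `build_query_url` are made up for this example, and the current count may differ from 1694.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint (Virtuoso).
ENDPOINT = "https://dbpedia.org/sparql"

# Same count as above, with the dbo: prefix spelled out.
COUNT_QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (COUNT(*) AS ?n) { ?country a dbo:Country }
"""


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": query.strip(),
        # Assumption: Virtuoso accepts a "format" parameter for the result type.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    # Paste the printed URL into a browser or fetch it with any HTTP client.
    print(build_query_url(COUNT_QUERY))
```

The sketch only builds the URL and does not fetch it, so it can be checked offline.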
Countries on EN Wikipedia
There are 638 Infobox_country instances. Most of them are international organizations (free trade zones, unions, etc.). On the other hand, the template's transclusion count says 1137.
The following templates redirect to Template:Infobox country:
- Template:Infobox Countries
- Template:Infobox Country
- Template:Infobox Country or territory
- Template:Infobox Geopolitical organisation
- Template:Infobox Geopolitical organization
- Template:Infobox Micronation
- Template:Infobox geopolitical organisation
- Template:Infobox geopolitical organization
- Template:Infobox micronation
- Template:Infobox nation
DBpedia works only with the target template (Infobox_country), so these redirects are not a problem.
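As an illustration of what "works only with the target template" means, the redirect list above can be folded into a small lookup. This is only a sketch, not how the extraction framework resolves redirects; `resolve_template` and the constant names are hypothetical.

```python
# Redirects listed above; all of them point at the target template.
REDIRECTS_TO_INFOBOX_COUNTRY = {
    "Infobox Countries",
    "Infobox Country",
    "Infobox Country or territory",
    "Infobox Geopolitical organisation",
    "Infobox Geopolitical organization",
    "Infobox Micronation",
    "Infobox geopolitical organisation",
    "Infobox geopolitical organization",
    "Infobox micronation",
    "Infobox nation",
}
TARGET = "Infobox country"


def resolve_template(name: str) -> str:
    """Return the target template for a known redirect, else the cleaned name."""
    if name.startswith("Template:"):
        name = name[len("Template:"):]
    # MediaWiki treats underscores and spaces in titles as equivalent.
    normalized = name.replace("_", " ").strip()
    if normalized in REDIRECTS_TO_INFOBOX_COUNTRY:
        return TARGET
    return normalized


if __name__ == "__main__":
    print(resolve_template("Template:Infobox_Geopolitical_organization"))
```

With this lookup, a page using any of the ten redirects is treated exactly like a page using Infobox country, which is why a single mapping covers them all.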
Analysis Needed
- Why are there more dbo:Country instances in DBpedia (1694) than "Infobox country" instances in Wikipedia (638)?
- Why does the transclusion count (1137) show more than the number of "Infobox country" uses (638)?
- Many pages have a "type" that allows us to map to a better class (eg https://en.wikipedia.org/wiki/Eurasian_Economic_Space has Type "Single market"). Eg #296 (https://github.com/dbpedia/extraction-framework/issues/296), "Why Infobox_Geopolitical_organization (eg United_Nations) is mapped to Country", is resolved in this way.
- But many other instances remain, eg United_Nations_Transitional_Authority_in_Cambodia: analyze them and see if there's some discriminator other than "type".
- How to filter out the sports organizations, eg Cricket_Samoa, IBA_Asia, Great_Britain_men%27s_national_basketball_team?
- How about country-related articles, eg Radio_in_the_United_States, Human_trafficking_in_the_United_States, History_of_the_Jews_in_20th-century_Poland?
- How about articles that are not even country-related, eg Comic_book_collecting, Record_collecting?
- How about admin locations (eg cities) that are not countries, eg Russian_Dalian?
- How about non-administrative locations, eg Reñaca_beach?
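One way to start on the first question is a plain set difference: take the resources typed dbo:Country and subtract the pages that actually use Infobox country. The sketch below assumes both lists have already been exported somehow; `unexplained_countries` is a hypothetical helper, and the sample input is a toy built only from pages named on this page, not a real export.

```python
from typing import Iterable, List


def unexplained_countries(
    typed_as_country: Iterable[str],
    uses_infobox_country: Iterable[str],
) -> List[str]:
    """Resources typed dbo:Country whose page does not use Infobox country."""
    return sorted(set(typed_as_country) - set(uses_infobox_country))


if __name__ == "__main__":
    # Toy input for illustration only.
    typed = [
        "Eurasian_Economic_Space",
        "United_Nations",
        "Great_Britain_men%27s_national_basketball_team",
    ]
    with_infobox = ["Eurasian_Economic_Space", "United_Nations"]
    for title in unexplained_countries(typed, with_infobox):
        print(title)
```

Whatever remains in the difference gets its dbo:Country type from somewhere other than the Infobox country mapping and needs a separate explanation.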
Infobox national basketball team
Great_Britain_men%27s_national_basketball_team uses the template Infobox national basketball team, which is not mapped in the mappings wiki. But why does it come out as dbo:Country?
The extraction sample doesn't have any type:
http://mappings.dbpedia.org/server/extraction/en/extract?title=Great_Britain_men%27s_national_basketball_team&format=turtle-triples
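To check other suspicious pages the same way, the extraction sample URL can be generated from a page title. This sketch just reproduces the URL pattern shown above; `extraction_url` is a name invented for the example, and whether the server still answers at this address is not verified here.

```python
from urllib.parse import quote

# URL pattern taken from the extraction sample link above.
BASE = "http://mappings.dbpedia.org/server/extraction/en/extract"


def extraction_url(title: str, fmt: str = "turtle-triples") -> str:
    """Build the extraction sample URL for a Wikipedia page title."""
    # safe="" so that characters such as the apostrophe are percent-encoded.
    return f"{BASE}?title={quote(title, safe='')}&format={fmt}"


if __name__ == "__main__":
    print(extraction_url("Great_Britain_men's_national_basketball_team"))
```

The title is passed with a literal apostrophe; the function takes care of the %27 escaping.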
Tasks
From #296, we need to:
- Merge Mapping_en:Infobox_Geopolitical_organization into Mapping_en:Infobox country, because the former template redirects to the latter.
- Discriminate on one of these fields; I have asked the mailing list how to pick one
|org_type = <!-- e.g. Trade bloc -->
|membership_type = <!-- (default "Membership") -->
|membership = <!-- Type/s and/or number/s of members -->
- Map to the subclass GeopoliticalOrganisation
- Maybe map the Template:United_Nations properties list1 and list2, to capture the sub-organizations.
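The discriminator idea from the tasks above can be sketched in a few lines of Python. This is not the mapping-wiki syntax and not a decided design: which field to test is still an open question, so `org_type` as the default is purely an assumption, and `choose_class` is a hypothetical helper.

```python
from typing import Mapping


def choose_class(infobox: Mapping[str, str], discriminator: str = "org_type") -> str:
    """Pick an ontology class name for one "Infobox country" instance.

    Assumption: a non-empty discriminator field marks an organization.
    The candidate fields are org_type, membership_type and membership.
    """
    value = infobox.get(discriminator, "")
    if value.strip():
        return "GeopoliticalOrganisation"
    return "Country"


if __name__ == "__main__":
    print(choose_class({"org_type": "Trade bloc"}))
    print(choose_class({"common_name": "France"}))
```

Keeping the field name as a parameter makes it easy to try each of the three candidates against real infobox data before settling on one.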
Any takers?