Cleaning up Countries: Difference between revisions

From DBpedia Mappings

Latest revision as of 11:43, 23 March 2017

Countries on EN DBpedia

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT (COUNT(*) AS ?count) {
  ?country a dbo:Country
}

returns 1694 countries. Obviously this won't do: there are only about 200 sovereign states.

Let's analyze the reasons for this pollution and try to fix it.
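
One way to start this analysis is to break the count down by the infobox template each "country" page uses. The sketch below only builds the request URL for such a query. It assumes the public endpoint at https://dbpedia.org/sparql and the property dbp:wikiPageUsesTemplate, neither of which is mentioned on this page, so both should be checked against the release being analyzed.

```python
from urllib.parse import urlencode

# Assumption: the public DBpedia SPARQL endpoint (not named on this page).
ENDPOINT = "https://dbpedia.org/sparql"

# Assumption: dbp:wikiPageUsesTemplate records which templates a page uses.
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?template (COUNT(?country) AS ?n) {
  ?country a dbo:Country .
  OPTIONAL { ?country dbp:wikiPageUsesTemplate ?template }
}
GROUP BY ?template
ORDER BY DESC(?n)
"""


def build_request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {
        "query": query.strip(),
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_request_url(QUERY))
```

Templates with a high count that are not Infobox_country would be the first candidates to inspect.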

Countries on EN Wikipedia

There are 638 Infobox_country instances, and most of them are international organizations (free trade zones, unions, etc). The template's transclusion count, on the other hand, says 1137.

The following templates redirect to Template:Infobox country:

  • Template:Infobox Countries
  • Template:Infobox Country
  • Template:Infobox Country or territory
  • Template:Infobox Geopolitical organisation
  • Template:Infobox Geopolitical organization
  • Template:Infobox Micronation
  • Template:Infobox geopolitical organisation
  • Template:Infobox geopolitical organization
  • Template:Infobox micronation
  • Template:Infobox nation

DBpedia works only with the target template (Infobox_country), so these redirects are not a problem.
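
The redirect handling can be pictured as a simple lookup from each redirecting name to the target template. This is only an illustrative sketch, not the extraction framework's actual code; the names REDIRECTS and resolve_template are hypothetical.

```python
# The redirect target and the redirecting templates listed above.
TARGET = "Template:Infobox country"

REDIRECTS = {
    name: TARGET
    for name in [
        "Template:Infobox Countries",
        "Template:Infobox Country",
        "Template:Infobox Country or territory",
        "Template:Infobox Geopolitical organisation",
        "Template:Infobox Geopolitical organization",
        "Template:Infobox Micronation",
        "Template:Infobox geopolitical organisation",
        "Template:Infobox geopolitical organization",
        "Template:Infobox micronation",
        "Template:Infobox nation",
    ]
}


def resolve_template(name: str) -> str:
    """Return the template a mapping applies to (hypothetical helper).

    Redirecting names resolve to their target; any other name is returned
    unchanged.
    """
    return REDIRECTS.get(name, name)


if __name__ == "__main__":
    print(resolve_template("Template:Infobox Micronation"))
```

A side effect of this resolution is that a page using Infobox Geopolitical organization picks up the same mapping as a page using Infobox country.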

Analysis Needed

  • Why are there more dbo:Country instances in DBpedia (1694) than "Infobox country" uses in Wikipedia (638)?
  • Why does the "transclusion count" (1137) show more than the number of "Infobox country" uses?
  • Many pages have a "type" that allows us to map to a better class (eg https://en.wikipedia.org/wiki/Eurasian_Economic_Space has Type "Single market"). Eg #296 "Why Infobox_Geopolitical_organization (eg United_Nations) is mapped to Country" is resolved in this way.
  • But many other instances remain, eg United_Nations_Transitional_Authority_in_Cambodia: analyze and see if there's some other discriminator than "type"
  • How to filter out the sports organizations, eg Cricket_Samoa, IBA_Asia, Great_Britain_men%27s_national_basketball_team?
  • How about country-related articles, eg Radio_in_the_United_States, Human_trafficking_in_the_United_States, History_of_the_Jews_in_20th-century_Poland?
  • How about articles that are not even country-related, eg Comic_book_collecting, Record_collecting?
  • How about administrative locations (eg cities) that are not countries, eg Russian_Dalian?
  • How about non-administrative locations, eg Reñaca_beach?

Infobox national basketball team

Great_Britain_men%27s_national_basketball_team uses the template Infobox national basketball team, which is not mapped in the wiki. So why does it come out as dbo:Country?

The extraction sample doesn't have any type... http://mappings.dbpedia.org/server/extraction/en/extract?title=Great_Britain_men%27s_national_basketball_team&format=turtle-triples
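
To repeat this check for other suspicious pages, the extraction URL can be built from the page title, and the returned text scanned for type statements. This is a minimal sketch: the helper names are hypothetical, and it assumes the turtle-triples output spells out the full rdf:type IRI, which should be verified against a real response.

```python
from urllib.parse import urlencode

# Extraction server URL, taken from the sample link above.
BASE = "http://mappings.dbpedia.org/server/extraction/en/extract"

# Assumption: type statements appear with the full rdf:type IRI.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def extraction_url(title: str, fmt: str = "turtle-triples") -> str:
    """Build the sample-extraction URL for a page title (hypothetical helper)."""
    return BASE + "?" + urlencode({"title": title, "format": fmt})


def has_rdf_type(triples: str) -> bool:
    """Return True if any line of the extracted text mentions rdf:type."""
    return any(RDF_TYPE in line for line in triples.splitlines())


if __name__ == "__main__":
    print(extraction_url("Great_Britain_men's_national_basketball_team"))
```

The apostrophe in the title is percent-encoded as %27, so the printed URL matches the link above.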

Tasks

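From #296 (https://github.com/dbpedia/extraction-framework/issues/296), we need to:

  • Merge Mapping_en:Infobox_Geopolitical_organization into Mapping_en:Infobox country, because the former template redirects to the latter.
  • Discriminate on one of these fields (I asked the mailing list how to pick one):

|org_type =          <!-- e.g. Trade bloc -->
|membership_type =   <!-- (default "Membership") -->
|membership =        <!-- Type/s and/or number/s of members -->

  • Map to the subclass GeopoliticalOrganisation
  • Maybe map Template:United_Nations props list1 and list2, to capture the sub-orgs.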
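
The discrimination step could work roughly as sketched below. Which field to test is still an open question, so this sketch treats any non-empty one of the three fields as a sign of an organization. The function name choose_class and the sample field common_name are illustrative only, and this is not the mapping wiki's actual conditional-mapping syntax.

```python
from typing import Mapping

# Candidate discriminator fields from the list above. Which one to use is
# still undecided, so this sketch accepts any of them.
DISCRIMINATORS = ("org_type", "membership_type", "membership")


def choose_class(infobox: Mapping[str, str]) -> str:
    """Pick an ontology class for one infobox instance (hypothetical helper).

    Returns "GeopoliticalOrganisation" if any discriminator field has a
    non-blank value, otherwise "Country".
    """
    for field in DISCRIMINATORS:
        if infobox.get(field, "").strip():
            return "GeopoliticalOrganisation"
    return "Country"


if __name__ == "__main__":
    print(choose_class({"org_type": "Trade bloc"}))
    print(choose_class({"common_name": "France"}))
```

Note that membership_type has a default value in the template, so it may be absent from the page source even when the page describes an organization.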

Any takers?