How to add a mapping namespace: Difference between revisions
(→Extract data from Wikipedia dump file: InfoboxExtractorConfig.extractTemplateStatistics must be true) |
Revision as of 07:19, 5 February 2014
As an example, we use the fictitious language Xxyzish with Wikipedia domain xx and Wikipedia rank 44.
CAUTION: Several subtle code changes will be needed to accommodate language codes that contain a dash (e.g. roa-rup or be-x-old), especially for URLs, file names and other identifiers. These changes may also affect parts of the code base not listed here. In this case, please update the code and this guide.
TODO: Many links on this page still point to the Mercurial repository on SourceForge, but the DBpedia source code now lives on GitHub.
Get language code and rank
Get the wiki language code and rank from the list of Wikipedias.
Namespace number: multiply the rank by 2 and add 200
Example: language code "xx", rank 44, namespace number 288.
CAUTION: If the calculated namespace number already exists for another language (because the ranking has changed) do not change the existing namespace number. Please find a neighboring or close enough number that works.
Example: if 288 is in use, we choose some other number that is not used, let's say 298.
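The rule above can be sketched as a small shell function. This is only an illustration: the name namespace_number and the idea of passing the already-used numbers as arguments are hypothetical, and stepping up by 2 on a collision is an assumption (it keeps the number even, following the MediaWiki convention that an even subject namespace is paired with the odd talk namespace after it). The guide itself only asks for some unused number nearby.

```shell
# Hypothetical helper: compute a namespace number from a Wikipedia rank.
# Usage: namespace_number RANK [USED_NUMBER...]
namespace_number() {
    rank=$1
    shift
    # Rule from this guide: multiply the rank by 2 and add 200.
    n=$((rank * 2 + 200))
    # Assumption: on a collision, move up in steps of 2 until a free number is found.
    while :; do
        taken=no
        for used in "$@"; do
            if [ "$used" -eq "$n" ]; then
                taken=yes
            fi
        done
        [ "$taken" = no ] && break
        n=$((n + 2))
    done
    echo "$n"
}

namespace_number 44          # prints 288
namespace_number 44 288 290  # prints 292
```

Any unused number is acceptable, so the result of the second call is just one possible choice; the example above picks 298 instead.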
Update mappings wiki
Update MediaWiki settings
Log onto the machine that is running this mappings wiki, i.e. serving http://mappings.dbpedia.org/index.php URLs.
Open LocalSettings.php. Add the following snippet in the correct alphabetical position in the map defining the extra namespaces:
"xx"=>288,
Restart the Apache server.
Add the mappings main page
Edit https://mappings.dbpedia.org/index.php/Mapping_xx. The page content should be the following, where Xxyzish is the English name of the language:
{{Mapping main page|xx|Xxyzish}}
Update mappings wiki sidebar
Edit MediaWiki:Sidebar. Add a link for the new language in the correct alphabetical position:
** Mapping xx|Mappings (xx)
Update datasets overview
Edit DBpedia datasets. Add a column for the new language in the correct alphabetical position and update all rows according to the settings in dump/extraction.default.properties. This is probably the most tedious part...
Update the extraction framework
Edit Namespace.scala
Edit your copy of core/src/main/scala/org/dbpedia/extraction/wikiparser/Namespace.scala. Add something like this in the correct alphabetical position:
"xx"->288,
Edit extract.default.properties
Edit your copy of dump/extract.default.properties. Add something like this in the correct alphabetical position:
extractors.xx=MappingExtractor
You can add more extractors, but make sure that the required configuration exists for the new language.
Update namespace settings for mappings wiki
To update the namespace settings for the mappings wiki, cd to core/ and run
../clean-install-run generate-settings
Commit changes
Commit and push the changes to the default branch.
Update and restart the mapping server
Log onto the machine that is running the mapping server, i.e. serving http://mappings.dbpedia.org/server/ URLs.
Stop the server:
sudo /etc/init.d/mappings-server stop
Or, if there's no start/stop script:
ps axfu | grep java
Look for class ...server.Server, and then:
kill <process id>
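The manual lookup can be narrowed with a small filter. The sketch below is an assumption about how one might do it, not part of the server's tooling: server_pids is a hypothetical name, and it relies on the process ID being the second column of ps axfu output.

```shell
# Hypothetical helper: read `ps axfu`-style lines on stdin and print the
# process ID (second column) of each Java process that runs a class whose
# name contains server.Server.
server_pids() {
    awk '/java/ && /server\.Server/ { print $2 }'
}

# Possible usage on the server machine (review the output before killing anything):
#   ps axfu | server_pids
#   kill <process id>
```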
Add a dummy file extraction_framework/server/src/main/statistics/mappingstats_xx.txt with the following content (make sure there are two empty lines at the end!):
wikiStats|xx redirects|0 templates|0
Then update and compile the server:
cd extraction_framework
git pull
mvn clean install --projects core,server
Finally, start the server:
sudo /etc/init.d/mappings-server start
Or, if there's no start/stop script:
cd extraction_framework/server
../run server &>server-<YYYY>-<MM>-<DD>.01.log &
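The log file name in the command above can be generated instead of typed by hand. This is a minimal sketch: server_log_name is a hypothetical helper, and reading the .01 part as a per-day run counter is an assumption that this guide does not confirm.

```shell
# Hypothetical helper: build a log file name such as server-2014-02-05.01.log.
# Usage: server_log_name YYYY-MM-DD [RUN_NUMBER]
server_log_name() {
    day=$1
    # Assumption: the two-digit suffix counts server starts on that day.
    run=${2:-1}
    printf 'server-%s.%02d.log\n' "$day" "$run"
}

# Name for today's first start, using the system date.
server_log_name "$(date +%Y-%m-%d)"
```

With such a helper, the start command could redirect to "$(server_log_name "$(date +%Y-%m-%d)")" instead of a hand-written file name.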
Generate and deploy statistics
Extract data from Wikipedia dump file
Download the latest dump for language xx. See https://github.com/dbpedia/extraction-framework/wiki/Extraction-Instructions for details.
dump> ../run download config={download-config-file}
Run RedirectExtractor, InfoboxExtractor and TemplateParameterExtractor. dump/extraction.stats.properties (https://github.com/dbpedia/extraction-framework/blob/master/dump/extraction.stats.properties) should already contain the correct settings. cd into directory dump/ and execute
dump> ../run stats-extraction extraction.stats.properties
Extract statistics from triples files
cd into directory server/, modify the path to the dump base dir in pom.xml if necessary and run
server> ../run stats
Copy src/main/statistics/mappingstatistics_xx.txt to the same folder on the mappings server.
Update and deploy sprint stuff
Ask Pablo how to do that...