Submissions/Wikidata Toolkit: A Java library for working with Wikidata
This is an accepted submission for Wikimania 2014.
- Submission no. 5053
- Title of the submission
Wikidata Toolkit: A Java library for working with Wikidata
- Type of submission (discussion, hot seat, panel, presentation, tutorial, workshop)
Tutorial
- Author of the submission
Markus Krötzsch (contact author), Michael Günther (co-presenter), Julian Mendez (co-presenter)
- E-mail address
markus.kroetzschtu-dresden.de
- Username
- Country of origin
Germany
- Affiliation, if any (organisation, company etc.)
- Personal homepage or blog
- Abstract (at least 300 words to describe your proposal)
Wikidata Toolkit is a Java library that greatly simplifies using data from Wikidata or other Wikibase installations in your programs. It provides data structures to mirror all Wikibase data in Java, and convenient facilities to load, manipulate, analyse, and query such data. The primary goal of the project is to enable new and innovative applications around Wikidata, and thus to serve a wider community of developers, researchers, and practitioners who are eager to take advantage of this new data resource.
However, the project is very recent and thus not widely known yet. Development is supported by an Individual Engagement Grant of the WMF that runs from February to August 2014, so the initial funding phase will have only just finished by the time of Wikimania. It is therefore an ideal time to present the features, provide help to current users, and discuss next steps with the community.
This tutorial provides a practical introduction to the Wikidata Toolkit for the working Java developer. The goal of this initial introduction is to explain the overall architecture and programming facilities that the library provides, and to enable participants to develop their own data-driven applications.
The planned structure of the tutorial is as follows:
- Feature overview: what Wikidata Toolkit can do for you
- Main components: which parts do you actually need
- The Wikidata data model for the working developer
- My first application: a data-driven equivalent of "Hello World"
- Towards serious applications: further examples explained
- Performance considerations: how big a machine you might need
- Wikidata Toolkit workshop: bring your own questions
Because the overall time is quite short, extensive hands-on sessions are not included here, but a few developers will be around to help with practical problems, including during the breaks after the tutorial.
Although Java is the programming language used by Wikidata Toolkit, the tutorial should also be of interest to developers working in other languages. On the one hand, the toolkit can still be a valuable resource for pre-processing data to be used in other software. On the other hand, it provides reference implementations of several key mechanisms and data structures that are useful for working with Wikidata.
- Track
- Technology, Interface & Infrastructure
- Length of session (if other than 30 minutes, specify how long)
- 30 minutes
- Will you attend Wikimania if your submission is not accepted?
Yes
- Slides or further information (optional)
- Slides for the talk
- Information about the Wikidata Toolkit is found on the project homepage
- Some relevant application fields are outlined in the paper Wikidata: A Free Collaborative Knowledge Base (with Denny Vrandecic).
- A wider introduction to Wikidata usage for a non-technical audience is given in Submissions/How to use Wikidata: Things to make and do with 30 million statements (Open Data track). In contrast to that presentation, this tutorial is specifically about the recent Wikidata Toolkit IEG project. The target audience here is developers, and the tutorial will be delivered by several developers of the Wikidata Toolkit project.
- Code used in the demo:
- The class WikimaniaExample used for the (slow version of the) iteration over all Wikidata items; this code will work on Wikidata Toolkit 0.2.0:
package org.wikidata.wdtk.examples;

import org.wikidata.wdtk.dumpfiles.DumpProcessingController;
import org.wikidata.wdtk.dumpfiles.MwRevision;
import org.wikidata.wdtk.dumpfiles.StatisticsMwRevisionProcessor;

public class WikimaniaExample {

    public static void main(String[] args) {
        ExampleHelpers.configureLogging();

        // Controller object for processing dumps:
        DumpProcessingController dumpProcessingController = new DumpProcessingController(
                "wikidatawiki");
        dumpProcessingController.setOfflineMode(true);

        // Example processor for item documents:
        WikimaniaDocumentProcessor documentProcessor = new WikimaniaDocumentProcessor();
        dumpProcessingController.registerEntityDocumentProcessor(
                documentProcessor, MwRevision.MODEL_WIKIBASE_ITEM, true);

        // Another processor for statistics & time keeping:
        dumpProcessingController.registerMwRevisionProcessor(
                new StatisticsMwRevisionProcessor("statistics", 10000), null,
                true);

        dumpProcessingController.processMostRecentMainDump();
        documentProcessor.storeResults();
    }
}
- The class WikimaniaDocumentProcessor used to compute average life expectancy and to print it to a CSV file:
package org.wikidata.wdtk.examples;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

import org.wikidata.wdtk.datamodel.interfaces.EntityDocumentProcessor;
import org.wikidata.wdtk.datamodel.interfaces.ItemDocument;
import org.wikidata.wdtk.datamodel.interfaces.PropertyDocument;
import org.wikidata.wdtk.datamodel.interfaces.Statement;
import org.wikidata.wdtk.datamodel.interfaces.StatementGroup;
import org.wikidata.wdtk.datamodel.interfaces.TimeValue;
import org.wikidata.wdtk.datamodel.interfaces.Value;
import org.wikidata.wdtk.datamodel.interfaces.ValueSnak;

public class WikimaniaDocumentProcessor implements EntityDocumentProcessor {

    long countItems = 0;
    long populationCount = 0;
    // Both arrays are indexed by birth year.
    final long[] lifeSpans = new long[2100];
    final long[] peopleCount = new long[2100];

    @Override
    public void processItemDocument(ItemDocument itemDocument) {
        this.countItems++;

        // Integer.MIN_VALUE marks "no year found".
        int birthYear = Integer.MIN_VALUE;
        int deathYear = Integer.MIN_VALUE;

        for (StatementGroup sg : itemDocument.getStatementGroups()) {
            // P569 is "birth date"
            if ("P569".equals(sg.getProperty().getId())) {
                for (Statement s : sg.getStatements()) {
                    if (s.getClaim().getMainSnak() instanceof ValueSnak) {
                        Value v = ((ValueSnak) s.getClaim().getMainSnak())
                                .getValue();
                        if (v instanceof TimeValue) {
                            birthYear = (int) ((TimeValue) v).getYear();
                            break;
                        }
                    }
                }
            }
            // P570 is "death date"
            if ("P570".equals(sg.getProperty().getId())) {
                for (Statement s : sg.getStatements()) {
                    if (s.getClaim().getMainSnak() instanceof ValueSnak) {
                        Value v = ((ValueSnak) s.getClaim().getMainSnak())
                                .getValue();
                        if (v instanceof TimeValue) {
                            deathYear = (int) ((TimeValue) v).getYear();
                            break;
                        }
                    }
                }
            }
        }

        // Only count people born from 1200 on whose birth year fits the
        // arrays; the upper bound avoids an ArrayIndexOutOfBoundsException
        // on items with implausibly late birth years.
        if (birthYear != Integer.MIN_VALUE && deathYear != Integer.MIN_VALUE
                && birthYear >= 1200 && birthYear < lifeSpans.length) {
            // Skip implausible life spans (non-positive or 130 years and more).
            if (deathYear > birthYear && deathYear - birthYear < 130) {
                lifeSpans[birthYear] += (deathYear - birthYear);
                peopleCount[birthYear]++;
            }
        }
    }

    @Override
    public void processPropertyDocument(PropertyDocument propertyDocument) {
        // Property documents are not needed for this analysis.
    }

    @Override
    public void finishProcessingEntityDocuments() {
        // Nothing to do here; results are written by storeResults().
    }

    /**
     * Writes one CSV line per birth year: year, average life span, number
     * of people counted.
     */
    public void storeResults() {
        try (PrintStream out = new PrintStream(new FileOutputStream(
                "results.csv"))) {
            for (int i = 0; i < lifeSpans.length; i++) {
                if (peopleCount[i] != 0) {
                    out.println(i + "," + (double) lifeSpans[i]
                            / peopleCount[i] + "," + peopleCount[i]);
                }
            }
        } catch (IOException e) {
            System.err.println("Could not write results.csv: "
                    + e.getMessage());
        }
    }
}
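The filtering and averaging done by WikimaniaDocumentProcessor can be tried out without downloading a dump. The following is a minimal sketch, not part of Wikidata Toolkit: the class LifeSpanAggregator and its methods are hypothetical names introduced here for illustration, and the thresholds (1200, 2100, 130) are copied from the demo code above. It also checks that the birth year fits the arrays, so out-of-range years are skipped rather than causing an index error.

```java
// Hypothetical stand-in for the counting logic of WikimaniaDocumentProcessor.
// It has no dependency on Wikidata Toolkit and takes plain years as input.
final class LifeSpanAggregator {

    static final int MIN_BIRTH_YEAR = 1200; // earliest birth year counted
    static final int YEAR_LIMIT = 2100;     // array size; birth years must be below this
    static final int MAX_LIFE_SPAN = 130;   // life spans must be below this

    // Both arrays are indexed by birth year.
    private final long[] lifeSpans = new long[YEAR_LIMIT];
    private final long[] peopleCount = new long[YEAR_LIMIT];

    /** Records one person; returns false if the pair of years is filtered out. */
    boolean record(int birthYear, int deathYear) {
        if (birthYear < MIN_BIRTH_YEAR || birthYear >= YEAR_LIMIT) {
            return false;
        }
        if (deathYear <= birthYear || deathYear - birthYear >= MAX_LIFE_SPAN) {
            return false;
        }
        lifeSpans[birthYear] += deathYear - birthYear;
        peopleCount[birthYear]++;
        return true;
    }

    /** Number of people recorded for the given birth year. */
    long count(int birthYear) {
        return peopleCount[birthYear];
    }

    /** Average life span for the given birth year (NaN if nobody was recorded). */
    double average(int birthYear) {
        return (double) lifeSpans[birthYear] / peopleCount[birthYear];
    }

    /** One CSV line in the same layout as results.csv: year,average,count. */
    String csvLine(int birthYear) {
        return birthYear + "," + average(birthYear) + "," + count(birthYear);
    }

    public static void main(String[] args) {
        LifeSpanAggregator aggregator = new LifeSpanAggregator();
        aggregator.record(1900, 1970); // counted: life span 70
        aggregator.record(1900, 1960); // counted: life span 60
        aggregator.record(1100, 1150); // skipped: born before 1200
        aggregator.record(1900, 2100); // skipped: life span of 200 years
        System.out.println(aggregator.csvLine(1900)); // prints "1900,65.0,2"
    }
}
```

Keeping this logic in a class of its own would also make it possible to unit-test the thresholds separately from the dump processing, which takes much longer to run.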
- Special requests
- Must leave on Sunday, so presentation should be on Friday or Saturday if at all possible
- This talk should be given later than Lydia's Wikidata keynote and not in parallel to any Wikidata talk in the Open Data track.
Interested attendees
If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with a hash and four tildes. (# ~~~~).
- --Sannita (talk) 22:14, 31 March 2014 (UTC)
- Bene* (talk) 14:05, 1 April 2014 (UTC)
- Tpt (talk) 14:27, 4 April 2014 (UTC)
- Promelior (talk) 14:07, 31 July 2014 (UTC)
- I will be your session host Edwardx (talk) 18:02, 31 July 2014 (UTC)
- Maximilianklein (talk) 15:45, 7 August 2014 (UTC)