Submissions/Semi-automated Artificial Intelligence to assist editing: An opportunity for Wikimedia sites

From Wikimania 2014 • London, United Kingdom

After careful consideration, the Programme Committee has decided not to accept the below submission at this time. Thank you to the author(s) for participating in the Wikimania 2014 programme submission; we hope to still see you at Wikimania this August.

Submission no. 5041
Title of the submission
  • Semi-automated Artificial Intelligence to assist editing: An opportunity for Wikimedia sites
Type of submission (discussion, hot seat, panel, presentation, tutorial, workshop)
  • Presentation
Author of the submission
  • とある白い猫
E-mail address
  • とある白い猫
Country of origin
  • Residing in Brussels, Belgium
Affiliation, if any (organisation, company etc.)
  • Volunteer on Wikimedia websites (Wikipedia, Commons, etc.)
Personal homepage or blog
  • none
Abstract (at least 300 words to describe your proposal)
  • Technology, Interface & Infrastructure
Length of session (if other than 30 minutes, specify how long)
30 minutes
Will you attend Wikimania if your submission is not accepted?
Unfortunately no. My circumstances have changed, so I will not be able to attend if the submission is not accepted.
Slides or further information (optional)
Slides: DRAFT (3.39 MB)

Special requests


Artificial Intelligence

Breakdown of content on
Wikipedia main namespace
(as of 5 February 2012)
Due to bugzilla:61813, these numbers cannot be updated at this point
English 9,272,208 3,933,153 2,387,906 0.7952 0.6222
German 2,337,921 1,383,695 1,014,441 0.6974 0.5770
French 2,410,253 1,226,669 517,845 0.8231 0.7032
Dutch 1,474,132 1,032,487 217,583 0.8714 0.8259
Italian 1,350,753 909,979 428,606 0.7591 0.6798
Spanish 2,190,060 878,116 752,814 0.7442 0.5384
Polish 1,145,943 885,712 327,161 0.7779 0.7303
Russian 1,733,689 835,022 373,430 0.8228 0.6910
Japanese 1,279,097 803,157 205,484 0.8616 0.7963
Portuguese 1,283,345 717,771 419,129 0.7538 0.6313
Breakdown of content on
Wikimedia Commons
(as of 23 February 2014)
Filetype, number of files (in use), number of deleted files
jpeg 17,433,039 1,927,293
png 1,245,673 347,342
svg+xml 792,673 56,948
ogg 334,981 28,578
pdf 228,677 23,478
gif 146,747 167,973
tiff 130,910 7,914
vnd.djvu 31,830 1,758
webm 5,332 1,425
midi 3,505 448
x-flac 897 12,635
x-xcf 553 209
wav 220 119
Other (53 filetypes) - 26,235
Total 20,355,037 2,601,612

Wikimedia Commons

  • There are 114,698 galleries on Commons
  • There are 20,355,037 files on Commons
  • There are 2,601,612 deleted files on Commons
    • ~11.3327% of all uploaded files (existing plus deleted) have been deleted
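The percentage in the list above can be reproduced with a short calculation. This is a minimal sketch, assuming the share is taken over existing plus deleted files, which is the reading that matches the stated figure; the function name is illustrative only.

```python
# Figures taken from the Commons list above.
existing_files = 20_355_037
deleted_files = 2_601_612


def deleted_share(existing: int, deleted: int) -> float:
    """Fraction of all uploads (existing + deleted) that have been deleted."""
    return deleted / (existing + deleted)


share = deleted_share(existing_files, deleted_files)
print(f"{share:.4%}")
```

Dividing by the existing files alone would give a noticeably larger figure, so the denominator matters when quoting this number.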

Statistics by DaB. & Betacommand

Artificial Intelligence (AI) is a branch of computer science in which machines (agents, computers) process information to find patterns and relationships, then use those patterns to predict how to handle future data. Its use has grown particularly over the past decade, with applications ranging from search engines to space exploration.

Since their creation, Wikipedia and other Wikimedia projects have relied on volunteers to handle all tasks, including mundane ones, through crowdsourcing. With the exponential increase in the amount of data and with improvements in Artificial Intelligence, we are now able to delegate mundane tasks to machines to a certain degree. Wikimedians currently deal with an overwhelming amount of content; the accompanying tables show just how much.

A key problem with Artificial Intelligence research is that researchers are often not experienced Wikimedians, so they do not realize the potential of tools and data that Wikimedians know and take for granted. For example, few people outside the circles of experienced Wikimedians know that images deleted on Wikimedia projects are not really deleted but just hidden from public view. One researcher I talked to called the deleted image archive of Commons a "gold mine". Indeed, in any machine learning task, content that is already classified (on Commons, this could very well be "wanted" and "unwanted" content) makes supervised learning possible. A system could use deleted content, deletion summaries and the content of deleted image description pages to determine whether other, similar unwanted content exists that may need to be deleted, or whether newer uploads resemble deleted content. This is just one of many examples where artificial intelligence can assist editing.
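The supervised learning idea can be illustrated with a small text classifier. This is only a sketch under stated assumptions: it uses a multinomial Naive Bayes model with Laplace smoothing (one possible technique, not one named in this proposal), and the descriptions and the "kept"/"deleted" labels are made-up toy data, not real Commons records.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial Naive Bayes classifier with Laplace smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: file descriptions and their outcome.
descriptions = [
    "own work photo of a mountain landscape",
    "own work photo of a city street",
    "screenshot copied from movie website",
    "logo copied from company website",
]
outcomes = ["kept", "kept", "deleted", "deleted"]

model = NaiveBayes()
model.fit(descriptions, outcomes)
print(model.predict("photo copied from news website"))
print(model.predict("own work photo of a river"))
```

A real system would need far more data, image features in addition to text, and a human reviewer making the final decision, in keeping with the semi-automated approach described here.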

To expand on the idea, tools such as Copyscape and TinEye are not customized to serve Wikimedia projects specifically. As general-purpose tools their accuracy is limited, which in turn limits their usefulness for the needs of Wikimedia projects. Innovative use of AI methods such as information retrieval, text mining and image retrieval can lead to more advanced tools.
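As a rough illustration of the text mining side, copied text can be flagged by comparing overlapping word sequences. This is a minimal sketch assuming word 3-gram "shingles" and Jaccard similarity; it is a common baseline, not how Copyscape or any existing Wikimedia bot works, and the function names are illustrative.

```python
def shingles(text, n=3):
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


article = "the quick brown fox jumps over the lazy dog"
external = "the quick brown fox leaps over the lazy dog"

similarity = jaccard(shingles(article), shingles(external))
print(f"similarity: {similarity:.2f}")
```

A score close to 1 suggests near-verbatim copying; any threshold for flagging a page for human review would have to be tuned on real examples.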

CLEF 2011

Report on CLEF 2011: Participation:Presenting at PAN Lab of CLEF 2011/Report

The CLEF (Cross-Language Evaluation Forum) conference has various tracks on Artificial Intelligence covering text, image and even audio mining. The conference is divided into presentations and workshops. Each workshop track has sub-tasks that branch into more specialized fields, in which competing implementations are ranked. The diagram to the right shows an example of one of the many workshops.

CLEF 2011 drew 174 registered participants and 52 students, 226 people in total, from 29 countries across 5 continents. CLEF draws on scientists worldwide, even though it is known as more of a European conference. Unlike its more business-oriented counterparts, CLEF is research-oriented, which makes its goals compatible with non-profit projects and organizations.

Structure of PAN

I attended CLEF 2011 as a participant, supported by a grant from Wikimedia Deutschland. Aside from presenting my own research, I spent the remainder of my time analyzing the potential the conference could have for Wikimedia projects, Wikipedia and Commons in particular. Admittedly, I was quite surprised that a significant majority of researchers, as well as keynote speakers, stated that they had used Wikimedia projects as a source of raw data for research at some point, if not for their current topic of research. Such research can generate innovative new tools that handle mundane tasks automatically or semi-automatically, so that human editors have more time left for other tasks.

It is my belief that with little effort CLEF could become an indispensable asset for Wikimedia Foundation projects, as researchers taking part in CLEF already use Wikimedia projects. The PAN and ImageCLEF labs in particular could help with issues wikis face, such as automated identification of copyrighted material (text and images), automated tagging of images (for example, for the image filter already approved by the Board of Trustees and by the community through the referendum) and semi-automated categorization of images. This in turn would leave human editors more time for other, more creative tasks. It is also worth noting that the Foundation had practically no presence at the CLEF 2011 conference, even though Foundation-run projects dominated discussions in practically all of the tracks.

Some Artificial Intelligence ideas for the presentation

  • General
    • Customizable image filter for individuals (where users can train the filter with category names of their own choosing for example)
  • Wikipedia
    • Copyright/Plagiarism Detection: Semi-automatic identification of copyrighted content stolen from external sources
      • A large proportion of copyright violations are automatically blanked and tagged by EN:User:CorenSearchBot on the English language Wikipedia.
    • Author Identification: Semi-automatic identification of returning banned users as well as meatpuppets
    • Vandalism Detection: Semi-automatic identification of vandalism
      • A large majority of vandalism on the English language Wikipedia is automatically screened out by the edit filters or reverted by EN:User:ClueBot_NG.
    • Disambiguation: Semi-automatic identification of disambiguation links to link them to the proper page
    • Category Identification: Semi-automatic categorization of articles
    • Correlate real life events: Semi-automatic identification of content for current events
  • Wikisource
    • OCR for wiki: OCR developed to assist importing scanned content to Wikisource, using existing corrections as training data to improve the OCR.
  • Wikimedia Commons
    • Unwanted: Semi-automatic identification of unwanted content (copyright violations, vandalism/trolling oriented uploads, non-project scope uploads)
    • Controversial: Semi-automatic identification of controversial content (nudity, violence)
    • Categorization: Semi-automatic categorization of images
    • Plant identification: Semi-automatic identification of plant features to assist in species identification
  • Wikimedia servers
    • Performance: Performance analysis to assess how well each server is doing, predict server problems before they become critical and identify their cause
    • Cyber Defence: Methods such as anomaly detection to identify intrusion activity on the servers
  • Wikimedia Foundation
    • Sentiment analysis of social media and the web: Data mining to identify sentiments towards the foundation itself and towards foundation decisions as a form of feedback
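To make one of the ideas above more concrete, the anomaly detection item for Wikimedia servers could start from something as simple as a z-score check on a metric. This is a minimal sketch under assumptions: the latency values are invented, the function name and the threshold of 2.5 are illustrative, and production systems would likely use more robust methods.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=2.5):
    """Return indices of values more than `threshold` sample standard
    deviations away from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Hypothetical response times in milliseconds; the last one is a spike.
latencies = [100, 102, 98, 101, 99, 100, 103, 97, 100, 500]
print(find_anomalies(latencies))
```

Because a single large outlier also inflates the mean and the standard deviation, small samples need a modest threshold; median-based statistics are a common alternative when spikes are frequent.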

Interested attendees

If you are interested in attending this session, please sign with your username below. This will help reviewers to decide which sessions are of high interest. Sign with a hash and four tildes. (# ~~~~).

  1. --Base (talk) 08:06, 20 February 2014 (UTC)
  2. --Xelgen (talk) 23:40, 13 March 2014 (UTC)
  3. Gnom (talk) 19:34, 20 March 2014 (UTC)
  4. MichaelMaggs (talk) 10:38, 21 March 2014 (UTC)
  5. Rillke (talk) 13:57, 21 March 2014 (UTC) - Are you sure you only need 30 minutes? Will you show a "demo"?
    • My intention here is to primarily introduce different AI applications to problems on Wikimedia sites. I can extend the length based on demand here. Perhaps a 30 minute presentation and a Q&A session later OR perhaps it could be better for people to ask questions separately after the presentation where there wouldn't be a time limit. I'd be willing to accommodate either way. -- とある白い猫 chi? 19:26, 21 March 2014 (UTC)
  6. Ocaasi (talk) 01:49, 8 April 2014 (UTC)
  7. Micru (talk) 14:20, 15 April 2014 (UTC) - Interested in OCR
  8. EpochFail (talk) 00:20, 10 June 2014 (UTC)