List of Open MediaWiki Sites
For the Graffiti Network Project we needed a list of open MediaWiki sites that would allow us to store arbitrary data. Using the Yahoo! Web Search API and a Python script for Google search results, we found more than 22,000 publicly available MediaWiki installations. The key was to search for patterns that are specific to a newly installed site, such as "Configuration settings list" and "MediaWiki has been successfully installed", in combination with a random word from a dictionary file (usually installed on your system as /usr/share/dict/words).
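The query-generation step can be sketched roughly as follows. This is a minimal illustration rather than the original script: the names `build_queries` and `load_words` are hypothetical, and the code only builds query strings. Submitting them to a search API is left out.

```python
import random

# Phrases that appear on a freshly installed MediaWiki site (from the text above).
INSTALL_PHRASES = [
    "Configuration settings list",
    "MediaWiki has been successfully installed",
]


def load_words(path="/usr/share/dict/words"):
    """Read a dictionary file, one word per line, skipping blank lines."""
    with open(path, encoding="utf-8", errors="ignore") as handle:
        return [line.strip() for line in handle if line.strip()]


def build_queries(words, count, seed=None):
    """Return `count` queries, each a quoted install phrase plus one random word."""
    rng = random.Random(seed)  # a fixed seed makes the output reproducible
    queries = []
    for _ in range(count):
        phrase = rng.choice(INSTALL_PHRASES)
        word = rng.choice(words)
        queries.append(f'"{phrase}" {word}')
    return queries
```

Calling `build_queries(load_words(), 5)` would produce five query strings, each pairing one of the install phrases with a different dictionary word.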
Once we found a site, our crawler inspected it by probing certain URLs to determine whether it allowed anonymous edits or was protected by CAPTCHAs or the lame puzzle authentication plugin.
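A check of this kind might look something like the sketch below. The crawler's actual heuristics are not described here, so the marker strings are assumptions: `wpTextbox1` is the name of the textarea in MediaWiki's standard edit form, and `wpCaptchaWord` is the input name used by the ConfirmEdit extension's CAPTCHA. Customized sites may use neither. The function name `classify_edit_page` is hypothetical.

```python
def classify_edit_page(html):
    """Guess edit restrictions from the HTML returned for an edit form.

    Heuristics (assumptions, not the original crawler's rules):
    - an editable page contains a textarea named wpTextbox1 that is not read-only
    - a CAPTCHA challenge adds an input named wpCaptchaWord
    """
    has_textarea = 'name="wpTextbox1"' in html
    # Protected pages show the same textarea in a read-only "view source" form.
    read_only = has_textarea and "readonly" in html
    return {
        "allow_edits": has_textarea and not read_only,
        "require_captcha": 'name="wpCaptchaWord"' in html,
    }
```

A check like this only sees the form itself, so it cannot detect a CAPTCHA that appears only after an edit has been submitted.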
I am providing our entire list of sites collected from December 2008:
- July 11th, 2011 - Two years have passed so the data link is back online.
- May 14th, 2009 - I've been asked to remove the data set for now. Please email me if you would like a copy.
Web Crawler Source
The format of the file is as follows. Some fields are blank because we were unable to find the proper information. In some cases our crawler incorrectly determined that a site was open to anonymous edits when the site actually presented a CAPTCHA after an edit was made. We also found that some sites used a modified version of MediaWiki that had been integrated into another CMS, such as phpBB, and thus did not use the default MediaWiki registration form.
- url: The base url for visiting the site.
- title: The title of the site
- generator: The meta "generator" of the software platform (often empty)
- description: The description from the meta tag (often empty)
- allow_edits: Does the site allow pages to be edited
- allow_register: Does the site allow new accounts to be registered
- require_login: Does the site require you to log in before making edits
- require_captcha: Does the site require you to complete a visual CAPTCHA before registering an account
- require_puzzle: Does the site require you to solve a simple puzzle before registering an account
- require_email: Does the site require you to verify a valid email address before registering an account
- page_url: Base URL for the site including index.php. Use this to construct URLs
- login_url: The login URL for the site
- register_url: The Create New Account URL for the site
- last_modified: The timestamp of the site's last change as of the crawl date
- created: The date that we discovered the site with our crawler
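As an illustration of constructing URLs from the page_url field, here is a minimal sketch. It assumes page_url ends in index.php, as described above, and relies on MediaWiki's standard `title` and `action` query parameters. The helper name `wiki_url` and the example.org address are hypothetical.

```python
from urllib.parse import urlencode


def wiki_url(page_url, title, **params):
    """Build a MediaWiki URL from a page_url that ends in index.php."""
    # MediaWiki treats underscores and spaces in titles as equivalent.
    query = {"title": title.replace(" ", "_")}
    query.update(params)
    return page_url + "?" + urlencode(query)


print(wiki_url("http://example.org/w/index.php", "Main Page", action="edit"))
# → http://example.org/w/index.php?title=Main_Page&action=edit
```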