Fri, 08 Feb 2013
Notes from attempting to despam a wiki with git-remote-mediawiki
I just tried to despam a MediaWiki instance with git-remote-mediawiki.
The idea is as follows:
- Use git-remote-mediawiki to clone the wiki into a local git repo.
- Make a list of bad users, either by skimming Special:RecentChanges, or by some other more automated means. For example, use 'git log' to list every edit made since the last time you felt the wiki was clean:
git log --since='Wed Dec 5 22:57:06 2012 +0000'
(You can process that with either 'grep ^Author' and so on, or you can use an overwrought Python script I wrote.)
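As a rough illustration of the 'grep ^Author' route (this is not the script mentioned above, just a minimal sketch), the following counts commits per author from the default 'git log' output. The function names are made up for this example, and it assumes the usual "Author: Name <email>" line format.

```python
import subprocess
from collections import Counter


def authors_from_log(log_text):
    """Count commits per author in default `git log` output."""
    counts = Counter()
    for line in log_text.splitlines():
        # Commit message bodies are indented, so only real header
        # lines start with "Author: " at column zero.
        if line.startswith("Author: "):
            # Header looks like "Author: Name <email>"; keep the name.
            name = line[len("Author: "):].rsplit(" <", 1)[0].strip()
            counts[name] += 1
    return counts


def recent_authors(since, repo="."):
    """Run `git log --since=...` in `repo` and count its authors."""
    out = subprocess.run(
        ["git", "log", "--since=" + since],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout
    return authors_from_log(out)
```

Calling recent_authors('Wed Dec 5 22:57:06 2012 +0000').most_common() inside the clone would list the busiest editors first, which is usually where the spammers are.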
- Get a list of their commits:
git log --author=bad_user_1 --author=bad_user_2 --pretty="format:%H"
Here's where things start to go wrong.
You might try to revert them all:
git log --author=bad_user_1 --author=bad_user_2 --pretty="format:%H" | xargs -n1 git revert
That works great until the first merge conflict.
So then you write a wrapper script that does "git revert $1 || git revert --abort". Even then, you can only revert the first few hundred (out of ~800) spam edits, because one of the commits causes a conflict when you try to revert it.
Why a conflict? I suspect it's because there are spam edits that I neglected to include in the revert stream. (Update: The conflict was actually a real conflict -- some kind soul on the web had already reverted a bunch of the spam edits!)
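For what it's worth, here is a sketch of that wrapper idea in Python, extended to record which commits it had to skip so they can be inspected by hand afterwards. It is not the script I actually ran. The 'run' parameter is only there so the logic can be exercised without a real repository, and '--no-edit' is added so git does not open an editor for every revert.

```python
import subprocess


def revert_all(commits, run=subprocess.run):
    """Try to revert each commit; abort and record any that fail."""
    reverted, skipped = [], []
    for sha in commits:
        result = run(["git", "revert", "--no-edit", sha])
        if result.returncode == 0:
            reverted.append(sha)
        else:
            # Undo the half-applied revert so the next one starts
            # from a clean tree. If the failure was not a conflict,
            # this abort is a harmless no-op that exits non-zero.
            run(["git", "revert", "--abort"])
            skipped.append(sha)
    return reverted, skipped
```

The skipped list is the interesting output: in my case it would have pointed straight at the edits that someone else had already reverted.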
In our case, there are fairly few pages getting spammed, so it'd be simpler to 'git log' the pages we care about and revert back to the commit IDs that look clean. 'git revert' could still be useful in the case of tangled history, but (apparently) there is a limit to how useful it can be, anyway.
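A sketch of that per-page approach, under some assumptions: you have already picked a clean commit for each page by reading 'git log -- <file>', and the pages live in the clone as files such as Main_Page.mw (the file names and commit IDs you pass in are yours to supply; the ones in the usage note are placeholders). It checks each file out at its clean commit and records the result as one new commit.

```python
import subprocess


def restore_commands(clean_commits):
    """Build git commands that reset each page file to its clean commit.

    `clean_commits` maps a file path in the clone to the commit ID
    whose version of that file should be restored.
    """
    cmds = []
    for path, sha in sorted(clean_commits.items()):
        # `git checkout <commit> -- <path>` replaces just that file.
        cmds.append(["git", "checkout", sha, "--", path])
    cmds.append(["git", "commit", "-m",
                 "Restore pages to last clean revisions"])
    return cmds


def restore_pages(clean_commits, repo="."):
    """Run the commands; raises if any step fails, including
    `git commit` when nothing actually changed."""
    for cmd in restore_commands(clean_commits):
        subprocess.run(cmd, cwd=repo, check=True)
```

For example, restore_pages({"Main_Page.mw": "abc123"}) would put Main_Page.mw back to its state at commit abc123; pushing the result back to the wiki is a separate step.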
Oh, also:
It'd be useful to be able to create MediaWiki dump files from git-remote-mediawiki exports. That way, I could use 'git rebase -i' to clean up history. (That would break links *unless* the MediaWiki revision IDs somehow stayed constant for the revisions with the same content. Maybe that's feasible. Actually, the simplest way might be to write a tool that filters the dump file itself, rather than exporting straight from git-remote-mediawiki.)
Also also, I fixed a format string bug in git mergetool, one of my favorite little pieces of git.
P.S. In this corpus, of the IP address editors (i.e., not logged in), 0 (of 16) are spammers. About 80% of the logged-in editors are spammers. (Admittedly our wiki does require you to log in if you are posting new URLs to a page.)
Update: git-remote-mediawiki is much faster if you run it with low latency to the MediaWiki server in question. It could probably be adjusted to make fewer API calls, and to make more of them in parallel.