I have heard, both in meatspace and online, that Wikisource is mysterious and hard to understand. I don’t agree with this but I’ve been involved with the project for years now so I may have just been institutionalised.

While there are many Wikimedians who do not recognise the project,[1] I think most get the gist that it is a digital library. I’ve heard “Like Project Gutenberg but on a wiki” and “Wikipedia library”. Neither is ideal but they are close enough. Wikisource is actually one of the largest projects (about third in page count after Wikipedia and Wiktionary, probably fourth overall if Commons is bodged in as second), so it can be assumed that it is less obscure than Wikibooks et al. Nevertheless the confusion seems to persist.

There are other projects to promote and document Wikisource, so I thought I would try a different tack and explain by example.

One of my pet projects is the transcription of pulp magazines. So far, this includes some issues of Amazing Stories, Avon Fantasy Reader and Weird Tales. Actually, I intend this to be a slightly wider project but it’s mainly focused on pulps for the moment. Hopefully it will one day include the Boy’s Own Paper, fiction digests, 1950s “sweats” magazines and similar popular entertainment. My rationale is that a lot of this material is “pseudo-lost”; it is in the public domain and so technically belongs to everyone but remains unavailable, not because anyone is sequestering it but because few people are trying to make it universally available. Some libraries keep collections but these are relatively few in number and not widely accessible.

I own some pulp magazines and I have tried scanning a few. This involved building my own low-budget V-cradle scanner[2] and the results were mixed. Fortunately, other people have already scanned pulps and their results are available on eBay. Actually, I am aware that scans are available online but that gets into some murky, grey areas of taking something without giving anything in return and I wouldn’t feel comfortable.[3] Commerce is straightforward. I will get back to scanning my own collection eventually but third-party scans will more than suffice for now.

The scans need to be processed a bit and sometimes redacted a bit too. In doing all of this, I have acquired more knowledge than any sane person really wants or needs about US copyright law (and there are still vast gaps and obscure special cases I do not yet comprehend).[4] Sometimes a whole pulp magazine is still under copyright; often only a story or two is, and the rest is public domain. Once suitably modified, the scans need to be turned into something useful. The quick and cheap way of doing so is to upload them to the Internet Archive and download their derived version (which can then be reuploaded to Commons).

Proofreading the individual pages is the bulk of the transcription work. This takes time but it is usually simple enough. A lot depends on the quality of the individual scans but the Internet Archive has pretty good OCR software. There are still errors to be corrected and line feeds to be removed but most of the text tends to be more-or-less intact and legible. Some really poor OCR’d text can appear to be nothing more than random hexadecimal strings at times.
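
The line-feed removal is mechanical enough to sketch in a few lines of Python. This is only an illustration, assuming plain OCR text with hard line breaks and blank lines between paragraphs; the function name `unwrap_ocr` is hypothetical and is not part of any Wikisource or Internet Archive tooling.

```python
import re


def unwrap_ocr(text: str) -> str:
    """Join hard-wrapped OCR lines into paragraphs (illustrative sketch only)."""
    # Normalise Windows line endings so only "\n" needs handling.
    text = text.replace("\r\n", "\n")
    # Rejoin words split across a line break: "pro-\ncessed" -> "processed".
    # Caveat: this also merges genuine hyphenated compounds that happen to
    # break at the hyphen, so the result still needs a human proofreader.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; keep those breaks.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, collapse line feeds and runs of spaces to one space.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = "The scans need to be pro-\ncessed a bit\nand redacted.\n\nNext para-\ngraph here.\n"
    print(unwrap_ocr(sample))
```

A script like this does nothing for misread characters, which still have to be checked against the scan.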

Illustrations can take a little time with GIMP but I’ve become familiar with the kind of material with which I’m working at the moment. All are monochrome and line drawings are common. Large, complicated illustrations can take a lot of time to clean up but others just involve messing around with levels and alpha channels.

Sometimes people actually miss the last step of the transcription process: transcluding the proofread text from the Page namespace to the mainspace (it’s like having lots of templates, although there’s a tag that does it all in one). It’s pretty easy normally. Of course, my project complicates it a tad because I like to include the period adverts (seeing the fiction and articles within the context of their original setting is part of the project to my mind)[5] and sometimes judgement calls are needed on splitting things between subpages or the best way to replicate elements of the original.
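
To make the "one tag" concrete, here is a minimal sketch that builds it as a string. It assumes the tag meant is the `<pages>` tag of the ProofreadPage extension, with its `index`, `from` and `to` attributes; the helper name `pages_tag` and the file name in the example are hypothetical.

```python
def pages_tag(index: str, first: int, last: int) -> str:
    """Build a <pages> transclusion tag for a run of proofread pages.

    index: name of the Index page (the scan file), e.g. a .djvu or .pdf file
    first, last: first and last page numbers of the scan to transclude
    """
    if first > last:
        raise ValueError("first page must not come after last page")
    return f'<pages index="{index}" from="{first}" to="{last}" />'


if __name__ == "__main__":
    # Hypothetical file name, for illustration only.
    print(pages_tag("Example Magazine.djvu", 12, 20))
```

The resulting line would go on the mainspace subpage for a single story or article, which is where the judgement calls about splitting come in.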

Some of the material I’ve transcribed is widely available anyway. There are, for example, some Howard and Lovecraft pieces in the Weird Tales transcriptions that are cheaply available in many print collections and elsewhere on the internet. Preserving a copy of these texts as they were in the pulps is important but one of my favourite parts is making available the lesser-known pieces that accompany them. Some of these works may never have been republished since the initial pulp printing and were, for all practical purposes, lost works for most people. Letters pages are a fascinating source of contemporary opinions and are likewise rarely republished (if ever).

As far as I know, no publisher has re-released any of my transcriptions, not that it would be easy to tell if they didn’t want to attribute them to Wikisource.[6] Nor am I aware of any translations on other Wikisources. Both remain possibilities. I’ve noticed the occasional familiar-looking text on blogs, however, so they are getting out slowly. Along the same lines, the first issue of Amazing Stories is scheduled to be the featured text in May. Poe, Verne and Wells are in no danger of being forgotten but England, Hall and Wertenbaker could use a little extra attention.

None of this really interacts with Wikipedia much, short of an occasional writer’s biography proving useful, which means conversation at wikimeets and elsewhere can be a bit limited. In a way, I think that may bring us full circle to people not understanding the arcane mysteries of Wikisource.


[1] There seems to be a low level of confusion with Wikipedia’s WikiProject Citation cleanup and/or WikiProject Fact and Reference Check.

[2] The V-cradle is made of old cardboard boxes and duct tape. I have considered making an upgraded V-Cradle v.2.0, which will make the technological leap forward to Lego. The “scanner” is a digital camera.

[3] Other material falls within the same area. While it is technically legal to scan a public domain work from, say, a charity’s publication, it doesn’t feel right.

[4] Having caught glimpses of some of these unspeakable occurrences in my wanderings, I am left with the impression that no one but a specialist IP lawyer should ever attempt to engage such Eldritch Things (which are, no doubt, both rugose and squamous). Down this path only madness lies.

[5] At least, a reasonable facsimile of the original setting. There are limits on typography when reproducing works in Wiki-HTML, which I prefer to do whenever possible, but very complicated adverts may end up as image files.

[6] Wikisource, like most of Wikimedia, is hosted under a Creative Commons licence that requires attribution. However, as these works are already in the public domain, imposing Creative Commons licensing and any associated restrictions would actually be copyfraud. The act of transcription does not grant any protection under US law. Besides which, the attribution would be to the original author, not the project or transcribers.