Wiki log jam

Every so often a log jam appears to happen in my life: meatspace or virtual. I can’t point to any one issue but not a lot gets done when this happens.

On Wikisource, this means I have not achieved a lot in a month or so. Perhaps worse, each task to which I have committed myself is now getting in the way of every other task, further diluting whatever effort I make. I’ve tried to cut down on tasks but more keep coming up and I really want to do a lot of them.

Wikisource, and the other sister projects in the family (‘pedia, ‘voyage et al), appear to be quite bad for this. It’s hard to keep focussed sometimes.

2013 was going to be my Year of Amazing Stories, but that is almost certainly bumped to 2014 now. I need to finish Weird Tales first. I have most of one issue left (of those available) and bits and pieces across the others, because I used to do a page or two when I only had a little spare time, so I skipped some of the more complicated stuff. Especially the adverts.

Two months left to do all that, if I want to push Amazing Stories from the start of next year. And I want to complete all the acts from Elizabeth I’s first regnal year. And I have a Confederate roster to reformat and reassemble. And I have several national anthems to code out in Score. And some policies to write. And Lua to learn. And…

Common wikisource

As a follow-up to my ramblings about Multilingual Wikisource: I have heard some people ask why all Wikisources are not Multilingual Wikisource, like Commons. (I have even heard “Why isn’t Wikisource part of Commons?”)

The latter is easily answered. Aside from the fact that Wikisource needs specific technology to function, it has a different scope and mission to Commons, which would clash if both were part of the same project.

There are many reasons for the former. I think the original reason had something to do with right-to-left text, which has been solved by now. Others still stand, however.

Disambiguation would be a nightmare, for example. The Bible is complicated enough in English on just one project. Multiple editions in each of hundreds of languages would be ridiculous. This could be solved with, say, namespaces but there are a finite number of namespaces in the MediaWiki software. Besides, the difference between a namespace and a language subdomain is negligible from a technological point of view. The same goes for disambiguation for that matter. A language subdomain is just a bigger version of the concept.

On a different tangent, while Commons is technically multilingual—and a lot of work has gone into supporting that—it is still predominantly English. Community communication is overwhelmingly done in English, English is the default for categories and templates, and so forth. Some grasp of English is often necessary to function on Commons. Language subdomains allow the monolingual (and the multilingual but not anglophone) Wikimedians to take part too, which is more important in curating a library than a media depository.

Obviously, now that we actually have language subdomains, we also have the problems of different cultures and communities on the different projects. Italian doesn’t allow translation, German doesn’t allow non-scans, French doesn’t allow annotation; while some languages, like English and Spanish, are pretty promiscuous in their content. There are likely to be many more, seemingly trivial, quirks that are at odds across different projects. If anyone ever did attempt unification, these communities would clash and conflict all over the place, probably ending in either mutually assured destruction or a very small surviving user base.

You may as well ask why Wikipedia bothers with language subdomains when it could just be Multilingual Wikipedia, like Commons.

A peek at the arcane mysteries of Wikisource

I have heard, both in meatspace and online, that Wikisource is mysterious and hard to understand. I don’t agree with this but I’ve been involved with the project for years now so I may have just been institutionalised.

While there are many Wikimedians who do not recognise the project,[1] I think most get the gist that it is a digital library. I’ve heard “Like Project Gutenberg but on a wiki” and “Wikipedia library”. Neither is ideal but they are close enough. Wikisource is actually one of the largest projects (about third in page count after Wikipedia and Wiktionary, probably fourth overall if Commons is bodged in as second), so it can be assumed that it is less obscure than Wikibooks et al. Nevertheless the confusion seems to persist.

There are other projects to promote and document Wikisource, so I thought I would try a different tack and explain by example.

One of my pet projects is the transcription of pulp magazines. So far, this includes some issues of Amazing Stories, Avon Fantasy Reader and Weird Tales. Actually, I intend this to be a slightly wider project but it’s mainly focused on pulps for the moment. Hopefully it will one day include the Boy’s Own Paper, fiction digests, 1950s “sweats” magazines and similar popular entertainment. My rationale is that a lot of this material is “pseudo-lost”; it is in the public domain and so technically belongs to everyone but remains unavailable, not because anyone is sequestering it but because few people are trying to make it universally available. Some libraries keep collections but these are relatively few in number and not widely accessible.

I own some pulp magazines and I have tried scanning a few. This involved building my own low-budget V-cradle scanner[2] and the results were mixed. Fortunately, other people have already scanned pulps and their results are available on eBay. Actually, I am aware that scans are available online but that gets into some murky, grey areas of taking something without giving anything in return and I wouldn’t feel comfortable.[3] Commerce is straightforward. I will get back to scanning my own collection eventually but third-party scans will more than suffice for now.

The scans need to be processed a bit and sometimes redacted a bit too. In doing all of this, I have acquired more knowledge than any sane person really wants or needs about US copyright law (and there are still vast gaps and obscure special cases I do not yet comprehend).[4] Sometimes whole pulp magazines are still under copyright; often only a story or two are and the rest is public domain. Once suitably modified, the scans need to be turned into something useful. The quick and cheap way of doing so is to upload them to the Internet Archive and download their derived version (which can then be reuploaded to Commons).

Proofreading the individual pages is the bulk of the transcription work. This takes time but it is usually simple enough. A lot depends on the quality of the individual scans but the Internet Archive has pretty good OCR software. There are still errors to be corrected and line feeds to be removed but most of the text tends to be more-or-less intact and legible. Some really poor OCR’d text can appear to be nothing more than random hexadecimal strings at times.
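
The line-feed removal mentioned above can be roughed out in a few lines of Python. This is only a minimal sketch of the idea, not the tool I or anyone on Wikisource necessarily uses; the function name `unwrap_ocr_text` is made up for illustration, and it assumes that blank lines in the OCR output mark paragraph breaks.

```python
import re


def unwrap_ocr_text(raw: str) -> str:
    """Join hard-wrapped OCR lines into paragraphs.

    Assumption: blank lines separate paragraphs in the OCR output.
    """
    paragraphs = re.split(r"\n\s*\n", raw.strip())
    cleaned = []
    for para in paragraphs:
        # Re-join a word split by a hyphen at the end of a line
        # ("for-\nward" becomes "forward").
        para = re.sub(r"(\w)-\n(\w)", r"\1\2", para)
        # Turn every remaining line feed into a single space.
        para = re.sub(r"\s*\n\s*", " ", para)
        cleaned.append(para.strip())
    return "\n\n".join(cleaned)
```

One caveat: a genuinely hyphenated compound that happens to break at the end of a line will lose its hyphen, so the result still needs a human proofreader.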

Illustrations can take a little time with GIMP but I’ve become familiar with the kind of material with which I’m working at the moment. All are monochrome and line drawings are common. Large, complicated illustrations can take a lot of time to clean up but others just involve messing around with levels and alpha channels.

Sometimes people actually miss the last step of the transcription process: transcluding the proofread text from the Page namespace to the mainspace (it’s like having lots of templates, although there’s a tag that does it all in one). It’s pretty easy normally. Of course, my project complicates it a tad because I like to include the period adverts (seeing the fiction and articles within the context of their original setting is part of the project to my mind)[5] and sometimes judgement calls are needed on splitting things between subpages or the best way to replicate elements of the original.

Some of the material I’ve transcribed is widely available anyway. There are, for example, some Howard and Lovecraft pieces in the Weird Tales transcriptions that are cheaply available in many print collections and elsewhere on the internet. Preserving a copy of these texts as they were in the pulps is important but one of my favourite parts is making available the lesser-known pieces that accompany them. Some of these works may never have been republished since the initial pulp printing and were, for all practical purposes, lost works for most people. Letters pages are a fascinating source of contemporary opinions and are likewise rarely republished (if ever).

As far as I know, no publisher has re-released any of my transcriptions, not that it would be easy to tell if they didn’t want to attribute it to Wikisource.[6] Nor am I aware of any translations on other Wikisources. Both still remain possibilities. I’ve noticed the occasional familiar-looking text on blogs, however, so they are getting out slowly. Along the same lines, the first issue of Amazing Stories is scheduled to be the featured text in May. Poe, Verne and Wells are in no danger of being forgotten but England, Hall and Wertenbaker could use a little extra attention.

None of this really interacts with Wikipedia much, short of an occasional writer’s biography proving useful, which means conversation at wikimeets and elsewhere can be a bit limited. In a way, I think that may bring us full circle to people not understanding the arcane mysteries of Wikisource.


[1] There seems to be a low level of confusion with Wikipedia’s WikiProject Citation cleanup and/or WikiProject Fact and Reference Check.

[2] The V-cradle is made of old cardboard boxes and duct tape. I have considered making an upgraded V-cradle v.2.0, which will make the technological leap forward to Lego. The “scanner” is a digital camera.

[3] Other material falls within the same area. While it is technically legal to scan a public domain work from, say, a charity’s publication, it doesn’t feel right.

[4] Having caught glimpses of some of these unspeakable occurrences in my wanderings, I am left with the impression that no one but a specialist IP lawyer should ever attempt to engage such Eldritch Things (which are, no doubt, both rugose and squamous). Down this path only madness lies.

[5] At least, a reasonable facsimile of the original setting. There are limits on typography when reproducing works in Wiki-HTML, which I prefer to do whenever possible, but very complicated adverts may end up as image files.

[6] Wikisource, like most of Wikimedia, is hosted under a Creative Commons licence that requires attribution. However, as these works are already in the public domain, imposing Creative Commons licensing and any associated restrictions would actually be copyfraud. The act of transcription does not grant any protection under US law. Besides which, the attribution would be to the original author, not the project or transcribers.


I didn’t notice it when I was proofreading the page.

I didn’t notice it when I was transcluding the page.

I did notice it somewhere beneath Belgravia when I was re-reading it on my Bebook One.

Typos or, more accurately, “scannos” (uncorrected OCR errors), are a constant problem. At least for me. Despite all the measures to prevent them, I still find some later, on my third re-read of the material I proofread in the first place.

In many ways, proofreading is never really complete. There is always the chance that you missed something, however many times you read through it.
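
One measure that can help is a crude check for known scannos before re-reading. The following is only a sketch under my own assumptions: the function name `find_scannos` and the tiny word list are invented for illustration, and a real list would be far longer. Note that some entries ("arid", "modem") are perfectly good words, so a hit only means "look at this again", not "this is wrong".

```python
import re
from typing import List, Tuple

# Illustrative only: a tiny hand-picked map of common OCR misreadings
# to the word that was probably intended.
SUSPECT_WORDS = {
    "tbe": "the",
    "tlie": "the",
    "arid": "and",
    "modem": "modern",
}


def find_scannos(text: str) -> List[Tuple[int, str, str]]:
    """Return (line number, word found, likely intended word) for each suspect word."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z]+", line):
            word = match.group(0)
            suggestion = SUSPECT_WORDS.get(word.lower())
            if suggestion is not None:
                hits.append((line_no, word, suggestion))
    return hits
```

A check like this catches only the scannos someone has already thought to list, which is rather the point of the post: the ones that survive are the ones nobody anticipated.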

Pulps, letters and science fiction fans

In the process of my ongoing work to put Weird Tales and other pulps on Wikisource, I have found letters pages one of the more awkward things to transcribe. One of my recent tweaks is adding author pages for every published letter writer.

In the past I have found published authors and notable people among these epistoleans, many of whom I did not know prior to this. Some were found by idly googling their names; some were listed on the Internet Speculative Fiction Database (ISFDB); some only turned up when I wikilinked their name and it wasn’t red.

In any case, they are all technically published authors and Wikisource has no notability restrictions. Besides which, I’m not able to pick out just the “important” ones.

Therefore, author pages for all of them.

On the downside: A lot of these authors would be treated as trivial and certainly wouldn’t make it onto Wikipedia. Fortunately, as mentioned, Wikisource’s criterion is generally publication rather than notability. It is also going to be difficult, if not impossible, to get much metadata beyond anything noted in the letter.

On the upside: There is a certain democracy to everyone getting an author page for writing a letter to a pulp magazine in the 1930s. This also serves to create a record of fans and readers of these magazines, with at least a little metadata, not to mention a historic record of people who may not otherwise have one. More practically, it enables tracking of people with multiple published letters, especially if over different magazines.