Common wikisource

As a follow up to my ramblings about Multilingual Wikisource: I have heard some people ask why all Wikisources are not Multilingual Wikisource, like Commons. (I have even heard “Why isn’t Wikisource part of Commons?”)

The latter is easily answered. Aside from the fact that Wikisource needs specific technology to function, it has a different scope and mission to Commons, which would clash if both were part of the same project.

There are many reasons for the former. I think the original was something to do with right-to-left text, which has been solved by now. Others still stand, however.

Disambiguation would be a nightmare, for example. The Bible is complicated enough in English on just one project. Multiple editions in each of hundreds of languages would be ridiculous. This could be solved with, say, namespaces but there are a finite number of namespaces in the MediaWiki software. Besides, the difference between a namespace and a language subdomain is negligible from a technological point of view. The same goes for disambiguation for that matter. A language subdomain is just a bigger version of the concept.

On a different tangent, while Commons is technically multilingual—and a lot of work has gone into supporting that—it is still predominantly English. Community communication is overwhelmingly done in English, English is the default for categories and templates, and so forth. Some grasp of English is often necessary to function on Commons. Language subdomains allow the monolingual (and the multilingual but not anglophone) Wikimedians to take part too, which is more important in curating a library than a media depository.

Obviously, now that we actually have language subdomains, we also have the problems of different cultures and communities on the different projects. Italian doesn’t allow translation, German doesn’t allow non-scans, French doesn’t allow annotation; while some languages, like English and Spanish, are pretty promiscuous in their content. There are likely to be many more, seemingly trivial, quirks that are at odds across different projects. If anyone ever did attempt unification, these communities would clash and conflict all over the place, probably ending in either mutually assured destruction or a very small surviving user base.

You may as well ask why Wikipedia bothers with language subdomains when it could just be Multilingual Wikipedia, like Commons.


Multilingual ramblings

“Old Wikisource” (oldwikisource:) is the incubator of the Wikisources. Languages that do not yet have enough works in their library are all held here, from Akkadian to Zulu, before later potentially budding off into their own projects. They are not part of the actual Incubator because Wikisource relies on specific technology that is not installed there (and probably would need to be heavily adapted to fit it).

One problem this creates is that “oldwikisource” is not a recognised ISO 639 language code. Interwiki links do not work. Wikidata will have a hard time indexing it. No one really knows it’s there.

Fortunately, the International Organization for Standardization predicted situations such as this and included a few extra codes in their set. One of these is “mul” for multiple languages, for situations where databases need to categorise things by language but where some of those things have many. This could mean, for example, mul.wikisource, or even mul.wikipedia, mul.wikibooks, etc (although those are just possibilities, not suggestions).

In other words, exactly what Wikimedia requires for Old Wikisource. Mul could be used for interwiki links from other Wikisources, bringing some attention and potential traffic to an otherwise excluded and ostracised project. Mul could be used on Wikidata to collect and connect pages. Mul is already used in some parts of Wikisource to refer to the not-sub-domain.

It also helps that Old Wikisource, while accurate as the original project, is not as easily explained to Wikimedians on our sister projects as is “English Wikisource”, “French Wikisource” or, as it happens, “Multilingual Wikisource”.

So, for preference, I would see Old Wikisource become Multilingual Wikisource. I think it would make lots of things easier, while making the project more visible, more functional, and slightly more obvious to outsiders. It must be said that I am not a regular on Old Wikisource and those that are may not agree.

Fully enabling ISO 639 in Wikimedia would also technically affect user language options too. A user could conceivably select “Multiple” as their preferred language, regardless of where they were in Wikimedia. In practice, this would probably just default to English, so I don’t think it would be a big problem.

More serious would be the amount of trouble this would be to implement. Just creating an alias for Old Wikisource would be easiest, so the code could be used as described without really changing much.

In my view, moving the project entirely is still better: with most existing pages going to mul.wikisource.org and just a portal remaining at wikisource.org (in line with its sister projects like Wikipedia). If changes are going to be made, we might as well go all the way rather than patch the system with aliases. That’s a lot of work for relatively little gain though, and I don’t know how keen the current Old Wikisourcers would be on this option (nor the technical people who would have to do all the heavy lifting).

I haven’t actually made any proposal based on this (although some related bug reports have been open for years). I’m still not sure what would be best nor what the wider community would prefer and I’m just thinking, or typing, out loud. This is just a blog after all.

As it stands, though, my opinion is that Multilingual Wikisource would probably work better than Old Wikisource.

Copyright illiteracy redux

Original Weird Tales illustration for

The problems of copyright-renewed works being added to Wikisource continue. In this case, by me. I added “Tell Your Fortune” by Robert Bloch to Wikisource as part of Weird Tales (vol. 42, no. 4, May 1950). Bloch, author of Psycho and mentee of Lovecraft, mostly renewed his copyrights but missed the occasional piece. He has a few letters hosted on Wikisource already but this would have been his first work of fiction. I uploaded it, transcribed it, proofread it and eventually transcluded (i.e. “published”) it when the work was done.

And then it transpired that the copyright had been renewed after all and hosting it on Wikisource is illegal.

I honestly did try to make sure that I caught all the copyright renewals.  I checked scans of the copyright renewal catalogues, transcriptions of those scans, the US Copyright Office’s online database and Google searches.  1950 is an odd year as it was transitional; renewals can be recorded in either the old-style printed catalogues or on the newer official database. There is no complete, single source for this type of renewal.  I created Weird Tales and its subpages mostly to record information like this for this precise reason.  I did catch some other renewals in this issue, “The Last Three Ships” by Margaret St. Clair and “The Man on B-17” by August Derleth, and redacted them from the scan accordingly.  This one escaped me, however, despite being clearly entered on the Copyright Office’s database.

So, it’s worth quadruple-checking the copyrights before you do all of the work necessary to get a text on Wikisource.

There are still usable parts of the issue, such as the poem “Luna Aeternalis” by Clark Ashton Smith and the short story “The Triangle of Terror” by William F. Temple.  Smith has many works already on Wikisource but few of them are backed by scans yet (and some were recently deleted and re-hosted in Canada on Wikilivres).  Temple, a British science fiction author, is new to Wikisource.  This story is actually interesting copyright-wise because Temple only died in 1989 and so his works are still under copyright in the UK.  As this work was first published in the US, however, it is in the public domain under American law due to non-renewal.

A Margaret Thatcher Library and Museum?

It seems I may have spoken too soon in my last post.

In that post I mentioned the international copyright discrepancy which meant that the only public domain work on Margaret Thatcher’s author page was one released by the Federal Government of the United States.  Modern US Presidents each have a federally operated library making such historical materials available.  It seems like there may be hope of a British equivalent after all. (NB: Technically, there is one already, the Gladstone Library in Wales, but they are rare and I don’t believe that one is quite the same thing.)

A Margaret Thatcher Library and Museum Project has recently been announced.  It was inspired by and will be modelled on the Ronald Reagan Library in California and, if it goes ahead, it will set up a similar institution in central London.  It was the idea of Donal Blaney, chief executive of Conservative Way Forward, in 2009 and is supported by members of the Conservative Party (including some secretaries of state, so this might actually happen).

The Thatcher Library is currently only proposed, although it appears to have significant backing and funding already.  I have not seen any confirmation about the library’s contents yet.  The US libraries are run as part of the National Archives & Records Administration, while the UK equivalent will be a private foundation, separate from the UK’s National Archives (although pre-Hoover US libraries are in a similar state).  How that affects the material from Margaret Thatcher’s premiership held by the Archives is unknown; The Times suggests that the Library may hold facsimiles.  There is also the Thatcher Archive at Cambridge University and The Margaret Thatcher Foundation, which cover similar ground.

It also remains to be seen just how much like a US presidential library this institution will be.  That all federal government documents are in the public domain in that country is a result of SCOTUS case law and eventual formal codification in US law.  There is nothing similar in the UK and there is no guarantee that the Thatcher Library will emulate this aspect of the US system.

Nevertheless, this could be an important step forward for the open culture movement in the United Kingdom.  Maybe one day Wikimedia UK will even be arranging a Wikimedian in Residence there and Wikisource will finally have a fuller bibliography on Margaret Thatcher’s author page.

Margaret Thatcher and the oddities of copyright

Margaret Thatcher in 1981 (Public Domain via Wikimedia Commons)

The recent, sad death of former Prime Minister Margaret Thatcher has highlighted an odd juxtaposition of international copyright laws. (Yes, this is yet another copyright post.) Her author page received 95 page views on 8th April, which is about the same traffic it usually gets in an entire month. However, the page only links to two works, one of which I have now tagged as a copyright violation (her famous “the lady’s not for turning” conference speech, which is probably under copyright until the early 2080s).

The somewhat odd situation being highlighted is due to the other work on that page. I transcribed and added it not long ago. It’s a memcon, a memorandum of a conversation, with President Gerald Ford before Mrs. Thatcher became Prime Minister. This document was made available by the Gerald R. Ford Presidential Library and Museum, part of the American presidential library system. It is in the public domain because all works by officers of the Federal Government of the United States, made as part of their official duties, are in the public domain under United States law. So, it would appear that the only way to read any of the works of Margaret Thatcher, Prime Minister of the United Kingdom of Great Britain and Northern Ireland, on Wikisource is via the government of the United States of America.

Fortunately, there is also a link to the Margaret Thatcher Foundation website, which does have a complete, online collection of all of her speeches, interviews, etc. So they are not lost or hidden but they aren’t free. It is not necessarily a problem, bar potentially limiting distribution and preventing things like crowd-sourced translation. It is, nevertheless, still a very odd position in which to be.

The internal cost of copyright illiteracy


More so than most other Wikimedia projects, except perhaps Commons, copyright is a big deal for Wikisource.  Obviously we can only host public domain or freely licensed works; which is generally understood.  The problem comes from copyright law itself not being generally understood.  (I can’t claim to be especially knowledgeable about copyright myself but I have picked up a lot as part of the Wikisource community.)

Many people apparently believe certain works must or should be out of copyright without checking or they do check but miss some detail of copyright law.  Wikisource as a project can deal with this by deletion but it still impacts volunteers.

A recent example is the science fiction short story “Time Pawn” by Philip K. Dick, a story that was published in 1954 in an issue of Thrilling Wonder Stories.  Under the law of the time, the initial copyright period ended in 1982 when it could have been renewed for another period.  As this didn’t happen it would seem to have entered the public domain.  However, while the short story was not renewed, the issue of the magazine itself was, under renewal registration number RE0000112616 in January 1982 by CBS Publications.  It has been established, in Goodis v. United Artists Television, Inc., “that where a magazine has purchased the right of first publication under circumstances which show that the author has no intention to donate his work to the public, copyright notice in the magazine’s name is sufficient to obtain a valid copyright on behalf of the beneficial owner, the author or proprietor.”  Lacking information to the contrary, we must assume that this applies to Dick’s story; the renewal of the copyright on Thrilling Wonder Stories also renewed the copyright on “Time Pawn” so, unless it was reassigned, CBS currently hold the rights on the story until about 2050.
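The dates in that example follow from simple arithmetic, which can be sketched roughly as below. This is only an illustration under stated assumptions: a 28-year initial term, a total of 95 years from publication once renewed, and terms counted in whole calendar years. The function names are made up for this sketch, and it ignores the many special cases in US copyright law.

```python
# Rough sketch of the US term arithmetic for a pulp story like "Time Pawn".
# Assumptions: whole calendar years, a 28-year first term, and a total of
# 95 years from publication if a renewal was registered.

INITIAL_TERM_YEARS = 28   # first copyright term under the law of the time
RENEWED_TOTAL_YEARS = 95  # total term once a renewal is on record


def renewal_year(publication_year: int) -> int:
    """Year in which the first term ran out and a renewal was needed."""
    return publication_year + INITIAL_TERM_YEARS


def public_domain_year(publication_year: int, renewed: bool) -> int:
    """First year in which the work is in the US public domain."""
    term = RENEWED_TOTAL_YEARS if renewed else INITIAL_TERM_YEARS
    # Protection runs through the end of the final year of the term.
    return publication_year + term + 1


print(renewal_year(1954))              # 1982
print(public_domain_year(1954, True))  # 2050
```

For a 1954 publication this gives a renewal deadline in 1982 and, because the magazine issue was renewed, a public domain date of about 2050.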

The real issue here is that another user, not the uploader, completed the proofreading of the entire story in good faith. At that point yet another user noticed it and rightly marked it as a copyright violation. Now that good-faith user’s effort is wasted and they may be permanently disillusioned with the project. Everyone loses.

This is actually partly my fault. I noticed the upload and I tagged a separate, similar upload (“Small Town”) for deletion for the same reason but I didn’t connect the two.

I’m not sure what else can be done to prevent things like this from happening.  Both Wikisource and Commons already have help pages on copyright that should explain the problem.  Constant vigilance (and better awareness on my part, at least) may be the only solution, but that is unlikely to be foolproof.

Note 1: “Small Town” was published in Amazing Stories, which hardly ever had its copyrights renewed, in the very first issue to do so.  Conversely, Thrilling Wonder Stories, along with the entire “Thrilling…” stable of magazines, apparently had consistent copyright renewals across the board.  Ironically, that isn’t true under its earlier incarnation as simply Wonder Stories, a pulp also created by Hugo Gernsback after he lost control of Amazing Stories.

Note 2: A later version of “Time Pawn” (published in Startling Stories, Summer 1955) appears to have been renewed as well, under RE0000190631 in 1983 by Dick’s children.  This may or may not be relevant; a court could declare it close enough.

A peek at the arcane mysteries of Wikisource

I have heard, both in meatspace and online, that Wikisource is mysterious and hard to understand. I don’t agree with this but I’ve been involved with the project for years now so I may have just been institutionalised.

While there are many Wikimedians who do not recognise the project,[1] I think most get the gist that it is a digital library. I’ve heard “Like Project Gutenberg but on a wiki” and “Wikipedia library”. Neither is ideal but they are close enough. Wikisource is actually one of the largest projects (about third in page count after Wikipedia and Wiktionary, probably fourth overall if Commons is bodged in as second), so it can be assumed that it is less obscure than Wikibooks et al. Nevertheless the confusion seems to persist.

There are other projects to promote and document Wikisource, so I thought I would try a different tack and explain by example.

One of my pet projects is the transcription of pulp magazines. So far, this includes some issues of Amazing Stories, Avon Fantasy Reader and Weird Tales. Actually, I intend this to be a slightly wider project but it’s mainly focused on pulps for the moment. Hopefully it will one day include the Boy’s Own Paper, fiction digests, 1950s “sweats” magazines and similar popular entertainment. My rationale is that a lot of this material is “pseudo-lost”; it is in the public domain and so technically belongs to everyone but remains unavailable, not because anyone is sequestering it but because few people are trying to make it universally available. Some libraries keep collections but these are relatively few in number and not widely accessible.

I own some pulp magazines and I have tried scanning a few. This involved building my own low-budget V-cradle scanner[2] and the results were mixed. Fortunately, other people have already scanned pulps and their results are available on eBay. Actually, I am aware that scans are available online but that gets into some murky, grey areas of taking something without giving anything in return and I wouldn’t feel comfortable.[3] Commerce is straightforward. I will get back to scanning my own collection eventually but third-party scans will more than suffice for now.

The scans need to be processed a bit and sometimes redacted a bit too. In doing all of this, I have acquired more knowledge than any sane person really wants or needs about US copyright law (and there are still vast gaps and obscure special cases I do not yet comprehend).[4] Sometimes whole pulp magazines are still under copyright, often only a single story or two are and the rest is public domain. Once suitably modified, the scans need to be turned into something useful. The quick and cheap way of doing so is to upload them to the Internet Archive and download their derived version (which can then be reuploaded to Commons).

Proofreading the individual pages is the bulk of the transcription work. This takes time but it is usually simple enough. A lot depends on the quality of the individual scans but the Internet Archive has pretty good OCR software. There are still errors to be corrected and line feeds to be removed but most of the text tends to be more-or-less intact and legible. Some really poor OCR’d text can appear to be nothing more than random hexadecimal strings at times.

Illustrations can take a little time with GIMP but I’ve become familiar with the kind of material with which I’m working at the moment. All are monochrome and line drawings are common. Large, complicated illustrations can take a lot of time to clean up but others just involve messing around with levels and alpha channels.

Sometimes people actually miss the last step of the transcription process: transcluding the proofread text from the Page namespace to the mainspace (it’s like having lots of templates, although there’s a tag that does it all in one). It’s pretty easy normally. Of course, my project complicates it a tad because I like to include the period adverts (seeing the fiction and articles within the context of their original setting is part of the project to my mind)[5] and sometimes judgement calls are needed on splitting things between subpages or the best way to replicate elements of the original.
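As a rough illustration of that last step: the tag in question is, as far as I know, the `<pages />` tag provided by the ProofreadPage extension, which pulls a range of proofread pages from one index into a mainspace page. The small helper below just builds that line of wikitext; the helper name and the file name are hypothetical and only there for the example.

```python
def pages_tag(index_file: str, first: int, last: int) -> str:
    """Build the wikitext that transcludes a range of proofread pages.

    index_file is the scan's index name; first and last are the page
    numbers within that scan (both inclusive).
    """
    return f'<pages index="{index_file}" from={first} to={last} />'


# Hypothetical file name and page range, for illustration only.
print(pages_tag("Example pulp magazine.djvu", 5, 12))
```

The printed line is what would be pasted into the mainspace page (or subpage) for a single story, in place of a long list of individual page transclusions.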

Some of the material I’ve transcribed is widely available anyway. There are, for example, some Howard and Lovecraft pieces in the Weird Tales transcriptions that are cheaply available in many print collections and elsewhere on the internet. Preserving a copy of these texts as they were in the pulps is important but one of my favourite parts is making available the lesser known pieces that accompany them. Some of these works may never have been republished since the initial pulp printing and were, for all practical purposes, essentially lost works for most people. Letters pages are a fascinating source of contemporary opinions and are likewise rarely republished (if ever).

As far as I know, no publisher has re-released any of my transcriptions, not that it would be easy to tell if they didn’t want to attribute it to Wikisource.[6] Nor am I aware of any translations on other Wikisources. Both still remain possibilities. I’ve noticed the occasional familiar-looking text on blogs, however, so they are getting out slowly. Along the same lines, the first issue of Amazing Stories is scheduled to be the featured text in May. Poe, Verne and Wells are in no danger of being forgotten but England, Hall and Wertenbaker could use a little extra attention.

None of this really interacts with Wikipedia much, short of an occasional writer’s biography proving useful, which means conversation at wikimeets and elsewhere can be a bit limited. In a way, I think that may bring us full circle to people not understanding the arcane mysteries of Wikisource.


[1] There seems to be a low level of confusion with Wikipedia’s WikiProject Citation cleanup and/or WikiProject Fact and Reference Check.

[2] The v-cradle is made of old cardboard boxes and duct tape. I have considered making an upgraded V-Cradle v.2.0, which will make the technological leap forward to Lego. The “scanner” is a digital camera.

[3] Other material falls within the same area. While it is technically legal to scan a public domain work from, say, a charity’s publication, it doesn’t feel right.

[4] Having caught glimpses of some of these unspeakable occurrences in my wanderings, I am left with the impression that no one but a specialist IP lawyer should ever attempt to engage such Eldritch Things (which are, no doubt, both rugose and squamous). Down this path only madness lies.

[5] At least, a reasonable facsimile of the original setting. There are limits on typography when reproducing works in Wiki-HTML, which I prefer to do whenever possible, but very complicated adverts may end up as image files.

[6] Wikisource, like most of Wikimedia, is hosted under a Creative Commons licence that requires attribution. However, as these works are already in the public domain, imposing Creative Commons licensing and any associated restrictions would actually be copyfraud. The act of transcription does not grant any protection under US law. Besides which, the attribution would be to the original author, not the project or transcribers.

Why Wikisource?


One place to start when talking about Wikisource is, “Why bother?” There are many other digital libraries, from Project Gutenberg to the Internet Archive. What separates Wikisource from them?

In fact, this was an early response to the proposal of a Wikisource-like project back in 2001. Larry Sanger was one of the first to comment, saying:

The hard question, I guess, is why we are reinventing the wheel, when Project Gutenberg already exists? I mean, what really is the need for having this project?

This was closely followed by none other than Jimmy Wales himself, who said:

Like Larry, I’m interested that we think it over to see what we can add to Project Gutenberg. It seems unlikely that primary sources should in general be editable by anyone.

So what does separate Wikisource from similar projects? What are Wikisource’s unique selling points?

Wikisource has many things in common with other libraries but many unique qualities as well. A quick list of unique selling points, as I see them at least, would be accessible scans, crowdsourced proofreading and potential for added value. Gutenberg has proofreading but its sources are hidden. The Internet Archive has scans but only error-ridden computer-transcribed text. Other digital libraries fall into one or the other of these camps. Wikisource, however, combines the reliability of the scans with human-made transcriptions.

Together, these mean that anyone can contribute to proofreading, regardless of personal resources and access to texts. Once proofread, anything can be checked and corrected. If, for example, you doubt a spelling, a scan of the original page is just a click away, where it can be confirmed or corrected.

Added value, such as wikilinking certain terms or embedding spoken-word versions, just adds more to this already pretty solid foundation.

To Annotate or Not to Annotate?

I have recently been reminded about the topic of annotation.

Annotation remains a vexed issue on the English Wikisource. Not all Wikisources accept annotations; English used to be one that did. After a contentious debate the entire policy ended up being blanked pending any sort of consensus and has remained that way for over a year. That just led to a sort of no-man’s-land, with different editors doing their own, potentially contradictory things.

The main issue, of course, is whether or not Wikisource should host texts with user-generated annotations.

Part of Wikisource’s mission is to provide accessible copies of source texts. Texts that should remain as faithful and pure as possible. Wikisource does not even correct typos.

Being a wiki, however, the texts could have added depth and usefulness if they provided more information. Place names, for example, change over time and perhaps a reader does not know that Constantinople is Istanbul. It’s simple to add this to the text, in many different ways, but if you do, then the text is slightly less faithful and slightly less pure than it could have been.

That leads to the next two issues: what counts as an annotation and how much, if any, is allowed. Some say that even a humble wikilink is an annotation and these must all be purged to maintain textual purity. Users have removed wikilinks for this reason in the past. Others go further than wikilinks and add new footnotes, diagrams and maps to help improve the clarity of a text. Most users are somewhere in between; I’ve done all of the above.

A casual reader can be helped by having information put in context, or locations pointed out on maps, or names linked to full biographies. However, if a reader wants to know what exactly a reader in the past would have read, or what a specific author actually published, then user annotations start to obfuscate matters, even if marked.

Keeping multiple copies of texts is one solution: a pure text and a clearly marked annotated version. That doubles work load, however, and presents some technology problems. Technology might be a solution, with the mooted “onion skin” Wikisource 3.0, but that remains theoretical at the moment. Hebrew Wikisource, the oldest standalone Wikisource, uses a special namespace just for annotations, although it is currently the only one to do so. If we are going to put the text somewhere else, why not a different project altogether? This does technically fall within Wikibooks’ bailiwick but will simple wikilinks be enough on that project and are they going to be happy with the buck being passed to them? Even if so, how would we stop new users coming along and putting wikilinks on a Wikimedia project?

The case continues.
