Languages in real-world ePubs

by Liza Daly

We were curious about the distribution of languages for ePubs on Bookworm. (Ibis Reader doesn’t have enough titles to be representative yet.)

The following information is derived from the dc:language field in the OPF file.
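For anyone who wants to run the same check on their own files, here is a minimal sketch of pulling dc:language out of an OPF document with Python's standard library. The function name and the sample OPF are made up for illustration; this is not the code Bookworm uses.

```python
import xml.etree.ElementTree as ET

# Dublin Core element namespace used by OPF metadata.
DC_NS = "http://purl.org/dc/elements/1.1/"


def languages_from_opf(opf_xml):
    """Return every dc:language value found in an OPF document string.

    Hypothetical helper for illustration. Missing or empty elements
    come back as empty strings so they can be counted as "no value".
    """
    root = ET.fromstring(opf_xml)
    return [(el.text or "").strip() for el in root.iter("{%s}language" % DC_NS)]


# Invented sample, trimmed to the parts that matter here.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="id">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Sample</dc:title>
    <dc:language>en-GB</dc:language>
  </metadata>
</package>"""

print(languages_from_opf(SAMPLE_OPF))
```

An ePub is a zip archive, so in practice you would first read the OPF out of it (the standard library's zipfile module can do that) and pass the decoded text to this function.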

Here’s the chart:

Missing from the chart, of course, is English. It’s so overrepresented it skews the chart to the point of being unreadable.

Of the 62,000 ePubs on Bookworm right now:

  • 29,642 have no language value
  • A little over 20,000 are English (combining various values like “en”, “en-GB”, or — embarrassingly — “American”)
  • The remainder, 5,874, are distributed among all other languages
  • Almost half of the distinct values appear just once (likely bad data)
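The grouping behind a breakdown like this can be sketched roughly as follows. The alias table, the function names, and the sample values are all invented for illustration. They are not Bookworm's data or its actual cleanup rules.

```python
from collections import Counter

# Hypothetical aliases for free-text values that clearly mean English.
ALIASES = {"american": "en", "english": "en"}


def bucket(raw):
    """Map a raw dc:language value to a coarse language bucket."""
    value = (raw or "").strip().lower()
    if not value:
        return "(none)"
    value = ALIASES.get(value, value)
    # Keep only the primary subtag, so "en-gb" counts as "en".
    return value.split("-")[0]


def tally(values):
    """Count how many titles fall into each bucket."""
    return Counter(bucket(v) for v in values)


# Invented sample values.
sample = ["en", "en-GB", "American", "", None, "cs", "cs", "Robert"]
counts = tally(sample)

# Buckets seen only once are the likely bad data.
singletons = [lang for lang, n in counts.items() if n == 1]
print(counts.most_common())
print(singletons)
```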

I found it very interesting that the most represented non-English language code is cs — Czech — by a huge margin. Any ideas why?

Wondering which values are correct? The OPF 2.0 spec is unambiguous:

The content of this element [dc:language] must comply with RFC 3066
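As a rough sketch of what complying with RFC 3066 means at the syntax level: a tag is a primary subtag of 1 to 8 letters, optionally followed by hyphen-separated subtags of 1 to 8 letters or digits. The function names below are made up, and the second check is a simplification of the RFC's rules for primary subtags, not a lookup against the ISO 639 lists.

```python
import re

# RFC 3066 grammar: 1*8ALPHA *( "-" 1*8(ALPHA / DIGIT) )
TAG_RE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_well_formed(tag):
    """True if the tag matches the RFC 3066 syntax (shape only)."""
    return TAG_RE.fullmatch(tag) is not None


def has_plausible_primary(tag):
    """Simplified check: the primary subtag is 2 or 3 letters, or 'i' / 'x'.

    Assumption for illustration: this only tests the shape of the primary
    subtag. It does not confirm that the code is actually assigned.
    """
    if not is_well_formed(tag):
        return False
    primary = tag.split("-")[0].lower()
    return len(primary) in (2, 3) or primary in ("i", "x")


for tag in ["en", "en-GB", "cs", "American", "en_GB"]:
    print(tag, is_well_formed(tag), has_plausible_primary(tag))
```

Note that a value like “American” passes the pure syntax check, since it is eight letters long. Catching it takes at least the second check, and a real validator would also need the ISO 639 code lists.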

(Also, does anyone speak “Robert”?)