denise: Image: Me, facing away from camera, on top of the Castel Sant'Angelo in Rome (Default)
Denise ([staff profile] denise) wrote in [site community profile] dw_biz 2012-06-19 09:23 pm

RFC: Specifying languages in profiles: how should we do it?

This entry is being posted on behalf of the programmer who is working on a bug that sprang from a suggestion to make it easy for people to find other people blogging in languages they speak. (They wrote it; I'm just posting it! I don't want to take credit for any of this.) We're looking for some thoughts on our current ideas and want to make sure we aren't missing something super obvious :)

Turning it over now:



Way way back in 2009, it was suggested that we create a way of stating which languages a journal uses. There are obvious advantages to this: it'll make it easier for those who want to find non-anglophone areas of Dreamwidth, whether in their first, second, or third (etc.) language. This is a feature we're going to implement: what we want to know is how. (For those of you who are interested, there's a fair bit of discussion going on in the Bugzilla comments, but it's all repeated here.)

This post has been split into the separate areas that need refinement. In each section we outline our current thoughts: we'd love it if you gave us feedback on them, and we'd love it even more if you came up with an obvious better solution we've not thought of yet. :-)

The areas in question are:
  1. Languages field: usage
  2. Language entry options
  3. What does our standardised list look like?
  4. How do we choose the languages on our standardised list?
  5. How do we organise the list?
  6. Any Other Business


1. Languages field: usage


The original suggestion was for a field allowing people to state which languages they update in. Users are quite likely to enter languages they read in any such field: we need to think about how to handle this.

Options we can think of are:
  1. word the legend very carefully, and accept that the field will sometimes be used inaccurately
  2. provide separate "writes in" and "reads" boxes under an overall "Languages" heading
  3. if we go with 2, optionally include a "reads is the same as writes" (or "writes is the same as reads") tickybox, to reduce work for the user
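
For illustration only, options 2 and 3 could be modelled roughly like this. It's a minimal sketch, not a proposed schema: the class name `LanguagePrefs` and its fields are made up for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LanguagePrefs:
    """Hypothetical per-journal language settings (option 2 plus the option 3 tickybox)."""
    writes: List[str] = field(default_factory=list)
    reads: List[str] = field(default_factory=list)
    reads_same_as_writes: bool = False

    def effective_reads(self) -> List[str]:
        # With the box ticked, the separate "reads" list is ignored
        # and simply mirrors the "writes" list.
        if self.reads_same_as_writes:
            return list(self.writes)
        return list(self.reads)


prefs = LanguagePrefs(writes=["en", "fr"], reads_same_as_writes=True)
print(prefs.effective_reads())  # ['en', 'fr']
```

Storing the tickybox as a flag, rather than copying the list when the form is saved, would mean later edits to "writes in" carry over to "reads" automatically.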


2. Language entry options


Users will want to enter more than one language; therefore, this needs to be possible.

Ideally we would provide a standardised list so that we did not end up with confusion between "French", "french", "francais", "français" etc all pointing to different places, as they currently do if listed in interests.
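
To make the spelling problem concrete, here is a rough sketch of folding those variants onto one entry. The alias table is a made-up example, and using "fr" as the canonical value assumes we go with language tags as discussed in section 3.

```python
import unicodedata

# Hypothetical alias table: normalised spelling -> canonical language tag.
ALIASES = {
    "french": "fr",
    "francais": "fr",
    "english": "en",
}


def normalise(text):
    """Trim, case-fold, and strip accents so 'Français' and 'francais' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.strip().casefold())
    # Drop combining marks (e.g. the cedilla left over after decomposing 'ç').
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonical_tag(text):
    """Return the canonical tag for a user-typed language name, or None if unknown."""
    return ALIASES.get(normalise(text))


for spelling in ["French", "french", "francais", "français"]:
    print(spelling, "->", canonical_tag(spelling))
```

Accent stripping is only safe for matching what people type; the stored and displayed names should keep their accents.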

However, this means we need to think about how to present this list. Some ideas that have been tossed around so far:
  1. provide check-boxes, possibly making the languages section collapsible. Advantage: easiest to select multiple options. Disadvantage: takes up a lot of space on the Manage Profile page, and will probably be edited much less frequently than e.g. interests or bio.
  2. provide a single drop-down, with the option to "add another" to produce another drop-down.
  3. provide a free-text box that behaves like the "Tags" field in the Create Entries page: it is possible to type in anything that you like, but you will be given suggestions from a standardised list and you'll be able to click "browse" to be shown the full list with check-boxes.
  4. implement one of (1) or (2), and provide a link reading something like "your language not listed?", which will reveal a free-text box for languages not on our standardised list.


Combinations of the above are naturally also possible, and we're very open to better ideas!

3. What does our standardised list look like?


As discussed above, we would like a standardised list. This gives us two problems: what our standard should be (this section), and which languages go on the list (section 4).

In comments in [site community profile] dw_suggestions, it was suggested that we use BCP-47 international language tags, wherein, for instance, "en-GB" is British English and "ta-SG" is Singaporean Tamil. It's also possible to specify scripts, allowing people to distinguish whether they're writing ru-Latn (Russian in Latin script) or ru-Cyrl (Russian in Cyrillic script).

Using only the BCP-47 tags is somewhat opaque, but they do allow a way to be very specific -- while also potentially allowing translation.

One possible model would be associating the BCP-47 tag names and the language names or descriptions in a more readable form, i.e. en would be associated with the string "English". (The ideal would be to make these associations such that if translations of Dreamwidth occur, it would be trivial to generate a file of language names in the target language while preserving the associations, e.g. "Anglais" would automatically map to en.)
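
A rough sketch of that association, assuming one table of names per interface language. The table contents and the function names `display_name` and `tag_for_name` are invented for illustration.

```python
# Hypothetical display-name tables keyed by UI language, then by language tag.
DISPLAY_NAMES = {
    "en": {"en": "English", "fr": "French", "de": "German"},
    "fr": {"en": "Anglais", "fr": "Français", "de": "Allemand"},
}


def display_name(tag, ui_lang="en"):
    """Readable name for a tag in the given UI language.

    Falls back to the English table, then to the raw tag itself.
    """
    for table in (DISPLAY_NAMES.get(ui_lang, {}), DISPLAY_NAMES["en"]):
        if tag in table:
            return table[tag]
    return tag


def tag_for_name(name, ui_lang="en"):
    """Reverse lookup: map a displayed name back to its tag (case-insensitive)."""
    wanted = name.casefold()
    for tag, label in DISPLAY_NAMES.get(ui_lang, {}).items():
        if label.casefold() == wanted:
            return tag
    return None


print(display_name("en", ui_lang="fr"))        # Anglais
print(tag_for_name("Anglais", ui_lang="fr"))   # en
```

Because only the tag is stored against the journal, adding a translation would mean adding one more table rather than touching anyone's profile data.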

We also need to consider whether it would be useful to allow people to restrict their searches further. For example, a user searching for "de" (German) would have all results returned, including de-AT (Austrian German), de-CH (Swiss German), de-DE (standard German), de-1996 (German in the reformed 1996 spelling), etc - but do we want to allow these subtags, and therefore allow people to narrow their searches to only people writing in Swiss German?
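
The search behaviour described above is essentially prefix matching on whole subtags (this is how "basic filtering" in RFC 4647 works, as far as we understand it). A minimal sketch, with hypothetical function names:

```python
def matches(search_range, tag):
    """True if tag equals the search range or extends it with further subtags.

    Comparison is case-insensitive and respects subtag boundaries,
    so "de" matches "de-CH" but not "den".
    """
    wanted = search_range.lower()
    candidate = tag.lower()
    return candidate == wanted or candidate.startswith(wanted + "-")


def filter_tags(search_range, tags):
    """Keep only the tags that fall under the search range, preserving order."""
    return [tag for tag in tags if matches(search_range, tag)]


tags = ["de", "de-AT", "de-CH", "den", "en-GB", "gsw-CH"]
print(filter_tags("de", tags))     # ['de', 'de-AT', 'de-CH']
print(filter_tags("de-CH", tags))  # ['de-CH']
```

Narrowing then comes for free: a more specific search range simply matches fewer tags.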

4. How do we choose the languages on our standardised list?


Of course, the biggie - and the one we'd most like input on - is how to choose the "seed list" languages in the first place. This is the area where - we think - we're most likely to muck up and (at best) create an unhelpful list, so it's where we'd most like your input.

Methods currently under consideration (as ever, please suggest more):
  1. Grab the top 15 countries from Dreamwidth's usage statistics, and use their official languages as the seed list. Short and sweet. Possibly too short; privileges English over native, regional, and indigenous languages in countries that were colonised (New Zealand is the notable exception to this). Would result in ~25 languages on the list.
  2. Grab the top 15 countries from Dreamwidth's usage statistics, and use their official languages and their recognised regional languages as the seed list. This would mean that for, say, the UK, in addition to English there would also be the options Irish, Scottish Gaelic, Ulster Scots, Welsh, Cornish, etc. This list would be much, much longer (probably 50-150 languages on the list).
  3. hold a poll in a [site community profile] dw_news post, and populate the seed list with any language that gets more than n votes (in which case, what should n be?)
  4. what kind of suggestions system should we have for adding new languages to the standard list?
  5. some combination of the above with the top 15 global languages, which, while it adds length, has the advantage that we're not privileging English quite so ridiculously.


5. How do we organise the list?


  • by language tag? i.e. en-GB, en-US, en-CA..?
  • by language name? i.e. English, French, Japanese, Tagalog...?
  • by both? i.e. English (en) --> en-GB/British English, en-CA/Canadian English; French (fr) --> fr-CA/Canadian French, fr-FR...?
  • by country? i.e. Canada: English, French; Singapore: English, Tamil, Chinese, Malay?


6. Any Other Business?


We're sure there's things we're forgetting - these questions are only what's come out of two people thinking about this on-and-off for two days, and more brains is better brains! Please, please let us know what we're missing, and let us know what you think the correct course of action among the options listed above should be.
thorfinn: <user name="seedy_girl"> and <user name="thorfinn"> (Default)

Don't have a seed list. And language tag(s) should be available on individual posts/comments

[personal profile] thorfinn 2012-06-21 02:06 am (UTC)(link)
BCP-47 looks like a source that might be about as sane as you can get, and has the benefit that you don't have to try to maintain it yourself, and thereby get it wrong in a whole bunch of exciting and different ways. I'm not 100% sure how the list of sub languages are maintained, e.g. en-GB, en-AU definitely exist, but is en-SG valid and who decides? Those don't appear to be in that list, but I haven't researched how the sub tags work. Oddly en-GB-oed (grandfathered) and en-scouse (redundant) appear to be listed directly...

As far as a "seed list" goes - why? Don't have one. Have the entire BCP-47 list with sub-region tags (displayed as your "both" format above), a text box, and auto-suggest typing or a popup-selection once the typing is enough. Especially if you're doing it on the profile, this is a once off thing being done for each journal. Shortlisting from a large list is inherently ugly, and you don't need to optimise it away because this is essentially a rare operation. Don't fix a problem that isn't a problem. :-)

This thing should also apply to individual posts (and comments?), to identify the language(s) the post or comment is written in, if desired. After all, most of my journal is in en-AU, but I might want to write in pig-latin or tengwar (is tengwar a script only? I can't find a matching language tag?) for some reason.

On posts (and comments?), yes, you should have a short list - but you can pull that short list directly from the profile of the user doing the posting/commenting, and you know that's a useful shortlist because that particular human picked it.
inoru_no_hoshi: The most ridiculous chandelier ever: shaped like a penis. Text: Sparklepeen. (Default)

Re: Don't have a seed list. And language tag(s) should be available on individual posts/comments

[personal profile] inoru_no_hoshi 2012-06-21 03:16 am (UTC)(link)
Tengwar is a script, I believe; you can write Quenya and Sindarin in it, IIRC. (Also, the language of Mordor - remember the Ring?) Probably there are ways to write a lot of languages using Tengwar.

As I have Singaporean friends, I have to posit that en-SG is entirely valid; they have some manners of phrasing that most closely resemble en-GB (for obvious reasons), but the sentence structure can be very influenced by Chinese, and they have some ways of emphasising that I don't see anyone not from that general region using. It's entirely cross-understandable, but it is its own dialect, inasmuch as there are differences between English in the UK, US, and Australia/NZ/etc., and we're not touching sub-dialects because yes, we know. :P :)
Edited (Spelling, I can has it?) 2012-06-21 03:17 (UTC)
pne: A picture of a plush toy, halfway between a duck and a platypus, with a green body and a yellow bill and feet. (Default)

Tengwar

[personal profile] pne 2012-06-21 05:23 am (UTC)(link)
Tengwar is a script, I believe; you can write Quenya and Sindarin in it, IIRC. (Also, the language of Mordor - remember the Ring?) Probably there are ways to write a lot of languages using Tengwar.

*nods* A script. In addition to the languages you mention, there are also a couple of “modes” for writing English (used on the title page of LotR, for example, IIRC).

It even has an ISO 15924 script tag “Teng”, so you could mark an entry as being in sjn-Teng (Sindarin, written in Tengwar), or even en-Teng-AU (Australian English, written in Tengwar; the script subtag goes before the region) (since BCP 47 uses ISO 15924 as one of its sources).
turlough: detail from map of Middle Earth, art by Pauline Baynes ((tolkien) the realm of gondor)

Re: Tengwar

[personal profile] turlough 2012-06-21 06:32 pm (UTC)(link)
It even has an ISO 15924 script tag “Teng”, so you could mark an entry as being in sjn-Teng (Sindarin, written in Tengwar), or even en-Teng-AU (Australian English, written in Tengwar; the script subtag goes before the region) (since BCP 47 uses ISO 15924 as one of its sources).

I didn't know that. How cool! /former tolklang nerd
thorfinn: <user name="seedy_girl"> and <user name="thorfinn"> (Default)

Re: Don't have a seed list. And language tag(s) should be available on individual posts/comments

[personal profile] thorfinn 2012-06-21 06:58 am (UTC)(link)
Ah, and there we are:

%%
Type: language
Subtag: qya
Description: Quenya
Added: 2009-07-29
%%
Type: language
Subtag: sjn
Description: Sindarin
Added: 2009-07-29

of course. :-) I haven't refreshed my Tolkien brain cells in years, and couldn't remember what the languages were actually called.
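
For anyone wanting to poke at the registry themselves: records like the two above are separated by "%%" lines and made of "Field: value" pairs, so a rough reader is short. This is only a sketch; the real file also has wrapped continuation lines and repeated Description fields, which this ignores.

```python
SAMPLE = """\
%%
Type: language
Subtag: qya
Description: Quenya
Added: 2009-07-29
%%
Type: language
Subtag: sjn
Description: Sindarin
Added: 2009-07-29
"""


def parse_records(text):
    """Split '%%'-separated records into dicts of field name -> value."""
    records = []
    current = {}
    for line in text.splitlines():
        if line.strip() == "%%":
            # A separator closes the record collected so far, if any.
            if current:
                records.append(current)
            current = {}
        elif ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        records.append(current)
    return records


for record in parse_records(SAMPLE):
    print(record["Subtag"], "=", record["Description"])
```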

And yes, en-SG should probably be valid, and maybe en-MY (my family is from Johor, just over the causeway, and plenty of bits of it are in SG)... but what about en-CN? en-TW? cmn-BR? :-)

I guess one should just allow any language-region-script-variant combinations that are in the list, and not restrict things...
pne: A picture of a plush toy, halfway between a duck and a platypus, with a green body and a yellow bill and feet. (Default)

Re: Don't have a seed list. And language tag(s) should be available on individual posts/comments

[personal profile] pne 2012-06-21 07:02 am (UTC)(link)
I guess one should just allow any language-region-script-variant combinations that are in the list, and not restrict things...

+1
pne: A picture of a plush toy, halfway between a duck and a platypus, with a green body and a yellow bill and feet. (Default)

IETF language tags

[personal profile] pne 2012-06-21 05:21 am (UTC)(link)
I'm not 100% sure how the list of sub languages are maintained, e.g. en-GB, en-AU definitely exist, but is en-SG valid and who decides? Those don't appear to be in that list, but I haven't researched how the sub tags work. Oddly en-GB-oed (grandfathered) and en-scouse (redundant) appear to be listed directly...

Short answer: the odd tags are there for historical reasons from before language tags were done the way they are now (old: single token; new: nearly unlimited composition of subtags); "grandfathered" ones can't be expressed/analysed identically in the new system (but many such tags have a "Preferred-Value" showing how they can be expressed differently) while "redundant" ones can be; en-AU is not in the list explicitly because it can be composed from the language subtag "en" and the region subtag "AU" that are in the list.

Long answer: see BCP 47 (e.g. http://www.rfc-editor.org/rfc/bcp/bcp47.txt ). Or email me and I'll share what I know, as this is something I'm interested in and have spent a bit of time on, but probably shouldn't be hashed out entirely in comments on this entry.
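
The composition rule from the short answer can be sketched as below. The three sets are tiny hand-picked excerpts standing in for the registry's subtag lists, and the function name is made up; variants, three-digit regions, extensions and grandfathered tags are deliberately left out. Note that BCP 47 puts the script subtag before the region.

```python
# Tiny illustrative excerpts of the registry's subtag lists (not the full lists).
LANGUAGES = {"en", "de", "cmn", "sjn", "qya"}
SCRIPTS = {"latn", "cyrl", "teng"}
REGIONS = {"au", "gb", "sg", "ch"}


def is_composable(tag):
    """True if tag is language[-script][-region] built from known subtags."""
    parts = tag.lower().split("-")
    if parts[0] not in LANGUAGES:
        return False
    rest = parts[1:]
    # Optional script: four letters, and it must come before any region.
    if rest and len(rest[0]) == 4 and rest[0].isalpha():
        if rest[0] not in SCRIPTS:
            return False
        rest = rest[1:]
    # Optional region: two letters.
    if rest and len(rest[0]) == 2 and rest[0].isalpha():
        if rest[0] not in REGIONS:
            return False
        rest = rest[1:]
    # Anything left over (variants, misordered subtags) is rejected by this sketch.
    return not rest


print(is_composable("en-AU"))     # True
print(is_composable("en-SG"))     # True
print(is_composable("sjn-Teng"))  # True
```

This is why en-AU never needs its own registry entry: it passes as soon as "en" and "AU" are each listed.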
thorfinn: <user name="seedy_girl"> and <user name="thorfinn"> (Default)

Re: IETF language tags

[personal profile] thorfinn 2012-06-21 07:03 am (UTC)(link)
I checked out wikipedia and this reply and I am more enlightened. :-)

A very quick search engine poke doesn't turn up any BCP-47 jQuery/javascript pickers out there, but I'm thinking there's definitely value in a well coded one. :-)
thorfinn: <user name="seedy_girl"> and <user name="thorfinn"> (Default)

Some Implementation mumbling

[personal profile] thorfinn 2012-06-21 07:25 am (UTC)(link)
I'm thinking it may actually be worth hashing out some of this - this whole thing is starting to remind me of the X11 font picker, but I don't know enough about BCP47...

Ah. I began to see some of the inherent difficulties when I skimmed this part of BCP47...

   The choice of subtags used to form a language tag SHOULD follow these
   guidelines:

   1.  Use as precise a tag as possible, but no more specific than is
       justified.  Avoid using subtags that are not important for
       distinguishing content in an application.

       *  For example, 'de' might suffice for tagging an email written
          in German, while "de-CH-1996" is probably unnecessarily
          precise for such a task.

       *  Note that some subtag sequences might not represent the
          language a casual user might expect.  For example, the Swiss
          German (Schweizerdeutsch) language is represented by "gsw-CH"
          and not by "de-CH".  This latter tag represents German ('de')
          as used in Switzerland ('CH'), also known as Swiss High German
          (Schweizer Hochdeutsch).  Both are real languages, and
          distinguishing between them could be important to an
          application.


... So, yes, we do want to support the flexibility, but on the other hand, that means that people may do the wrong thing.

I begin to suspect that the right UI thing to do is definitely to have the full list available somehow as some kind of complicated picker on the user's profile page, including help text and/or FAQ link to common language choices, and then just allow people to choose from their own profile's shortlist as they post/comment (if they wish to).

I guess the majority of people will pick a top level language or three, and feel no need to delve into sub tags at all, so the picker should be built to support and encourage that behaviour.
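
Something like the following is what I have in mind for the suggestion part of the picker: match on either the tag or the readable name, and list plain top-level languages ahead of anything with sub tags. It's only a sketch, and the catalogue contents and function name are invented for the example.

```python
# Hypothetical catalogue: tag -> display name.
CATALOGUE = {
    "de": "German",
    "de-AT": "German (Austria)",
    "de-CH": "German (Switzerland)",
    "gsw-CH": "Swiss German (Switzerland)",
    "en": "English",
}


def suggest(typed, limit=5):
    """Suggest tags whose tag or name starts with the typed text."""
    needle = typed.strip().casefold()
    if not needle:
        return []
    hits = [
        tag
        for tag, name in CATALOGUE.items()
        if tag.casefold().startswith(needle) or name.casefold().startswith(needle)
    ]
    # Tags without a hyphen (no sub tags) sort first, then alphabetically by tag.
    hits.sort(key=lambda tag: ("-" in tag, tag))
    return hits[:limit]


print(suggest("ger"))    # ['de', 'de-AT', 'de-CH']
print(suggest("swiss"))  # ['gsw-CH']
```

Typing "swiss" surfacing gsw-CH rather than de-CH is the sort of thing the help text would still need to explain.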

People who want a region will probably be looking for it already.

Searching shouldn't be too hard - if you're searching for stuff in "en", "cmn", or "qya", then you should get all the sub tags, and if you're wanting specific sub tags then you can search for those.

I'm not entirely sure how the existing tag back end works for searching, but I imagine the BCP47 tags/sub tags could be integrated into that back end or a similar thing.
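
Without knowing the real back end, one way this could work is to index each journal under every prefix of its tags, so that a search for "en" is a single lookup. A sketch under that assumption, with made-up journal names and function names:

```python
from collections import defaultdict


def tag_prefixes(tag):
    """'de-CH-1996' -> ['de', 'de-ch', 'de-ch-1996'] (lower-cased)."""
    parts = tag.lower().split("-")
    return ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]


def build_index(journal_tags):
    """Map every tag prefix to the set of journals using a tag under it."""
    index = defaultdict(set)
    for journal, tags in journal_tags.items():
        for tag in tags:
            for prefix in tag_prefixes(tag):
                index[prefix].add(journal)
    return index


# Hypothetical journals and their declared languages.
JOURNALS = {
    "alice": ["en-AU", "qya"],
    "bob": ["en", "cmn"],
    "carol": ["de-CH"],
}
INDEX = build_index(JOURNALS)


def search(tag):
    """Journals whose languages fall under the given tag, sorted by name."""
    return sorted(INDEX.get(tag.lower(), set()))


print(search("en"))     # ['alice', 'bob']
print(search("en-AU"))  # ['alice']
```

The trade-off is a few extra index rows per journal in exchange for not scanning every tag at search time.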