Denise ([staff profile] denise) wrote in [site community profile] dw_biz 2012-06-19 09:23 pm

RFC: Specifying languages in profiles: how should we do it?

This entry is being posted on behalf of the programmer who is working on a bug that sprang from a suggestion to make it easy for people to find other people blogging in languages they speak. (They wrote it; I'm just posting it! I don't want to take credit for any of this.) We're looking for some thoughts on our current ideas and want to make sure we aren't missing something super obvious :)

Turning it over now:



Way way back in 2009, it was suggested that we create a way of stating which languages a journal uses. There's obvious advantages to this: it'll make it easier to find non-anglophone areas of Dreamwidth for those as want to, as a first or second or third etc language. This is a feature we're going to implement: what we want to know is how. (For those of you who are interested, there's a fair bit of discussion going on in the Bugzilla comments, but it's all repeated here.)

This post has been separated out into the separate areas that need refinement. In each section we outline our current thoughts: we'd love it if you gave us feedback on them, and we'd love it even more if you came up with an obvious better solution we've not thought of yet. :-)

The areas in question are:
  1. Languages field: usage
  2. Language entry options
  3. What does our standardised list look like?
  4. How do we choose the languages on our standardised list?
  5. How do we organise the list?
  6. Any Other Business


1. Languages field: usage


The original suggestion was for a field allowing people to state which languages they update in. Users are quite likely to enter languages they read in any such field: we need to think about how to handle this.

Options we can think of are:
  1. word the legend very carefully, and accept that the field will sometimes be used inaccurately
  2. provide separate "writes in" and "reads" boxes under an overall "Languages" heading
  3. if we go with 2, optionally include a "reads is the same as writes" (or "writes is the same as reads") tickybox, to reduce work for the user


2. Language entry options


Users will want to enter more than one language; therefore, this needs to be possible.

Ideally we would provide a standardised list so that we did not end up with confusion between "French", "french", "francais", "français" etc all pointing to different places, as they currently do if listed in interests.
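
To make the duplicate problem concrete, here is a minimal sketch of how free-text spellings could be folded onto one standard entry. The alias table and the function names are hypothetical, made up for illustration; this is not how Dreamwidth interests or profiles actually work.

```python
import unicodedata
from typing import Optional

# Hypothetical alias table: normalised spelling -> standard list entry.
ALIASES = {
    "french": "fr",
    "francais": "fr",
    "english": "en",
}

def normalise(name: str) -> str:
    """Case-fold and strip combining accents, so 'Français' and 'francais' collide."""
    decomposed = unicodedata.normalize("NFKD", name.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def standard_entry(name: str) -> Optional[str]:
    """Return the standard entry for a typed-in name, or None if it is unknown."""
    return ALIASES.get(normalise(name))
```

With this table, "French", "french", "francais" and "français" would all resolve to the same entry, while anything not in the table comes back as None and would need some other handling.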

However, this means we need to think about how to present this list. Some ideas that have been tossed around so far:
  1. provide check-boxes, possibly making the languages section collapsible. Advantage: easiest to select multiple options. Disadvantage: takes up a lot of space on the Manage Profile page, and will probably be edited much less frequently than e.g. interests or bio.
  2. provide a single drop-down, with the option to "add another" to produce another drop-down.
  3. provide a free-text box that behaves like the "Tags" field in the Create Entries page: it is possible to type in anything that you like, but you will be given suggestions from a standardised list and you'll be able to click "browse" to be shown the full list with check-boxes.
  4. implement one of (1) or (2), and provide a link reading something like "your language not listed?", which will reveal a free-text box for languages not on our standardised list.


Combinations of the above are naturally also possible, and we're very open to better ideas!

3. What does our standardised list look like?


As discussed above, we would like a standardised list. This gives us two problems: what our standard should be (this section), and which languages go on the list (section 4).

In comments in [site community profile] dw_suggestions, it was suggested that we use BCP-47 international language tags, wherein, for instance, "en-GB" is British English and "ta-SG" is Singaporean Tamil. It's also possible to specify scripts, allowing people to distinguish whether they're writing ru-Latn (Russian in Latin script) or ru-Cyrl (Russian in Cyrillic script).

Using only the BCP-47 tags is somewhat opaque, but they do allow a way to be very specific -- while also potentially allowing translation.

One possible model would be associating the BCP-47 tag names and the language names or descriptions in a more readable form, i.e. en would be associated with the string "English". (The ideal would be to make these associations such that if translations of Dreamwidth occur, it would be trivial to generate a file of language names in the target language while preserving the associations, e.g. "Anglais" would automatically map to en.)
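
A rough sketch of that association, assuming one name table per interface language, keyed by tag. The table contents, the fallback to English, and the function names are all assumptions for illustration only.

```python
from typing import Optional

# Hypothetical name tables: interface language -> {language tag -> display name}.
NAMES = {
    "en": {"en": "English", "fr": "French", "de": "German"},
    "fr": {"en": "Anglais", "fr": "Français"},
}

def display_name(tag: str, ui_lang: str) -> str:
    """Name of `tag` in the interface language, falling back to English, then the tag."""
    for table in (NAMES.get(ui_lang, {}), NAMES["en"]):
        if tag in table:
            return table[tag]
    return tag

def tag_for_name(name: str, ui_lang: str) -> Optional[str]:
    """Reverse lookup: which tag does this display name belong to?"""
    for tag, label in NAMES.get(ui_lang, {}).items():
        if label.casefold() == name.casefold():
            return tag
    return None
```

The stored value is always the tag; only the label changes with the interface language, so translating the site would mean adding one more table rather than touching anyone's profile.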

We also need to consider whether it would be useful to allow people to restrict their searches further. For example, a user searching for "de" (German) would have all results returned, including de-AT (Austrian German), de-CH (Swiss German), de-DE (German as used in Germany), de-1996 (German in the reformed 1996 orthography), etc - but do we want to allow these subtags, and therefore allow people to narrow their searches to only people writing in Swiss German?
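
One way this broad-or-narrow search could behave is simple prefix matching on subtag boundaries, similar in spirit to the "basic filtering" scheme described in RFC 4647. The sketch below assumes profiles are stored as a mapping from journal name to a list of tags; that layout and the function names are hypothetical.

```python
def matches(search_tag: str, profile_tag: str) -> bool:
    """True if profile_tag equals search_tag or extends it at a '-' boundary."""
    s = search_tag.lower()
    p = profile_tag.lower()
    return p == s or p.startswith(s + "-")

def search(search_tag, profiles):
    """Return the journals (sorted by name) with at least one matching tag."""
    return sorted(
        name
        for name, tags in profiles.items()
        if any(matches(search_tag, tag) for tag in tags)
    )
```

Under this rule a search for "de" finds de-AT, de-CH and plain de alike, while a search for "de-CH" finds only the Swiss German journals. Note that a plain prefix match would not treat de-CH as matching a tag with a script subtag in the middle, such as de-Latn-CH, so a real implementation might need something smarter.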

4. How do we choose the languages on our standardised list?


Of course, the biggie - and the one we'd most like input on - is how to choose the "seed list" languages in the first place. This is the area where - we think - we're most likely to muck up and (at best) create an unhelpful list, so it's where we'd most like your input.

Methods currently under consideration (as ever, please suggest more):
  1. Grab the top 15 countries from Dreamwidth's usage statistics, and use their official languages as the seed list. Short and sweet. Possibly too short; privileges English over native, regional, and indigenous languages in countries that were colonised (New Zealand is the notable exception to this). Would result in ~25 languages on the list.
  2. Grab the top 15 countries from Dreamwidth's usage statistics, and use their official languages and their recognised regional languages as the seed list. This would mean that for, say, the UK, in addition to English there would also be the options Irish, Scottish Gaelic, Ulster Scots, Welsh, Cornish, etc. This list would be much, much longer (probably 50-150 languages on the list).
  3. hold a poll in a [site community profile] dw_news post, and populate the seed list with any language that gets more than n votes (in which case, what should n be?)
  4. what kind of suggestions system should we have for adding new languages to the standard list?
  5. some combination of the above with the top 15 global languages, which, while it adds length, has the advantage that we're not privileging English quite so ridiculously.


5. How do we organise the list?


  • by language tag? i.e. en-GB, en-US, en-CA..?
  • by language name? i.e. English, French, Japanese, Tagalog...?
  • by both? i.e. English (en) --> en-GB/British English, en-CA/Canadian English; French (fr) --> fr-CA/Canadian French, fr-FR...?
  • by country? i.e. Canada: English, French; Singapore: English, Tamil, Chinese, Malay?


6. Any Other Business?


We're sure there's things we're forgetting - these questions are only what's come out of two people thinking about this on-and-off for two days, and more brains is better brains! Please, please let us know what we're missing, and let us know what you think the correct course of action among the options listed above should be.

[personal profile] rho 2012-06-20 03:02 am (UTC)(link)
Have an enormous list of languages. Display the top n. Then have an option for bringing up the full list if your language isn't displayed in the top n. Have the top n determined dynamically by the n languages that have the most users already.

Advantages:
Doesn't really matter what your main seed languages are, as they'll be overwritten by actual use.
Can accommodate and acknowledge a sudden influx of people keeping journals in, eg, Navajo without any administrative insight needed.
Is simple to use for most users, and actually possible to use for everyone.

Disadvantages:
Extra DB load to determine what the top n languages actually are. (Which I don't think would amount to much, but IANA DB engineer).
You'd have to make sure that the enormous list of languages was pretty damn comprehensive, which could be difficult.
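
A minimal sketch of the "top n by actual use" calculation, assuming profiles are available as a mapping from journal name to a list of language tags. That data layout and the function name are hypothetical, and a real version would presumably run as a database query rather than in memory.

```python
from collections import Counter

def top_languages(profiles, n):
    """Return the n most-used language tags, counting each journal once per tag."""
    counts = Counter(
        tag
        for tags in profiles.values()
        for tag in set(tags)  # ignore a tag listed twice on one profile
    )
    # Sort by descending count, then by tag so that ties come out in a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:n]]
```

Because the result depends only on what people have actually selected, the seed list stops mattering as soon as real usage data exists.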

[personal profile] delladea 2012-06-20 03:11 am (UTC)(link)
Have an enormous list of languages. Display the top n. Then have an option for bringing up the full list if your language isn't displayed in the top n. Have the top n determined dynamically by the n languages that have the most users already.

+1

[personal profile] montuos 2012-06-20 03:25 am (UTC)(link)
+1

Good idea!

[personal profile] pne 2012-06-20 04:52 am (UTC)(link)
+1

[personal profile] dhobikikutti 2012-06-20 06:29 am (UTC)(link)
Yes, this. Because I always resent the lists that just have English, but not Indian English.

[personal profile] azurelunatic 2012-06-20 08:51 am (UTC)(link)
I like this, with a side order of write-in if it's still not listed.

If Very Popular write-ins could be promoted to formal inclusion, this would help mitigate any problems with an insufficiently comprehensive list of languages. The process could be semi-automated to potentially filter out the "Baige" effect (referencing the XKCD color survey where that was one of the most uniquely masculine responses, the other four neither being colors nor appropriate in context, or indeed at all). Having a certain threshold for popularity (especially percentage) could help avoid it becoming a leaky canoe of a labor timesuck.

Top n language determination could be recalculated something like weekly, I think, which would keep the advantage of checking against actual reality, but remove the potential for real-time calculation overhead.

[personal profile] pne 2012-06-20 09:43 am (UTC)(link)
I like this, with a side order of write-in if it's still not listed.

Agreed - since we can't think of everything people would want to include.

(Such as "country-language" combinations such as the aforementioned "Indian English", which are representable in IETF language tags but as a combination of subtags rather than a single language subtag.)

filter out the "Baige" effect

This would be necessary, too; I agree - some sort of human filtering and/or consolidation should go on before a write-in is promoted to formal inclusion. Or perhaps even before being displayed, with the write-ins being more like suggestions than something that shows up in search or profile immediately.

FWIW, I’d volunteer to help with the language filtering if such a scheme were implemented.

[personal profile] azurelunatic 2012-06-20 09:50 am (UTC)(link)
I think a write-in should be able to go on the profile immediately, since it's that person's own profile and it doesn't make any sense to delay it there.

Since there's no need for human review on interest searches, and those are site-wide searchable, I don't see a need to cause any delay before making write-in languages site-wide searchable either.

[personal profile] pne 2012-06-20 10:37 am (UTC)(link)
OK, makes sense.

Though I wonder what happens if their write-in gets officialised, but differently: say it gets merged with an existing language (someone put in “francais”, which got merged with “fr – French”). What should happen to their profile then? Should it continue showing the write-in (because that’s what they chose) or should it show the “official” version (because that’s what’s easier to search for)?

If the write-in gets officialised as-is, I think there’s no problem.

[personal profile] naraht 2012-06-20 11:21 am (UTC)(link)
This is almost getting into an AO3 tags situation, where you have canonical tags and other tags which still display but are "synned" to the canonical ones for the purposes of searching.

...It's enormously complex to actually operate.

[personal profile] subluxate 2012-06-20 07:09 pm (UTC)(link)
+1

[personal profile] swaldman 2012-06-20 10:22 am (UTC)(link)
This, but it should also be possible for people to add new languages, because no pre-determined list will ever cover all options.

For instance, the list in the OP mentioned "recognised regional languages" for the UK and included Scots Gaelic, but did not include Scots. Some say that's simply a dialect of English, some say it's a language, but regardless the point is that some people will want to put languages that we don't anticipate[1], and I think that in the general spirit of DW they should be able to.

As soon as people are allowed to add things, then we get straight back to the issue of how to prevent duplicates. I don't know the solution to that; I suspect there isn't a perfect one, short of asking people to request new languages through a support ticket, but because most users will only ever want to add a new language once I think it's reasonable to ask them to jump through some hoops.


[1] This spawns another question of fictional languages. I'm sure there must be at least one user out there who posts in Klingon or Tolkien Elvish. My gut feeling is "sure, why not?".

[personal profile] pne 2012-06-20 10:39 am (UTC)(link)
I'm sure there must be at least one user out there who posts in Klingon or Tolkien Elvish.

If they went with IETF language tags or something else based on ISO 639, they’re covered: tlh Klingon; sjn Sindarin; qya Quenya :)

But your point still stands, of course; there’s no code (yet?) for Na'vi or Dothraki, for example.

[personal profile] tiferet 2012-07-02 10:48 pm (UTC)(link)
Is there enough of Na'vi or Dothraki extant to write in it? The thing about Klingon, Quenya & Sindarin is that enough of the language actually exists that it is possible to speak, read and write it. Don't forget Laadan (yes I know there's an accent in there but this machine won't do it in this programme).

[personal profile] pne 2012-07-03 07:02 am (UTC)(link)
Is there enough of Na'vi or Dothraki extant to write in it?

I don’t know. At a guess, I suspect there might be more Na'vi around (publicly) than Dothraki.

The thing about Klingon, Quenya & Sindarin is that enough of the language actually exists that it is possible to speak, read and write it.

*nods* AFAIK, one of the criteria for receiving an ISO code is that it must have received a certain amount of use (preferably by having a certain amount of literature in the language, though of course not all languages are written).

Don't forget Laadan (yes I know there's an accent in there but this machine won't do it in this programme).

That one has an ISO code :) ldn = Láadan. (Since 2009, according to the list.)

Language codes

[personal profile] thnidu 2012-07-18 02:26 pm (UTC)(link)
IETF codes can be much more complex and involve many optional components, potentially leading to effort wasted on whether or not a particular option is necessary. They may have been more stable than ISO 639-3; I do not know if future stability is guaranteed for ISO 639-3.

As a linguistic researcher at the Linguistic Data Consortium UPenn, I work with ISO 639-3 codes every day. They are maintained by the Ethnologue Languages of the World site. Their language name index page is headed
  Listing of 7413 primary names only.
  For 41,186 alternate names and dialect names use the site search.

Each of those codes specifies a single language -- or sometimes a dialect or group of dialects, since Nature doesn't draw thick sharp lines the way our naming customs pretend she did. Multiple alternate names can be linked to a single code--
  • Galician
    A language of Spain
    ISO 639-3: glg
    Population 3,170,000 in Spain (1986). Population total all countries: 3,185,000.
    Region Northwest Spain, Galicia Autonomous Region. Also in Portugal.
    Language map Portugal and Spain
    Alternate names Galego, Gallego

and homonymous names can be distinguished by the code:
  • Romani, Carpathian
    A language of Czech Republic
    ISO 639-3: rmc
    Also spoken in:
    • Poland
      Language name Romani, Carpathian
      Dialects Galician, Transylvanian.
    • Romania
      Language name Romani, Carpathian
      Dialects Galician, Transylvanian.


Furthermore, the ISO 639-3 codes can be extended for dialects and subgroups. The LINGUIST LIST maintains many of the extensions, as well as codes for additional extinct, ancient, historic, and constructed languages.
Edited 2012-07-18 14:27 (UTC)

Re: Language codes

[personal profile] pne 2012-07-18 05:17 pm (UTC)(link)
An impressive number of codes, yet I can’t seem to find one for “British English” in general—the codes I found were either too broad (e.g. “English”) or too narrow (e.g. “Scouse”). But even if you take a narrow option, I saw nothing for “RP” or “General American” at http://multitree.org/codes/by-letter/e.html.

[personal profile] pseudomonas 2012-06-20 11:34 am (UTC)(link)
For "languages that we don't anticipate", I think "file a request to add" would be reasonable, so long as the procedure to do so isn't too slow or too opaque.

[personal profile] kaberett 2012-06-21 11:20 am (UTC)(link)
Hum. Gut reaction: we should sort the top-n alphabetically rather than by popularity, for ease of finding.

[personal profile] green_knight 2012-06-21 09:00 pm (UTC)(link)
That might occasionally inconvenience people who are used to not being inconvenienced, but I'd say this is the most democratic solution.

[personal profile] glinda 2012-07-02 10:18 pm (UTC)(link)
If this is practically feasible then this option would get my vote.

[personal profile] inthetatras 2012-07-03 03:57 am (UTC)(link)
"Have an enormous list of languages. Display the top n. Then have an option for bringing up the full list if your language isn't displayed in the top n. Have the top n determined dynamically by the n languages that have the most users already."

This.

[personal profile] thomasneo 2012-07-03 09:41 am (UTC)(link)
+1

Whatever you do, please don't put Singlish and Singapore English together. They are two very different languages.
Edited 2012-07-04 09:27 (UTC)