dw_biz | RFC: Specifying languages in profiles: how should we do it?


This entry is being posted on behalf of the programmer who is working on a bug that sprang from a suggestion to make it easy for people to find other people blogging in languages they speak. (They wrote it; I'm just posting it! I don't want to take credit for any of this.) We're looking for some thoughts on our current ideas and want to make sure we aren't missing something super obvious :)

Turning it over now:

Way way back in 2009, it was suggested that we create a way of stating which languages a journal uses. There are obvious advantages to this: it'll make it easier to find non-anglophone areas of Dreamwidth for those who want to, whether as a first, second, or third (etc.) language. This is a feature we're going to implement: what we want to know is how. (For those of you who are interested, there's a fair bit of discussion going on in the Bugzilla comments, but it's all repeated here.)

This post has been separated out into the separate areas that need refinement. In each section we outline our current thoughts: we'd love it if you gave us feedback on them, and we'd love it even more if you came up with an obvious better solution we've not thought of yet. :-)

The areas in question are:

Languages field: usage
Language entry options
What does our standardised list look like?
How do we choose the languages on our standardised list?
How do we organise the list?
Any Other Business

1. Languages field: usage

The original suggestion was for a field allowing people to state which languages they update in. Users are quite likely to enter languages they read in any such field: we need to think about how to handle this.

Options we can think of are:

(1) word the legend very carefully, and accept that the field will sometimes be used inaccurately
(2) provide separate "writes in" and "reads" boxes under an overall "Languages" heading
(3) if we go with (2), optionally include a "reads is the same as writes" (or "writes is the same as reads") tickybox, to reduce work for the user

2. Language entry options

Users will want to enter more than one language; therefore, this needs to be possible.

Ideally we would provide a standardised list so that we did not end up with confusion between "French", "french", "francais", "français" etc all pointing to different places, as they currently do if listed in interests.

However, this means we need to think about how to present this list. Some ideas that have been tossed around so far:

(1) provide check-boxes, possibly making the languages section collapsible. Advantage: easiest to select multiple options. Disadvantage: takes up a lot of space on the Manage Profile page, and will probably be edited much less frequently than e.g. interests or bio.
(2) provide a single drop-down, with the option to "add another" to produce another drop-down.
(3) provide a free-text box that behaves like the "Tags" field in the Create Entries page: it is possible to type in anything that you like, but you will be given suggestions from a standardised list and you'll be able to click "browse" to be shown the full list with check-boxes.
(4) implement one of (1) or (2), and provide a link reading something like "your language not listed?", which will reveal a free-text box for languages not on our standardised list.

Combinations of the above are naturally also possible, and we're very open to better ideas!

3. What does our standardised list look like?

As discussed above, we would like a standardised list. This gives us two problems: what our standard should be (this section), and which languages to put on the list (section 4).

In comments in dw_suggestions, it was suggested that we use BCP-47 international language tags, wherein, for instance, "en-GB" is British English and "ta-SG" is Singaporean Tamil. It's also possible to specify scripts, allowing people to distinguish whether they're writing ru-Latn (Russian in Latin script) or ru-Cyrl (Russian in Cyrillic script).

Using only the BCP-47 tags is somewhat opaque, but they do allow a way to be very specific -- while also potentially allowing translation.

One possible model would be associating the BCP-47 tag names and the language names or descriptions in a more readable form, i.e. en would be associated with the string "English". (The ideal would be to make these associations such that if translations of Dreamwidth occur, it would be trivial to generate a file of language names in the target language while preserving the associations, e.g. "Anglais" would automatically map to en.)
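
A minimal sketch of that association, assuming one table of display names per interface language. The table contents and the function names (`display_name`, `tag_for_name`) are made up for illustration and are not taken from the Dreamwidth codebase:

```python
# Hypothetical per-interface-language tables, keyed by BCP-47 tag.
# The outer key is the language the site is being displayed in.
LANGUAGE_NAMES = {
    "en": {"en": "English", "fr": "French", "de": "German"},
    "fr": {"en": "anglais", "fr": "français", "de": "allemand"},
}


def display_name(tag, ui_locale="en"):
    """Return the readable name for `tag` in the given interface language.

    Falls back to the English table for unknown interface languages,
    and to the bare tag when no name is known.
    """
    names = LANGUAGE_NAMES.get(ui_locale, LANGUAGE_NAMES["en"])
    return names.get(tag, tag)


def tag_for_name(name, ui_locale="en"):
    """Map a readable name back to its tag, ignoring case."""
    wanted = name.casefold()
    for tag, label in LANGUAGE_NAMES.get(ui_locale, {}).items():
        if label.casefold() == wanted:
            return tag
    return None
```

Because profiles would store only the tag, translating the site means adding one more inner table; nothing stored on a profile has to change.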

We also need to consider whether it would be useful to allow people to restrict their searches further. For example, a user searching for "de" (German) would have all results returned, including de-AT (Austrian German), de-CH (Swiss German), de-DE (German as used in Germany), de-1901 (German in the traditional, pre-spelling-reform orthography), etc - but do we want to allow these subtags, and therefore allow people to narrow their searches for only people writing in Swiss German?
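
One way to picture the broad-versus-narrow search is as prefix matching on subtags, similar in spirit to the "basic filtering" described in RFC 4647. This is only a sketch: the function names and the shape of the `profiles` data are assumptions for illustration.

```python
def matches(search_tag, profile_tag):
    """True if profile_tag equals search_tag or extends it with more subtags.

    Comparison is done subtag by subtag, so "de" matches "de-CH"
    but does not match an unrelated tag that merely starts with "de".
    """
    search = search_tag.lower().split("-")
    profile = profile_tag.lower().split("-")
    return profile[:len(search)] == search


def search_profiles(search_tag, profiles):
    """Return the users (sorted) who list at least one matching tag.

    `profiles` is a hypothetical mapping of username -> list of tags.
    """
    return sorted(
        user
        for user, tags in profiles.items()
        if any(matches(search_tag, tag) for tag in tags)
    )
```

With this approach a search for "de" finds everyone, while a search for "de-CH" finds only people who chose that narrower tag; people who chose plain "de" are not returned for the narrow search.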

4. How do we choose the languages on our standardised list?

Of course, the biggie - and the one we'd most like input on - is how to choose the "seed list" languages in the first place. This is the area where - we think - we're most likely to muck up and (at best) create an unhelpful list, so it's where we'd most like your input.

Methods currently under consideration (as ever, please suggest more):

Grab the top 15 countries from Dreamwidth's usage statistics, and use their official languages as the seed list. Short and sweet. Possibly too short; privileges English over native, regional, and indigenous languages in countries that were colonised (New Zealand is the notable exception to this). Would result in ~25 languages on the list.
Grab the top 15 countries from Dreamwidth's usage statistics, and use their official languages and their recognised regional languages as the seed list. This would mean that for, say, the UK, in addition to English there would also be the options Irish, Scottish Gaelic, Ulster Scots, Welsh, Cornish, etc. This list would be much, much longer (probably 50-150 languages on the list).
hold a poll in a dw_news post, and populate the seed list with any language that gets more than n votes (in which case, what should n be?)
what kind of suggestions system should we have for adding new languages to the standard list?
some combination of the above with the top 15 global languages, which, while it adds length, has the advantage that we're not privileging English quite so ridiculously.

5. How do we organise the list?

by language tag? i.e. en-GB, en-US, en-CA..?
by language name? i.e. English, French, Japanese, Tagalog...?
by both? i.e. English (en) --> en-GB/British English, en-CA/Canadian English; French (fr) --> fr-CA/Canadian French, fr-FR...?
by country? i.e. Canada: English, French; Singapore: English, Tamil, Chinese, Malay?

6. Any Other Business?

We're sure there's things we're forgetting - these questions are only what's come out of two people thinking about this on-and-off for two days, and more brains is better brains! Please, please let us know what we're missing, and let us know what you think the correct course of action among the options listed above should be.


On the subject of how to choose the seed list languages, I recommend against #4, where you go with the top 15 global languages. While I understand about not wishing to over-privilege English, I think it makes more sense to draw from the most common languages people on DW actually use (whether you determine that by a poll or some other method).

While it may well be that those top 15 global languages intersect nicely with the languages used by DW users, it also might not, so I say go with what's used, not with a Meta-Planet-Earth approach.

(uses crop circle icon to represent extra-terrestrial viewpoint)

I would say that anyone who has ever set up a computer or phone and installed programs on it, has probably been asked to choose from the international language tags at least once.

As long as it's not something you have to select to get your account up and running, it's good to have that level of specificity.

With selecting languages, wouldn't it make more sense to use an option similar to filters instead of ticky boxes? Where you can mass select by pressing ctrl or shift, but it doesn't take as much space.

Don't do sub languages, and regional languages, unless by popular request. I think it will make this list too long and it won't be used. (There are 350-700 Australian Aboriginal language families.) I think periodically checking the freetext values is a pain, but highly desirable.

I wouldn't rely on Dreamwidth countries of origin, due to the migratory aspect of the world, although I think those languages should be prioritized. There are unlikely to be many people from China, but there are likely to be many who can read/write in simplified or traditional Chinese.

Dynamic controls for selecting languages that allow for drilling through regions to languages, i.e. South East Asia > Cambodia > Khmer | French | Cambodian French ?

This has the potential to be problematic. For example, I never quite know where people are going to categorize my country (Israel), and when you start going by "region" and not merely continent then i'd imagine other people may shift uncomfortably as well.

Yipe, good point and well put. Thank you.

Even going by continent can get you a few different answers for where Israel is, IME ;-)

That was precisely the point. Israel is alternatively assigned to Asia, Europe, Middle East and Other - I have, in fact, seen all of these around the web, and there are at least two of these I consider uninformed nearly to the point of offensiveness.

No, I do not want country-by-"geography" on DW. I really don't.

Ah, sorry, re-reading now I see that this is what you said.
I had read it as "continent is OK, but .il gets mis-categorised on 'region'". Whereas what you actually wrote was that .il gets mis-categorised even on "continent", and that many other places would with "region".

Apologies.

Have an enormous list of languages. Display the top n. Then have an option for bringing up the full list if your language isn't displayed in the top n. Have the top n determined dynamically by the n languages that have the most users already.

Advantages:
Doesn't really matter what your main seed languages are, as they'll be overwritten by actual use.
Can accommodate and acknowledge a sudden influx of people keeping journals in, eg, Navajo without any administrative insight needed.
Is simple to use for most users, and actually possible to use for everyone.

Disadvantages:
Extra DB load to determine what the top n languages actually are. (Which I don't think would amount to much, but IANA DB engineer).
You'd have to make sure that the enormous list of languages was pretty damn comprehensive, which could be difficult.
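
For what it's worth, the "top n by actual use" part is small. Here is a rough sketch, assuming each profile's languages are available as a list of tags; the function name and data shape are invented for illustration:

```python
from collections import Counter


def top_languages(profile_languages, n):
    """Return the n most-used language tags across all profiles.

    `profile_languages` is an iterable with one list of tags per profile.
    Each profile counts at most once per language, and ties are broken
    alphabetically so the result is stable between runs.
    """
    counts = Counter(
        tag for tags in profile_languages for tag in set(tags)
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:n]]
```

The result could be computed on a schedule and stored, rather than worked out on every page load, which would keep the extra DB load low.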

+1

Good idea!

Yes, this. Because I always resent the lists that just have English, but not Indian English.

I like this, with a side order of write-in if it's still not listed.

If Very Popular write-ins could be promoted to formal inclusion, this would help mitigate any problems with an insufficiently comprehensive list of languages. The process could be semi-automated to potentially filter out the "Baige" effect (referencing the XKCD color survey where that was one of the most uniquely masculine responses, the other four neither being colors nor appropriate in context, or indeed at all). Having a certain threshold for popularity (especially percentage) could help avoid it becoming a leaky canoe of a labor timesuck.
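
A sketch of the semi-automated part, assuming write-ins are stored as raw strings. The thresholds (`min_count`, `min_share`) and the function name are placeholders, not agreed values, and anything this returns would still go to a human for review:

```python
from collections import Counter


def promotion_candidates(write_ins, total_users, min_count=25, min_share=0.001):
    """Return write-in names popular enough to be reviewed for inclusion.

    A name qualifies only if it clears both an absolute count and a
    share of all users. Names are compared case-insensitively with
    surrounding and repeated whitespace collapsed.
    """
    if total_users <= 0:
        return []
    counts = Counter(" ".join(entry.split()).casefold() for entry in write_ins)
    return sorted(
        name
        for name, count in counts.items()
        if count >= min_count and count / total_users >= min_share
    )
```

One-off joke entries never reach the threshold, so reviewers only see names that a meaningful number of people have typed.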

Top n language determination could be recalculated something like weekly, I think, which would keep the advantage of checking against actual reality, but remove the potential for real-time calculation overhead.

I like this, with a side order of write-in if it's still not listed.

Agreed - since we can't think of everything people would want to include.

(Such as "country-language" combinations such as the aforementioned "Indian English", which are representable in IETF language tags but as a combination of subtags rather than a single language subtag.)

filter out the "Baige" effect

This would be necessary, too; I agree - some sort of human filtering and/or consolidation should go on before a write-in is promoted to formal inclusion. Or perhaps even before being displayed, with the write-ins being more like suggestions than something that shows up in search or profile immediately.

FWIW, I’d volunteer to help with the language filtering if such a scheme were implemented.

I think a write-in should be able to go on the profile immediately, since it's that person's own profile and it doesn't make any sense to delay it there.

Since there's no need for human review on interest searches, and those are site-wide searchable, I don't see a need to cause any delay before making write-in languages site-wide searchable either.

OK, makes sense.

Though I wonder what happens if their write-in gets officialised, but differently: say, merged with an existing language (someone put in “francais”, which got merged with “fr – French”). What should happen to their profile then? Should it continue showing the write-in (because that’s what they chose) or should it show the “official” version (because that’s what’s easier to search for)?

If the write-in gets officialised as-is, I think there’s no problem.

This is almost getting into an AO3 tags situation, where you have canonical tags and other tags which still display but are "synned" to the canonical ones for the purposes of searching.

...It's enormously complex to actually operate.
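
To make the "display what they typed, search on the canonical tag" idea concrete, here is a very small sketch. The synonym table and function names are invented for illustration, and the hard part (people deciding which write-ins are synonyms of what) is exactly the bit the code does not do:

```python
import unicodedata


def normalise(text):
    """Fold case and strip accents so "Français" and "francais" compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.strip().casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical synonym table, maintained by whoever does the wrangling.
SYNONYMS = {
    normalise(name): tag
    for name, tag in [
        ("French", "fr"),
        ("français", "fr"),
        ("English", "en"),
        ("anglais", "en"),
    ]
}


def canonical_tag(write_in):
    """Return the canonical tag for a write-in, or None if it is unknown."""
    return SYNONYMS.get(normalise(write_in))


def profile_entry(write_in):
    """Keep the user's own wording for display; search uses the tag."""
    return {"display": write_in, "search_tag": canonical_tag(write_in)}
```

Stripping accents is a blunt instrument and could merge names that are genuinely different in some languages, so a real table would probably need explicit entries rather than relying on folding alone.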

This, but it should also be possible for people to add new languages, because no pre-determined list will ever cover all options.

For instance, the list in the OP mentioned "recognised regional languages" for the UK and included Scots Gaelic, but did not include Scots. Some say that's simply a dialect of English, some say it's a language, but regardless the point is that some people will want to put languages that we don't anticipate[1], and I think that in the general spirit of DW they should be able to.

As soon as people are allowed to add things, then we get straight back to the issue of how to prevent duplicates. I don't know the solution to that; I suspect there isn't a perfect one, short of asking people to request new languages through a support ticket, but because most users will only ever want to add a new language once I think it's reasonable to ask them to jump through some hoops.

[1] This spawns another question of fictional languages. I'm sure there must be at least one user out there who posts in Klingon or Tolkien Elvish. My gut feeling is "sure, why not?".

I'm sure there must be at least one user out there who posts in Klingon or Tolkien Elvish.

If they went with IETF language tags or something else based on ISO 639, they’re covered: tlh Klingon; sjn Sindarin; qya Quenya :)

But your point still stands, of course; there’s no code (yet?) for Na'vi or Dothraki, for example.

Is there enough of Na'vi or Dothraki extant to write in it? The thing about Klingon, Quenya & Sindarin is that enough of the language actually exists that it is possible to speak, read and write it. Don't forget Laadan (yes I know there's an accent in there but this machine won't do it in this programme).

Is there enough of Na'vi or Dothraki extant to write in it?

I don’t know. At a guess, I suspect there might be more Na'vi around (publicly) than Dothraki.

The thing about Klingon, Quenya & Sindarin is that enough of the language actually exists that it is possible to speak, read and write it.

*nods* AFAIK, one of the criteria for receiving an ISO code is that it must have received a certain amount of use (preferably by having a certain amount of literature in the language, though of course not all languages are written).

Don't forget Laadan (yes I know there's an accent in there but this machine won't do it in this programme).

That one has an ISO code :) ldn = Láadan. (Since 2009, according to the list.)

IETF codes can be much more complex and involve many optional components, potentially leading to effort wasted on whether or not a particular option is necessary. They may have been more stable than ISO 639-3; I do not know if future stability is guaranteed for ISO 639-3.

As a linguistic researcher at the Linguistic Data Consortium UPenn, I work with ISO 639-3 codes every day. They are maintained by the Ethnologue Languages of the World site. Their language name index page is headed
Listing of 7413 primary names only.
For 41,186 alternate names and dialect names use the site search.
Each of those codes specifies a single language -- or sometimes a dialect or group of dialects, since Nature doesn't draw thick sharp lines the way our naming customs pretend she did. Multiple alternate names can be linked to a single code--

Galician
A language of Spain
ISO 639-3: glg
Population 3,170,000 in Spain (1986). Population total all countries: 3,185,000.
Region Northwest Spain, Galicia Autonomous Region. Also in Portugal.
Language map Portugal and Spain
Alternate names Galego, Gallego

and homonymous names can be distinguished by the code:

Romani, Carpathian
A language of Czech Republic
ISO 639-3: rmc
Also spoken in:
- Poland
  Language name Romani, Carpathian
  Dialects Galician, Transylvanian.
- Romania
  Language name Romani, Carpathian
  Dialects Galician, Transylvanian.

Furthermore, the ISO 639-3 codes can be extended for dialects and subgroups. The LINGUIST LIST maintains many of the extensions, as well as codes for additional extinct, ancient, historic, and constructed languages.

Edited 2012-07-18 14:27 (UTC)

An impressive number of codes, yet I can’t seem to find one for “British English” in general—the codes I found were either too broad (e.g. “English”) or too narrow (e.g. “Scouse”). But even if you take a narrow option, I saw nothing for “RP” or “General American” at http://multitree.org/codes/by-letter/e.html.

For "languages that we don't anticipate", I think "file a request to add" would be reasonable, so long as the procedure to do so isn't too slow or too opaque.

Hum. Gut reaction: we should sort the top-n alphabetically rather than by popularity, to make things easier to find.

That might occasionally inconvenience people who are used to not being inconvenienced, but I'd say this is the most democratic solution.

If this is practically feasible then this option would get my vote.

"Have an enormous list of languages. Display the top n. Then have an option for bringing up the full list if your language isn't displayed in the top n. Have the top n determined dynamically by the n languages that have the most users already."

This.

+1

Whatever you do, please don't put Singlish and Singapore English together. They are two very different languages.

Edited 2012-07-04 09:27 (UTC)

1. Is it actually necessary to have a "reads" setting? Perhaps I'm being dense here, but when the goal is to find languages written, I can't think of a good reason to record languages read, other than to have an "exclude languages not on my reads-list" search. I think maybe simply labelling it "Posting language" or something like that might suffice.

2. I really like the tag-style entry box idea. Unless you typo egregiously, the search function works really well to pull up anything even remotely similar.

3. I really like the idea of using the BCP-47 tags, as long as they are used in conjunction with actual human-readable language names. The completist data-wrangler in me always desires being able to sort with the highest granularity, although I can see a certain utility in restricting to language families.

4. I think holding a poll would be the best way to find out which languages are preferred and posted in by actual users. And given that English is the world's greatest second language, I think English should simply be assumed to be used, and ignored for purposes of this poll. Maybe take the official languages of the top 15 DW-stats countries plus the top 15 global languages for a ticky-box poll, plus a text box for additional write-in votes?

5. I think it would be a kindness to list languages alphabetically by their names in that language, e.g. English (en), Español (es), Français (fr), etc. (optionally putting language tags after the language names). I acknowledge that it would probably be more of a pain to do it that way, though.

6. It occurs to me that it might be a good idea to take the language from the browser headers and use that as the default unless and until people choose to set their language(s).
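
The browser-header idea in point 6 refers to the HTTP Accept-Language header. A rough sketch of reading it, using only string handling; the function name is made up, and a production parser would want stricter validation than this:

```python
def parse_accept_language(header):
    """Return the language tags from an Accept-Language header, best first.

    Each comma-separated part is a tag with an optional ";q=" weight
    (default 1.0). Wildcards and parts with a weight of zero are skipped.
    Parts with equal weight keep the order they had in the header.
    """
    weighted = []
    for position, part in enumerate(header.split(",")):
        pieces = part.strip().split(";")
        tag = pieces[0].strip()
        if not tag or tag == "*":
            continue
        quality = 1.0
        for param in pieces[1:]:
            key, _, value = param.strip().partition("=")
            if key.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0  # unreadable weight: treat as "not wanted"
        if quality > 0:
            weighted.append((-quality, position, tag))
    return [tag for _, _, tag in sorted(weighted)]
```

The first tag in the result could pre-fill the language field as a suggestion. Note that the header says what someone's browser is set to, which is not necessarily what they write in, so it is probably best treated as a default the user can change.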

re 1) It could be useful to know for people to comment with the journal owner in a language other than the one the entry is written in. For example I post in English for the benefit of international readers, but my native language is German. That is not readily apparent from my profile, which is focused on content. If another German-speaker who reads English and follows my journal felt more comfortable commenting in German because they spoke that better, they could see that I have no trouble reading that, whereas now they might decide to forgo communicating in a foreign language.

Yes, this.

This. It might also make it slightly less creepy when someone like me - who does not speak Russian and thus never comments on Russian-language journals - subscribes to them. I can't respond, but I would be comfortable with someone leaving Russian comments in my journal. (I might need help understanding them... but I can make out some of it.)

The language option could be something like 'this journal is written in XXX' to make it clear what you're asking for.

1. How about doing it backwards? Have the primary indication be all the languages one uses/is conversant in, with the option to tick (or similarly indicate) languages in which one actively blogs. Languages are a matter of identity, that's why people will always indicate the languages they speak/are conversant in/are trying to study.

e.g. I rarely blog in Hebrew - but that i'm a (native) Hebrew speaker is important information for potential subscribers, and I know I have a significant readership that showed up at my DW pretty much for these reasons. (Much to my initial surprise, I should say.)

2. I rather prefer option #3, the tag-like one. Inclusive and preserves screen real estate.

3. I would very strongly prefer to not have something like those standardized tags be user-visible as a default. I disagree with the user who says that "anyone who has ever set up a computer or phone and installed programs on it, has probably been asked to choose from the international language tags at least once." I've been the effective first-line PC techie for family and workplaces and I have never seen that list. I wouldn't even reliably recognize Hebrew on it, and I suspect people from other non-Latin-script languages may have similar issues. It also, well, doesn't have a good feel: "We couldn't figure out a human-friendly way to do this, so we went the computer-friendly way."

5. I don't like the idea of anchoring language to country. One problem is Diaspora languages. Think of - oh - Arab diasporas in the US and the EU. Or Jewish diasporas anywhere. And I can think of several others off the top of my head. People would still be using their non-resident-country language, and searching for people based on that language, for identity reasons.

And so long as it's searchable by something easily human-recognizable. The codes can be there; maybe it's easier for some people to scan for the codes. But coming from non-Latin-script languages, the codes can be... more effortful than this should be, particularly if you're trying to make it homey for non-English speakers.

I definitely agree with you on #1 - I'm Norwegian, but I mostly (always?) post in English because I can and I know I'll reach a lot more people if they feel like reading. However, I'd love for this to be a way for me to find other Norwegians on DW, and (like mentioned above somewhere) to let them know they could comment to my blog in any of those languages.

For the same reason I agree with #5 - I speak English, but I'm not from any English-speaking country. There are way too many people who live (permanently or for shorter periods) in other countries.

EDIT: I'm not sure where else to put this, as it's sort of connected to language identity but not entirely... would it be possible to change the 'Location' field in the profile to include more than one country? And perhaps also include a 'Nationality' field if people want that? Or something else that would be less specific but indicate a connection to more than one country?

I'm thinking along the lines of identification here, like we're talking about with languages: people may live in one country/place and consider themselves to be another nationality, and some might want to indicate that.

My own case is a little different but also relevant, as I study in one country and spend my summer/holidays in another, and at the moment I consider both those countries my home. It's strange to have to change my location back and forth when I feel like I have a connection to both countries all the time, and it's also a bit of work. :~D

Edited 2012-07-04 13:04 (UTC)

As I read the original suggestion, this:

Have a field in which you select the language/s you speak; people speaking the same language can find each other more easily.

leads me to a different conclusion than this:

The original suggestion was for a field allowing people to state which languages they update in. Users are quite likely to enter languages they read in any such field: we need to think about how to handle this.

as it seems the OP was hoping to establish connections, not just with people who update in a particular language, but also speak a language.

Morphing it into a set of languages one updates is not necessarily a bad thing, but as others pointed out, languages spoken can definitely be an identity-level thing.

It might further aid the OP to have a way of setting what language a post is written in. This might be considered a future enhancement.

I like the idea of presenting the list as an unfolding set of checkboxes; keep it hidden at first, then unfold the more commonly used languages, then further expand to all languages; these should be clearly marked (emulating the cut tag feature would make for a consistent interface across the site).

In addition to the checkbox list, a write-in space should be allowed, as I can see people posting in conlangs, possibly (Toki Pona springs to mind as something potential journalists might use). This might be a future enhancement as well.

I just noticed the odd way it formatted my ordered list of points there... is it supposed to do that?

If you mean the extra spaces between the blocks, you might want to choose "More Options" and then "Don't auto-format".

Comments automatically turn carriage returns into line breaks.

When Dreamwidth auto-formats your comment, it automatically adds a linebreak for every newline, which happens even when you use HTML tags.

You can disable it for a comment by clicking More Options and ticking the "Don't auto-format" box. Your comment is then interpreted without inserting newlines automatically (and URLs won't be automatically turned into links).

My preferred system, though, is to delete the newlines manually after doing the text.

Edited 2012-06-20 11:39 (UTC)

Ah, okay, that's good to know. I assumed it only did that when there were two consecutive line breaks. The other oddity was the numbering: 01., 02., etc -- the leading zero was initially what struck me as odd.

It's not doing that on my browser. Not sure why your browser might be doing it!

Somewhere in the theme hierarchy, this is set:

ol {
    list-style-type: decimal-leading-zero;
}

(Line 557 on the streamlined CSS that is emitted.)
I added a snippet of custom CSS that put it back to decimal.

Ah! I hadn't realised you were viewing this in your style. That makes sense!

Glad you got it sorted :D

1. I read it the way I did because of the title of the suggestion, "Specify blogging language in user profile." I'm well aware that languages spoken can be an identity-level thing; this is why I'm so keen for us to have "languages read" as well as "languages written in" (though I'm very open to having the precise wording of that changed!).

2. Mmm. I don't think that's something I'm going to try to shoehorn into this, but will stick it on my List Of Spec/talk to Fu about it/open a bug for it once I've got the first bit done.

Just a thought: would it be worth looking at how the AO3 does its language tagging?

I think it would - it might require some dedicated tag wranglers but it would present the user both with a standardised system (start writing 'eng' and 'English' is suggested) and the option to add new languages that aren't in the system (that could then be wrangled).

The AO3 system also connects all versions of the same tag; you could write 'English' or 'anglais' and both would link to the same thing. This might be a bit of work, but as there are a limited number of languages (and spelling mistakes etc) the workload should decrease pretty quickly after the first manic implementation.

Be careful of what you wish for. ISO 639-3 "ang" = "Old English" (Anglo-Saxon).

This is a bit brainstormy but:

Can we tag posts as they're made, rather than users? So rather than searching for "a user who says they post in Dutch", you search for "a user who has written at least one post tagged as being in Dutch". Sensible provision of defaults should mean that tagging a post as "the same language as my last post" is zero effort, and "the same language as a post I've written recently" should be minimal effort.
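
A sketch of how the "zero effort" default and the search might fit together, assuming each post carries exactly one language tag. The `Journal` class, its fields, and `users_posting_in` are invented for illustration and do not reflect how Dreamwidth stores entries:

```python
from dataclasses import dataclass, field


@dataclass
class Journal:
    user: str
    default_language: str = "en"  # hypothetical account-level default
    posts: list = field(default_factory=list)  # (text, language tag) pairs

    def add_post(self, text, language=None):
        """Add a post; with no language given, reuse the last post's language.

        The account default is only used for the very first post.
        Returns the language tag that was recorded.
        """
        if language is None:
            language = self.posts[-1][1] if self.posts else self.default_language
        self.posts.append((text, language))
        return language


def users_posting_in(journals, tag):
    """Users (sorted) with at least one post tagged with exactly `tag`."""
    return sorted(
        journal.user
        for journal in journals
        if any(language == tag for _, language in journal.posts)
    )
```

Someone who always writes in one language never has to touch the setting, and someone who switches only has to change it on the post where the switch happens.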

on point three - languages should probably be as wide as possible, rather than including large numbers of mutually-intelligible regional variations. I can see why for various reasons people might want that latter info, but I don't think it'll help usability. Possibly allow tagging of variations, but search for them as the more general case.

Please, for the love of all that's holy, avoid conflating languages and countries, *especially* please avoid the use of national flags to indicate languages.

For posts of more than a couple of words, written in a single language, it may well be feasible to automatically identify the language. ISTR there are CPAN modules that do this already.

languages should probably be as wide as possible, rather than including large numbers of mutually-intelligible regional variations. I can see why for various reasons people might want that latter info, but I don't think it'll help usability. Possibly allow tagging of variations, but search for them as the more general case.

Good point: As a feature this will be less useful if everybody claims to be writing in a subtly different language to everybody else! But, people may want these variations as an identity thing.

Maybe some sort of approach along the lines of the mood-icon dropdown, where people can select a (broad) language from a list but then customise the text that's used to describe it? That way people can display what they want but the broader category is used for searching? I'm sure there would still be awkward edge-cases, but...

Edited 2012-06-20 11:44 (UTC)

I know people may want to use it to express their identity, but I think as far as possible we should be in the business of dealing with communication and intelligibility.

There are cases that may be more relevant to distinguish - so languages that can be written in various character sets (Azerbaijani in Latin characters vs Azerbaijani in Cyrillic characters vs Azerbaijani in Arabic characters) it may be well worth searching separately. But this might be something we need to find out by trial-and-error.

I agree, basically, with your analogy to mood-icons - let people display it however they like, but have metadata that's in a reasonably restricted format for ease of searching.
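To make the mood-icon analogy concrete, here is a rough Python sketch of the "free display text, restricted metadata" idea. The names `LanguageChoice` and `journals_using` and the sample journals are all made up for illustration; nothing here reflects how Dreamwidth actually stores profile data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageChoice:
    code: str        # broad, searchable language code picked from the list, e.g. "en"
    label: str = ""  # optional free text the user wants displayed instead

    def display(self) -> str:
        # Fall back to the plain code when no custom label was given.
        return self.label or self.code


def journals_using(code: str, journals: dict[str, list[LanguageChoice]]) -> list[str]:
    """Return the journals that list the broad code, ignoring custom labels."""
    return sorted(
        name
        for name, choices in journals.items()
        if any(choice.code == code for choice in choices)
    )


# Hypothetical sample data.
journals = {
    "alice": [LanguageChoice("en", "Scouse")],
    "bob": [LanguageChoice("en"), LanguageChoice("de", "Schwäbisch")],
    "carol": [LanguageChoice("es")],
}

print(journals_using("en", journals))
print(journals["alice"][0].display())
```

Searching only ever looks at `code`, so people can label their language however they like without fragmenting the search results.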

Edited 2012-06-20 11:51 (UTC)

Maybe some sort of approach along the lines of the mood-icon dropdown, where people can select a (broad) language from a list but then customise the text that's used to describe it?

That sounds valid.

Also posts should be able to be tagged with multiple languages, in case someone is switching back and forth or including text with a translation.

ISTR there are CPAN modules that do this already.

http://search.cpan.org/~ambs/Lingua-Identify-0.51/lib/Lingua/Identify.pm is one such.

Though according to the documentation it only "knows" 33 languages, which is really not a lot.

Indeed. It looks reasonably easy to train more languages, given suitable corpora, though.
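For a feel of what "training" on a corpus can mean, here is a toy Python sketch of one common approach: character-trigram profiles compared by cosine similarity. This is an assumption for illustration only, not a description of how Lingua::Identify works internally, and the two one-sentence "corpora" are far too small for real use.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text: str) -> Counter:
    """Count overlapping character trigrams in lowercased, space-padded text."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(corpora: dict[str, str]) -> dict[str, Counter]:
    """'Training' here is just building one trigram profile per language."""
    return {lang: trigram_profile(sample) for lang, sample in corpora.items()}


def identify(text: str, profiles: dict[str, Counter]) -> str:
    """Return the language whose profile is most similar to the text."""
    probe = trigram_profile(text)
    return max(profiles, key=lambda lang: cosine(probe, profiles[lang]))


# Tiny made-up corpora; a real profile would need far more text per language.
profiles = train({
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y luego el perro duerme al sol",
})

print(identify("the dog is in the garden", profiles))
print(identify("el perro está en el jardín", profiles))
```

Adding a language is then a matter of adding another corpus to `train`, which is why coverage mostly depends on finding suitable text.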

I'm thinking a bit about question one. If the goal is to help people find content in their language, maybe this is a setting that should exist on a per-entry basis, with an account-level "default post language" setting. So, for instance, my account could default to English, but if I wrote an entry in Spanish I could toggle just that post over.

To get *super* fancy, maybe the account settings could include both "default language" and "options to include on my Create Entry page". That'd allow one comprehensive set, while making the day-to-day process of language selection for multilingual bloggers much simpler.
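As a sketch of how the two account settings and the per-entry override could fit together (in Python, with invented names like `Account`, `Entry` and `picker_options`; this is not Dreamwidth's real data model):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Account:
    default_language: str
    # Extra languages the user wants offered on their Create Entry page.
    entry_page_languages: list[str] = field(default_factory=list)


@dataclass
class Entry:
    text: str
    # None means "nothing chosen for this entry, use the account default".
    languages: Optional[list[str]] = None


def effective_languages(entry: Entry, account: Account) -> list[str]:
    """Languages an entry counts as: its own choice, else the account default."""
    if entry.languages:
        return list(entry.languages)
    return [account.default_language]


def picker_options(account: Account) -> list[str]:
    """Default language first, then the chosen extras, without duplicates."""
    options = [account.default_language]
    for code in account.entry_page_languages:
        if code not in options:
            options.append(code)
    return options


account = Account("en", ["es", "en", "fr"])
print(picker_options(account))
print(effective_languages(Entry("Hello!"), account))
print(effective_languages(Entry("¡Hola!", ["es"]), account))
```

Storing a list per entry rather than a single value leaves room for entries written in more than one language.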

One more thought, on the ordering of languages on the list presented to everyone: I can see using an all-purpose ordering to start with, but once the feature is live, could it be sorted by frequency of use among Dreamwidth Users?

That bothers me - I'd prefer alphabetical (possibly by language-name-in-language, possibly by language-name-in-English), because sorting by popularity will mean people have the very DEVIL of a time finding what they're looking for if they want to scroll through manually. Does that make sense?

You're right, it could get very unwieldy at a large scale.

What about defaulting to show, eg, top 15 languages by DW usage, and then having a "more" button that has all languages, alphabetically?
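Roughly, in Python, that "top N first, everything alphabetical behind a more button" idea might look like the sketch below. The function name `picker_lists` and the usage numbers are invented for the example.

```python
from collections import Counter


def picker_lists(usage: Counter, names: dict[str, str], top_n: int = 15):
    """Return (short list by usage, full list sorted alphabetically by name)."""
    # Short list: the most-used language codes, most popular first.
    top = [code for code, _ in usage.most_common(top_n)]
    # Full list: every known language, ordered by its display name.
    everything = sorted(names, key=lambda code: names[code].casefold())
    return top, everything


# Made-up usage counts and display names.
usage = Counter({"en": 900, "ru": 120, "es": 80, "de": 40})
names = {"en": "English", "ru": "Russian", "es": "Spanish", "de": "German", "cy": "Welsh"}

top, everything = picker_lists(usage, names, top_n=2)
print(top)
print(everything)
```

The full list includes languages nobody has used yet (Welsh in the sample data), so rarely used languages stay findable even though they never reach the short list.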

Instead of toggling, checkboxes would be nice for those of us who would use three or more languages in a post if we weren't worried about alienating our audience.

From the user viewpoint: if you ask me what language I post in, I'm going to say "English," not "US English." And not just because I'm not thinking in dialect terms, but because it's the wrong level of specificity: there's a bunch of New York in my English, along with some random UK usages I picked up from friends and reading. If I'm looking for things posted in English, I want it to find posts by people who define their English as Canadian or New Zealand or British, not just U.S. English. And if I wanted Spanish, I wouldn't have a preference for Dominican over Castilian or vice versa.

One doesn’t have to exclude the other; wanting to find text tagged as “US English” when searching for text in “English” is not unusual.

So users could be vague (just language) or more specific (e.g. adding a region, dialect, or orthography), while searches could either be encompassing (e.g. just language, which would also find more specifically-tagged variants) or narrow (which would only find specifically-tagged ones).

BCP-47 looks like a source that might be about as sane as you can get, and has the benefit that you don't have to try to maintain it yourself, and thereby get it wrong in a whole bunch of exciting and different ways. I'm not 100% sure how the list of sub languages are maintained, e.g. en-GB, en-AU definitely exist, but is en-SG valid and who decides? Those don't appear to be in that list, but I haven't researched how the sub tags work. Oddly en-GB-oed (grandfathered) and en-scouse (redundant) appear to be listed directly...

As far as a "seed list" goes - why? Don't have one. Have the entire BCP-47 list with sub-region tags (displayed as your "both" format above), a text box, and auto-suggest typing or a popup-selection once the typing is enough. Especially if you're doing it on the profile, this is a once off thing being done for each journal. Shortlisting from a large list is inherently ugly, and you don't need to optimise it away because this is essentially a rare operation. Don't fix a problem that isn't a problem. :-)

This thing should apply to individual posts/comments?, to identify the language(s) the post/comment is written in, if desired. After all, most of my journal is in en-AU, but I might want to write in pig-latin or tengwar (is tengwar a script only? I can't find a matching language tag?) for some reason.

On posts (and comments?), yes, you should have a short list - but you can pull that short list directly from the profile of the user doing the posting/commenting, and you know that's a useful shortlist because that particular human picked it.

Tengwar is a script, I believe; you can write Quenya and Sindarin in it, IIRC. (Also, the language of Mordor - remember the Ring?) Probably there are ways to write a lot of languages using Tengwar.

As I have Singaporean friends, I have to posit that en-SG is entirely valid; they have some manners of phrasing that most closely resemble en-GB (for obvious reasons), but the sentence structure can be very influenced by Chinese, and they have some ways of emphasising that I don't see anyone not from that general region using. It's entirely cross-understandable, but it is its own dialect, inasmuch as there are differences between English in the UK, US, and Australia/NZ/etc., and we're not touching sub-dialects because yes, we know. :P :)

Edited (Spelling, I can has it?) 2012-06-21 03:17 (UTC)

Tengwar is a script, I believe; you can write Quenya and Sindarin in it, IIRC. (Also, the language of Mordor - remember the Ring?) Probably there are ways to write a lot of languages using Tengwar.

*nods* A script. In addition to the languages you mention, there are also a couple of “modes” for writing English (used on the title page of LotR, for example, IIRC).

It even has an ISO 15924 script tag “Teng”, so you could mark an entry as being in sjn-Teng (Sindarin, written in Tengwar), or even en-Teng-AU (Australian English, written in Tengwar; BCP 47 puts the script subtag before the region), since BCP 47 uses ISO 15924 as one of its sources.

It even has an ISO 15924 script tag “Teng”, so you could mark an entry as being in sjn-Teng (Sindarin, written in Tengwar), or even en-Teng-AU (Australian English, written in Tengwar; BCP 47 puts the script subtag before the region), since BCP 47 uses ISO 15924 as one of its sources.

I didn't know that. How cool! /former tolklang nerd

Ah, and there we are:

%%
Type: language
Subtag: qya
Description: Quenya
Added: 2009-07-29
%%
Type: language
Subtag: sjn
Description: Sindarin
Added: 2009-07-29

of course. :-) I haven't refreshed my Tolkien brain cells in years, and couldn't remember what the languages were actually called.

And yes, en-SG should probably be valid, and maybe en-MY (my family is from Johor, just over the causeway, and plenty of bits of it are in SG)... but what about en-CN? en-TW? cmn-BR? :-)

I guess one should just allow any language-script-region-variant combinations that are in the list, and not restrict things...

I guess one should just allow any language-script-region-variant combinations that are in the list, and not restrict things...

+1

I'm not 100% sure how the list of sub languages are maintained, e.g. en-GB, en-AU definitely exist, but is en-SG valid and who decides? Those don't appear to be in that list, but I haven't researched how the sub tags work. Oddly en-GB-oed (grandfathered) and en-scouse (redundant) appear to be listed directly...

Short answer: the odd tags are there for historical reasons, from before language tags were done the way they are now (old: single token; new: nearly unlimited composition of subtags). "Grandfathered" ones can't be expressed/analysed identically in the new system (though many such tags have a "Preferred-Value" showing how they can be expressed differently), while "redundant" ones can be. en-AU is not in the list explicitly because it can be composed from the language subtag "en" and the region subtag "AU", both of which are in the list.

Long answer: see BCP 47 (e.g. http://www.rfc-editor.org/rfc/bcp/bcp47.txt ). Or email me and I'll share what I know, as this is something I'm interested in and have spent a bit of time on, but probably shouldn't be hashed out entirely in comments on this entry.
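To illustrate the composition idea, here is a deliberately simplified Python sketch that splits a tag into language, script, region and variant subtags by shape alone. It is an assumption-laden subset of the BCP 47 grammar: it ignores extended language subtags, extensions and private use, it does not look anything up in the subtag registry (so it cannot tell whether a well-formed subtag is actually registered), and the names `LanguageTag` and `parse_tag` are made up.

```python
import re
from typing import NamedTuple, Optional

# Simplified shape: language[-Script][-REGION][-variant...]
TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)"
)


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    variants: tuple


def parse_tag(tag: str) -> Optional[LanguageTag]:
    """Split a tag into subtags, or return None if it does not fit the shape."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        return None
    script = match.group("script")
    region = match.group("region")
    variants = tuple(v.lower() for v in match.group("variants").split("-") if v)
    return LanguageTag(
        match.group("language").lower(),
        script.capitalize() if script else None,
        region.upper() if region else None,
        variants,
    )


for example in ["en-AU", "sjn-Teng", "en-Teng-AU", "de-CH-1996", "en-scouse", "en-GB-oed"]:
    print(example, parse_tag(example))
```

With this sketch, en-scouse splits cleanly into a language plus a variant, while en-GB-oed does not fit the shape at all ("oed" is too short to be a variant), which mirrors the redundant vs grandfathered distinction.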

I checked out wikipedia and this reply and I am more enlightened. :-)

A very quick search engine poke doesn't reveal any BCP-47 jQuery/javascript pickers out there, but I'm thinking there's definitely value in a well coded one. :-)

I'm thinking it may actually be worth hashing out some of this - this whole thing is starting to remind me of the X11 font picker, but I don't know enough about BCP47...

Ah. I begin to see some of the difficulties inherent when I skimmed past this part of BCP47...

   The choice of subtags used to form a language tag SHOULD follow these
   guidelines:

   1.  Use as precise a tag as possible, but no more specific than is
       justified.  Avoid using subtags that are not important for
       distinguishing content in an application.

       *  For example, 'de' might suffice for tagging an email written
          in German, while "de-CH-1996" is probably unnecessarily
          precise for such a task.

       *  Note that some subtag sequences might not represent the
          language a casual user might expect.  For example, the Swiss
          German (Schweizerdeutsch) language is represented by "gsw-CH"
          and not by "de-CH".  This latter tag represents German ('de')
          as used in Switzerland ('CH'), also known as Swiss High German
          (Schweizer Hochdeutsch).  Both are real languages, and
          distinguishing between them could be important to an
          application.

... So, yes, we do want to support the flexibility, but on the other hand, that means that people may do the wrong thing.

I begin to suspect that the right UI thing to do is definitely to have the full list available somehow as some kind of complicated picker on the user's profile page, including help text and/or FAQ link to common language choices, and then just allow people to choose from their own profile's shortlist as they post/comment (if they wish to).

I guess the majority of people will pick a top level language or three, and feel no need to delve into sub tags at all, so the picker should be built to support and encourage that behaviour.

People who want a region will probably be looking for it already.

Searching shouldn't be too hard - if you're searching for stuff in "en", "cmn", or "qya", then you should get all the sub tags, and if you're wanting specific sub tags then you can search for those.
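A minimal Python sketch of that search behaviour, using the prefix rule that RFC 4647 (the matching half of BCP 47) calls basic filtering: a search term matches a tag if it equals the tag, or equals a prefix of it that is followed by "-", ignoring case. The function names and the sample entries are hypothetical, and this says nothing about how the existing tag back end works.

```python
def matches(language_range: str, tag: str) -> bool:
    """Basic filtering: exact match, or a prefix that ends at a '-' boundary."""
    wanted, actual = language_range.lower(), tag.lower()
    return actual == wanted or actual.startswith(wanted + "-")


def search(language_range: str, entries: dict[str, list[str]]) -> list[str]:
    """Return the entries with at least one tag matching the search term."""
    return sorted(
        entry_id
        for entry_id, tags in entries.items()
        if any(matches(language_range, tag) for tag in tags)
    )


# Made-up entries, each tagged with one or more language tags.
entries = {
    "post1": ["en-AU"],
    "post2": ["en"],
    "post3": ["cmn-Hans"],
    "post4": ["es", "en-GB"],
}

print(search("en", entries))
print(search("en-GB", entries))
print(search("cmn", entries))
```

One limitation of plain prefix matching: a search for "en-AU" would not find an entry tagged "en-Teng-AU", because the script subtag sits in between. RFC 4647 also describes an extended filtering scheme for that case.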

I'm not entirely sure how the existing tag back end works for searching, but I imagine the BCP47 tags/sub tags could be integrated into that back end or a similar thing.

[Deleted for redundant question already asked/answered above in other comments.]

Edited 2012-06-22 03:38 (UTC)

I wonder if the language picker page may also want to suggest languages based on perceived location as well as top overall languages. I guess I'm envisioning a page that starts something like:

"Pick the language(s) you post and read in. Here are some ways to look for languages:"

[Two columns]

[Left column box] "Here are the top [n] languages currently used on Dreamwidth" [tickybox list follows]

[Right column box] "Based on the country in your Dreamwidth profile, here are the top [n] languages currently used by DW users in your country" [tickybox list follows]

Below that, have options for search (including to get the country by country results for countries other than the one in their profile)/full list/etc to find languages not covered in the two categories above. Between what's commonly used on DW as a whole and what's commonly used in their area, we can minimize how many users need to dig through the larger list and increase the odds that people who want to specify, say, British English will be offered that choice in the shorter tickybox list (since they are more likely to either be in Great Britain or to think to search for languages used in Great Britain as a way of locating it).

However, I have no idea how much more of a backend pain this would turn out to be, and I admit I wouldn't use this feature much since I mostly post in English and usually want to read other English postings.

1. Option two or three. It's useful for people to know what languages they can talk to a person in, not just what languages a person is blogging in. Would be nice if there were privacy settings for this -- I do not list my country publicly, and if I list languages I speak, my country would be more guessable, but I would like for people on my access list to be able to look at my profile and know what languages they can talk to me in.

2-6: no opinion, so long as a person who speaks unusual/rare languages can still pick their languages.

Users already have a field for location on the profile page. So if you wish to add a field for language(s), please don't categorise language according to country and region.
