Tuesday, November 24, 2015

Some more maps of language diversity

I made some maps of language diversity, to add to Hedvig's long lists of existing infographics here and here, using data from Glottolog.
I wanted to make a map showing the number of language families - 272 according to Glottolog, not counting sign languages, artificial languages, creoles, pidgins, and mixed languages.  Each curve shows the area in which languages in these families are found.  The colors were randomly selected to try to make the curves distinct from each other, although in some areas, such as southeast Asia, they are difficult to make out.
Try spotting in particular Indo-European, Turkic, Afro-Asiatic, Na-Dene, Quechua and Niger-Congo. Austronesian is the red triangle in the right-hand corner, although my curve-plotting algorithm (which draws a smoothed curve around the convex hull of the points) makes it incorrectly cover Australia as well.
The maps are generated in R using the 'maps' package.  Here's a function 'drawcurve' in R for drawing curves of a particular colour around a set of longitudes and latitudes, built on the 'spline.poly' function taken from this Stack Overflow page:

spline.poly <- function(xy, vertices, k=3, ...) {
  # Assert: xy is an n by 2 matrix with n >= k.
  # Wrap k vertices around each end.
  n <- dim(xy)[1]
  if (k >= 1) {
    data <- rbind(xy[(n-k+1):n,], xy, xy[1:k, ])
  } else {
    data <- xy
  }
  # Spline the x and y coordinates.
  data.spline <- spline(1:(n+2*k), data[,1], n=vertices, ...)
  x <- data.spline$x
  x1 <- data.spline$y
  x2 <- spline(1:(n+2*k), data[,2], n=vertices, ...)$y
  # Retain only the middle part.
  cbind(x1, x2)[k < x & x <= n+k, ]
}
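# Draw a smoothed closed curve around the convex hull of the given points.
drawcurve <- function(longs, lats, colour) {
  testpts <- list(x = longs, y = lats)
  # Keep only the points that lie on the convex hull, in hull order.
  chuld <- lapply(testpts, "[", chull(testpts))
  polygon(spline.poly(as.matrix(as.data.frame(chuld)), 100), border = colour, lwd = 2)
}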

The second map shows density of languages.  The size of the star on each point is proportional to the number of languages inside a 100 km radius of the point that the star is on.  This goes up to 134 in parts of New Guinea.  By that scale, most of Europe is too small to register.  100 km is a bit of an arbitrary threshold, but I wanted to illustrate the extreme end of language diversity: the number of languages found inside a radius only slightly larger than the train ride from Nijmegen to Amsterdam (89 km).
This only includes indigenous languages.  For example, English is not counted as being spoken in the United States, or anywhere outside of England.  If all non-indigenous languages are counted, places like New York apparently have up to 800 languages (English being one of them).  In this map, there are no extant languages at all in New York, with the closest being the Algonquian languages of Delaware and Massachusetts.
The highest point of language diversity on the planet, by this measure, is at the Trans-New-Guinea language Kandawo, with 134 languages inside a 100 km radius.
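
For anyone who wants to try a density count like this, here is a minimal sketch in base R. It is an illustration and not necessarily the code behind the map: the function names 'haversine_km' and 'count_neighbours' are made up for this example, the distance is a great-circle distance on a sphere of radius 6371 km, and the four coordinates are toy points rather than Glottolog data.

```r
# Great-circle distance in km between one point and a vector of points
# (haversine formula on a sphere; the radius of 6371 km is an assumption).
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# For every point, count the points within radius_km of it (itself included).
count_neighbours <- function(lon, lat, radius_km = 100) {
  vapply(seq_along(lon), function(i) {
    sum(haversine_km(lon[i], lat[i], lon, lat) <= radius_km)
  }, integer(1))
}

# Toy coordinates (hypothetical): three points close together, one far away.
lon <- c(145.0, 145.2, 145.5, 10.0)
lat <- c(-6.0, -6.1, -5.8, 50.0)
counts <- count_neighbours(lon, lat)
```

The counts could then be used to scale the plotting symbols, for example through the 'cex' argument of 'points'. Comparing every point with every other point is quadratic in the number of languages, which is still manageable for a few thousand points.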



(Kandawo images: map, photo)

  

Wednesday, November 11, 2015

The "Other Languages" in Ethnologue & Glottolog: isolates, contact languages, sign languages etc.

Hello everyone,

In this post, we will tackle the following questions:
  • What is a language family? How many language families are there in the world?
  • What is an isolate? How many isolates are there?
  • How many sign languages might there be in the world?
  • How many contact languages might there be in the world?
  • What is the difference when it comes to these categories with respect to Glottolog and Ethnologue? Does it matter?
Hopefully this might be useful to people interested in language history and/or users of Glottolog and Ethnologue. As usual, I will use headings so feel free to skim and skip things that are not interesting to you and only read what you find relevant.



There have been few posts lately by me (Hedvig), I will try and change that. Right now I'm out travelling in Europe, going to workshops and visiting friends and getting a lot of inspiration. I've already been at two workshops: Capturing Phylogenetic Algorithms for Linguistics in Leiden and, before that, a workshop on Glottobank at the MPI-SHH in Jena. So there's been even more language evolution than usual, and that might be visible in upcoming posts :)!

Language trees in popular media

Often when we talk of language relationships, especially in popular media, we don't mention isolate languages (languages with no living relatives), contact languages, sign languages or unclassified/unattested languages. That's understandable: it takes a lot of context for that information to be useful. But these are crucial parts of understanding the state and history of languages in the world.
Take for example the beautiful infographic to the right by Minna Sundberg, which is very popular online and regularly gets spread around: languages of those categories are not included. In this infographic we see a subset of the languages of the Indo-European and Uralic language families and their relationships, beautifully illustrated. The creator does note that only languages relevant to her other work are represented, which is most sensible, so perhaps we need not pick on her. However, in my experience this is quite representative of illustrations of language trees in popular media. The tricky categories are often just left out.

Let's start with definitions and some quick counts.

Language family = a group of languages that is hypothesised to share a common ancestor. Typically we use the term when we refer to the largest group, for example "Indo-European", but it is sometimes also used for sub-branches, like "Slavic". There are probably between 141 and 350 language families in the world, or possibly ultimately 1.

Isolate = a language without any living relatives, the most famous example being Basque. It might be that we know of the other members and that they have died out, or that we have never known of any relatives. Isolates are typically contested, often proposed to be part of one family or another. The counts of isolates range from 83 (Ethnologue) through 129 (Campbell, unpublished) to 189 (Glottolog).

Sign Language = most often this term refers to the native language shared by a group of deaf people, often with hearing members as well. Sometimes it is also used for signed versions of spoken languages, which are more comparable to writing than to a native deaf sign language. To make this clearer, sometimes the term "deaf sign language" is used. In a way, many, or perhaps all, sign languages are contact languages, since they often have their origin in different deaf people coming together for the first time and having to communicate despite not sharing a language... more on that some other time from a new writer (perhaps?). The estimate of the number of known deaf sign languages in the world ranges from 138 (Ethnologue) to 166 (Glottolog).

Contact languages = there are several different types of contact language varieties and there are also many different definitions of these varieties. As Muysken & Smith (1995) write, "[c]reolists agree neither about the precise definition of the terms pidgin and creole, nor about the status of a number of languages that have been claimed to be pidgins and creoles". One definition of a contact language that is often cited is that by Thomason (1997:3): "a language that arises as a direct result of language contact and that comprises linguistic material which cannot be traced back primarily to a single source language". Contact languages include pidgins, mixed languages and creoles.

The number of known contact languages ranges from 131 (Ethnologue) to >136 (Glottolog). The number of languages in the largest typical survey of contact languages is 76 (APiCS), but they don't make any official evaluation of what is and what is not a particular type of contact language.  Just to clarify, languages that have had a lot of contact with other languages, such as English, are not necessarily creoles. Just being in contact with another group does not qualify alone, according to most contact linguists. This gets gritty - more on this later.

Top-genetic unit = language families, isolates and other top-level groupings in a language tree. When we get into thinking about language trees and isolates, contact languages, deaf sign languages etc., it's sometimes easier to use the term "top-genetic unit" as opposed to "language family", since languages might be clustered together despite not sharing an ancestor, or a language might be its own closest shared ancestor (isolates). This matters, because different resources will differ in whether they group all isolates into one genetic unit (Ethnologue) or not (Glottolog). In this post, we will use this term to make things clearer.

How many language families? Where do they all come from?
Language history can be reconstructed in many different ways, and different data and methods can give radically different results. Some place more value on one kind of evidence, while others favour another. Deep time relationships are especially hard, since we have so little data to go on. If you want to see lots of different suggested language trees, go visit MultiTree!

Uralic and Indo-European are two of the most widely known language families of the world, but they are only 2 of the 141-350 language families (depending on who you go by) that exist today. As far as I know, the lowest estimate of the number of language families in the world is given by Ethnologue, 141, and the highest by Campbell (2007), who gives an approximate estimate of 350. If you know of more extreme published numbers, do send them my way.

If we believe that language sprang out of one place and time only (monogenesis), of course there is only 1 family. If we believe that language arose in different places around the same time (polygenesis), then there is more than 1. Both of these assumptions are based on the idea that trees of languages are a good representation of language history to start with, as opposed to (1) some kind of model of historical networks or (2) trees of different parts of language (lexicon, phonology, grammar) that, overlaid on each other, show a history of "language". This is comparable to how the historical relationships between different genes can build the history of a species, but not every gene will have the same history as the species. From the forest of overlaid trees of every gene emerges the tree of the species.

Speaking of genes and genesis, here's a map of the world projected as is most convenient when illustrating the human migration out of Africa. The arrows are based on mitochondrial (matrilineal) DNA. Image taken from Wikimedia Commons, here.
When we talk about cultural and language evolution we often make comparisons with biological evolution; however, always keep in mind that cultural material (memes) can spread without the spreading of biological material (genes).

How do we define what is and what is not a language?
Short answer: compare the lexicon between different varieties and conduct tests of mutual intelligibility; at a certain threshold level, say that they are 1 language or 2, and apply this consistently across your sample/the world. Ideally get data at the individual level.
Longer answer: this, this and this post.

Comparison of Ethnologue and Glottolog 
First off, Ethnologue and Glottolog are different organisations - funded in different ways and with different relationships to different parts of academia and different goals. Ethnologue is run by SIL International, a faith-based organisation that has invested greatly in description of the languages of the world through employing a swarm of field linguists. The Ethnologue is a catalogue for the public for looking up information about languages and countries, such as their endangerment level, where they're spoken and by how many etc. SIL International also maintains the ISO 639-3 standard of language codes, and many get access to these codes via Ethnologue.

Glottolog is a Max Planck organisation that has inherited much data from Ethnologue but made many changes to it. Glottolog is more concerned with language classification, relationships and bibliographies. Glottolog is more a tool for the already educated linguist, to look up references for a language or to compare trees to. Glottolog gives citations for every split in the genealogical trees; Ethnologue does not.

Both are open to criticism: you can submit a complaint to either.

Thanks to all the editors of both catalogues: Harald Hammarström, Robert Forkel, Martin Haspelmath, Sebastian Bank, Sebastian Nordhoff, M. Paul Lewis, Gary F. Simons, Charles D. Fennig, Barbara F. Grimes, Raymond G. Gordon, Jr. and Richard S. Pittman. Brilliant work!

Ethnologue and Glottolog both contain information about language classification and relationships. The first major comparison that can be made is purely in terms of how many languages they contain and how many families/top-genetic units. For Ethnologue, we get that information here and for Glottolog here. If you want to read about the process by which Glottolog works out classifications, go here; for similar information for Ethnologue go here. I checked the data from the two catalogues most recently on November 11, 2015. It might have changed since then. I have not bought data from SIL International (this is possible), and Glottolog is all free.

We get very different numbers in this comparison if we exclude potentially tricky categories like contact languages, deaf sign languages, isolates and unclassified languages. Read the notes below the table to see exactly what was subtracted in both cases. The number in parentheses is the total number of languages in Ethnologue, including extinct ones.


                                  Languages       Top-genetic units
Total Ethnologue                  7,102 (7,472)   141
Total Glottolog                   7,938           432
Ethnologue minus contact etc. 1   6,716           135
Glottolog minus contact etc. 2    7,288           239

1 Ethnologue. This number excludes constructed languages (1), creoles (88), deaf sign languages (137), language isolates, mixed languages (21), pidgins (13), and unclassified languages (51).

2 Glottolog. This number excludes pidgins (79), isolates (198), mixed languages (23), artificial (9), speech registers (6), “unattested” (61), “unclassifiable” (117) and sign languages (166). Creoles in Glottolog are classified under their lexifier family, which makes them hard to count but doesn’t increase the number of families. There are 37 languages with "creole" or "kriol" in their name in Glottolog, but I didn't subtract these.

Obviously there are some major differences in how Ethnologue and Glottolog classify languages in these categories. Let's dig into each issue.

Extinct and threatened languages
Now, Glottolog and Ethnologue both contain extinct and threatened languages, but not in the same way and not clearly marked. Glottolog doesn't mark this at all currently, but is supposedly going to add this by adopting UNESCO's classification of language endangerment. This would mean at least 230 extinct languages.

Ethnologue is famously a catalogue of living languages, but there is information about endangerment there. In their scale of endangerment (which is different from UNESCO's), there is a level for extinct (10), but there is no listing of how many languages actually are classified as such in the Ethnologue.

But... if you compare the listing of living languages per family here to the individual language family trees (like this one for example), the discrepancy between the two numbers is the number of extinct languages. For example, the family "creoles" is listed as having 88 living members but 93 members in total, and hence contains 5 extinct. (I confirmed this by bothering them on twitter.) I'm not sure how rigorous their information about extinct languages is (it might be just those gone extinct since their first edition, in 1951). In the row for Ethnologue below I'll put the total number in brackets after the living number.

The number of extinct languages in Ethnologue appears to be 370, considering the listing of the total number of languages in Ethnologue 2015 as 7,472 at this wiki article. However, I've not been able to confirm that.

EDIT! Ethnologue responded: SIL International counts up to 633 extinct languages via ISO 639-3. Since the catalogue only features living languages, they're probably not all there. My previous way of counting via the Ethnologue was erroneous.

Needless to say, the number of extinct languages is much larger than any of these numbers. Let's do a quick back-of-the-envelope count that I've borrowed from Bickel (p.c.): if we've spoken language for at least 100,000 years, and at any time there have been at least 5,000 lgs (we expect diversity to not grow with large population expansions), and a language changes from one to a newer "version" at least once every 1,000 years, then there have been at least half a million languages in human history and less than 2% live today. This is just a thought experiment to illustrate; much better calculations can be made.
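
The arithmetic of this thought experiment can be spelled out in a few lines of R. This is only a sketch of the back-of-the-envelope count: the three inputs are the assumptions stated above, the variable names are made up for the example, and using Ethnologue's 7,102 living languages as the present-day figure is an extra assumption.

```r
# Inputs of the thought experiment (all rough lower bounds, not measurements).
years_of_speech    <- 100000  # how long humans have spoken language
languages_at_once  <- 5000    # languages alive at any one time
years_per_turnover <- 1000    # time for a language to become a new "version"

# Assumption for the present day: Ethnologue's count of living languages.
living_now <- 7102

turnovers      <- years_of_speech / years_per_turnover  # 100 rounds of replacement
languages_ever <- turnovers * languages_at_once          # 500,000 languages in total
share_alive    <- living_now / languages_ever            # about 0.014, i.e. under 2%
extinct_ever   <- languages_ever - living_now            # roughly 493,000
```

Changing any one of the three inputs scales the total linearly, which is why the result should be read as an order of magnitude and not as a count.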

To sum up:



                                                     Languages   Top-genetic units
Threatened or worse lgs in Ethnologue (but alive!)   2,447       ?
Extinct lgs in Ethnologue                            633         ?
Extinct lgs in Glottolog                             ?           ?
Extinct lgs in UNESCO’s atlas                        230         ?
Threatened or worse in UNESCO’s atlas                2,236       ?
Extinct languages ever(?)                            493,000?


Isolates and small families
Another category that is potentially tricky between Ethnologue and Glottolog is isolates, the languages without (living) relatives. First off, since Glottolog contains extinct languages, the count already becomes different if living relatives is a condition. Secondly, Ethnologue has one grouping called "isolates" whereas Glottolog puts them each into their own group with only one member. This makes a drastic difference to the number of "families" or "top-genetic units" for isolates, as you can see from the table below.

The table should be read like so: "there are 97 families of 10 languages or less in Ethnologue, all those families together contain 353 languages".


                                               Languages   Top-genetic units
Families of 10 lgs or less in Ethnologue       353         97
Families of 10 lgs or less in Glottolog        740         178
Isolates in Ethnologue                                     1
Isolates in Glottolog
Isolates according to Campbell (unpublished)   129         ?

It's also interesting to note how many more small families there are in Glottolog. It just goes to show how much more of a splitter Glottolog is genealogically than Ethnologue - one can see it in the overall counts, but here it becomes extra visible.

Contact languages
As stated earlier, what is and what is not a contact language is tricky business (and perhaps some isolates are just old creoles? and perhaps all/many sign languages are creoles?). Both Ethnologue and Glottolog use the categories of mixed language, pidgin and creole. They also both have two top-genetic units (families) for all mixed languages and pidgins respectively. But (!), when it comes to creoles Glottolog sticks them in their main lexifier's family whereas Ethnologue has one family of all creoles. For example, Kriol [rop/krio1252] is in Ethnologue found in the creole language family (under "English based") whereas in Glottolog it's found under the Indo-European language family and then Pacific Creole English.

This makes a difference, but not much in the number of top-genetic units/families (+-1). Glottolog also doesn't mark languages as being creoles, meaning I could not know how many there are. I searched for languages with "creole" or "kriol" in their name, and got 37 hits in 3 different families. The difference here between Ethnologue and Glottolog is probably not that great at all, but it's unknown still.
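
A name search of this kind is easy to reproduce on any language list. Here is a minimal sketch in R, assuming a table with one row per language and columns 'name' and 'family'. The five rows and the family labels are toy data invented for the example, not an extract from Glottolog.

```r
# Toy language list (hypothetical rows and placeholder family labels).
langs <- data.frame(
  name   = c("Haitian Creole", "Kriol", "Swedish",
             "Seychelles Creole French", "Basque"),
  family = c("family_A", "family_B", "family_B", "family_A", "family_C"),
  stringsAsFactors = FALSE
)

# Case-insensitive match on "creole" or "kriol" anywhere in the name.
is_creole_name <- grepl("creole|kriol", langs$name, ignore.case = TRUE)

n_hits     <- sum(is_creole_name)                            # matching languages
n_families <- length(unique(langs$family[is_creole_name]))   # families they fall in
```

A search like this only finds creoles that are named as such, so the result is a lower bound on the number of creoles in the list.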



(Deaf) Sign languages
When it comes to the knowledge of sign languages in the world, we're at the very, very beginning. If you're curious, have a look at this project led by Prof. Ulrike Zeshan - the researcher who basically started the entire field.

For our comparison, the differences are quite small. It's interesting to note that both place all sign languages into one top-genetic unit/family even though they both (surely) know that not all sign languages are related. This makes it clear that top-genetic unit need not be a grouping of languages with a proposed shared ancestor.



Unattested, unclassified, unclassifiable
Both in Ethnologue and Glottolog there are languages that have proven very hard to put in a specific basket. They're not classified as isolates, but it's not clear where they'd belong either. Glottolog uses the term "unattested" to mean proposed language varieties without form-meaning pairs and "unclassifiable" for those with not enough information to be classified.

The difference between Ethnologue's unclassified and Glottolog's unclassifiable is actually quite large, and it's not entirely clear to me why that is.



                              Languages   Top-genetic units
                                          1
Unattested in Glottolog       61          1
Unclassifiable in Glottolog   117         1

Dialects and speech registers
Dialects, i.e. subvarieties of a "language", exist both in Glottolog and in Ethnologue but are not super well cared for in either location. Dialects in Glottolog get a code of their own, and are found under their parent language, but they can sometimes be a bit unsystematic and improvement is definitely needed (as the editors well know). Ethnologue lists dialects and sometimes also related languages in the information for each language, but it's unclear how systematic this is. In both cases, it's not clear how the structure between dialects works if, for example, one dialect is a subvariety of another dialect... (it's turtles all the way down, you know).

Dialects can vary with geography, socio-economic status (usually termed sociolects) and other variables. A specific subvariety that is only used in certain settings is called a register. Glottolog has a family of "speech registers" with 6 members. Ethnologue has no such grouping.

Constructed/artificial languages
Both Glottolog and Ethnologue contain known made-up languages, they just use different labels. Interestingly, Ethnologue only counts Esperanto whereas Glottolog counts 9. It's unclear how many of those listed by Glottolog. This is also interesting to compare to CALS - Conlang Atlas of Language Structures, which contains 772 entries.



In conclusion
These categories are tricky, and different policies give different outputs. The most striking difference, to me, is the number of isolates, small families and families in general. This makes a difference if you've included members of those small families or isolates and are comparing your trees to those of Ethnologue or Glottolog. To my knowledge, not much phylogenetic work is being done on those languages so the impact is probably small.

If you liked this and/or us, let us know by writing to us or by spreading our stuff around the twitter-facebook-tumblr-spheres.

For more about Ethnologue and Glottolog, you can check out posts at this blog tagged "Ethnologue" or "Glottolog".


POST-EDIT
After I published this blog post, Harald Hammarström kindly pointed out that many/all of the queries these numbers give rise to can usually be solved by paying close attention to Glottolog's and Ethnologue's policies, which were linked to earlier but can be repeated again (Glottolog here and Ethnologue here). He also pointed out that it might be interesting for readers to look at his review article of Ethnologue from earlier this year (free PDF here, appendices here).

There's more reading to recommend; these posts from this blog deserve to be repeated again: this, this and this post. As well as this post on Diversity Linguistics Comment on language codes and this post, also from DLC, on language names.

Any more questions or comments, feel free to contact us.

References:
Campbell, 2007. How Many Language Families are there in the World, Really? Paper presented at the International Conference on Historical Linguistics, Aug 6-12, 2007, University of Quebec, Montreal.


Hammarström, Harald & Forkel, Robert & Haspelmath, Martin & Bank, Sebastian. 2015.  Glottolog 2.6.  Jena: Max Planck Institute for the Science of Human History.  (Available online at http://glottolog.org, Accessed on 2015-11-11.)

Lewis, M. Paul, Gary F. Simons, and Charles D. Fennig (eds.). 2015. Ethnologue: Languages of the World, Eighteenth edition. Dallas, Texas: SIL International. Online version: http://www.ethnologue.com.

Muysken, P. and Smith, N. 1995. The study of pidgin and creole languages. In Jacques Arends, Pieter Muysken and Norval Smith (eds.), Pidgins and Creoles: An Introduction. John Benjamins, Amsterdam, Philadelphia.

Thomason, S. 1997. Contact Languages: A wider perspective. John Benjamins, Amsterdam.