World Language Dataset

World Language Dataset [page under construction]

This dataset aims to provide detailed information about language usage worldwide since 1945. It is hosted by Zeev Maoz, University of California, Davis.

Overview
The World Language Dataset (WLD) aims to provide detailed information about language use worldwide from 1945 to 2010. It contains data on the number of speakers of each language in each state in the international system. These numbers are given at five-year intervals (1945, 1950, ..., 2010). The data also record the percentage of each state's population that speaks a given language. For some languages (as detailed in the Codebook), speakers are divided into mother-tongue speakers and second- or multi-language speakers.
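As an illustration of the layout described above, the following Python sketch lists the half-decade observation years and converts a speaker count into a percentage of a state's population. The function names and the figures in the usage lines are hypothetical and are not taken from the WLD.

```python
def observation_years(start=1945, end=2010, step=5):
    """Return the half-decade observation years, inclusive of both ends."""
    return list(range(start, end + 1, step))


def percent_speakers(speakers, population):
    """Share of a state's population speaking a language, in percent."""
    if population <= 0:
        raise ValueError("population must be positive")
    return 100.0 * speakers / population


years = observation_years()
print(len(years), years[0], years[-1])           # 14 1945 2010
print(percent_speakers(2_500_000, 10_000_000))   # 25.0 (invented figures)
```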

This project is still under development. Thus far, language usage data for all of Europe and large portions of Africa and South America have been collected and compiled. The project is organized in three stages. The first stage identified the primary languages used in each country from independence to 2010. The second stage identified the major data sources on language use and collected data from these sources according to a language use tree classification. The third stage, which is still under way, consists of cleaning the data, reconciling discrepancies among sources, and imputing data for missing cases.

Acknowledgments
I would like to thank the National Science Foundation for helping to fund this project. I would also like to thank Multitree Digital Library for their cooperation in accessing important language data. Finally, I would like to thank my graduate student researchers, Paul Johnson and Jaime Jackson, as well as the undergraduate coders Nick Smith, Hazel Hyon, Irene Ezran, Anush..., Keane..., Jelena..., and Gurkern.... for their efforts in collecting and managing the World Language Data.

Citation

Dataset
When complete, the WLD will contain three datasets: the national language dataset, the regional language dataset, and the global language dataset.

The National Language Dataset. The unit of analysis of this dataset is the individual state, observed at five-year intervals. This dataset provides information regarding the number of speakers by language, as well as the percent of the state’s population speaking a given language.

The Regional Language Dataset. The unit of analysis of this dataset is the region, observed at five-year intervals. This dataset uses the Correlates of War (COW) regional designations to identify language use in each region for each half-decade.

The Global Language Dataset. The unit of analysis of this dataset is the global system, observed at five-year intervals. The dataset aggregates the number of speakers of a given language across all states.
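The relationship between the three datasets can be sketched in Python as follows. This is a minimal illustration, not the project's actual procedure: the record layout (state, region, year, language, speakers), the state and region labels, and all figures are invented for the example, and the real files may be organized differently.

```python
from collections import defaultdict

# Hypothetical national-level records:
# (state, region, year, language, speakers). Labels are placeholders,
# not actual COW codes or WLD values.
national = [
    ("A", "Region1", 1950, "LangX", 100),
    ("B", "Region1", 1950, "LangX", 50),
    ("B", "Region1", 1950, "LangY", 30),
    ("C", "Region2", 1950, "LangX", 20),
]


def aggregate(records, key):
    """Sum speaker counts over all records that share the same key."""
    totals = defaultdict(int)
    for rec in records:
        totals[key(rec)] += rec[4]
    return dict(totals)


# Regional level: one total per (region, year, language).
regional = aggregate(national, lambda r: (r[1], r[2], r[3]))

# Global level: one total per (year, language).
global_totals = aggregate(national, lambda r: (r[2], r[3]))

print(regional[("Region1", 1950, "LangX")])  # 150
print(global_totals[(1950, "LangX")])        # 170
```

Grouping on a key function keeps the same routine usable for both the regional and the global level; only the grouping key changes.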

Questions and Feedback
The World Language Dataset is hosted by Zeev Maoz (University of California, Davis). Further details on the datasets and coding procedures, as well as access to the raw data, can be obtained from the Graduate Student Researcher, Jaime Jackson, via email at [email protected].

Links

WLD national data

WLD regional data

WLD global data

WLD Codebook

WLD Sources

Multitree Digital Library