LINGUIST List 26.2325

Mon May 04 2015

FYI: The Gavagai Living Lexicon

Editor for this issue: Ashley Parker

Date: 04-May-2015
From: Magnus Sahlgren
Subject: The Gavagai Living Lexicon

We are proud to announce the release of the Gavagai Living Lexicon, an online lexicon that gives you access to the knowledge our distributional semantic models gather about terms in language as it is used by people in every corner of the known world.

The lexicon is available at:

The lexicon is based on Gavagai's distributional semantic models, which learn language continuously from live data feeds of millions of documents per day from both social and news media. This means that the living lexicon is continuously evolving and always up to date with current language use. As an example, try searching for a topical term like "earthquake" to see what the lexicon has learned during the last couple of days.

The lexicon currently provides the following information:

- the frequency rank of the term in the lexicon
- similarly spelled terms
- common left and right neighbors (i.e. left and right collocations)
- multi-word units (n-grams) that include the search term
- semantically similar terms (i.e. terms that have been used in a similar way in online data)
- associatively related terms (i.e. terms that have often been used in the same documents as the search term)
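
The difference between the last two items can be made concrete with a minimal Python sketch on a toy corpus. This is an illustration only, not Gavagai's models or API: the three "documents" and the function names (neighbors, associated, context_overlap) are hypothetical, and the Jaccard overlap of immediate neighbors is a crude stand-in for a real distributional model.

```python
from collections import Counter

# Toy corpus: each string stands in for one document (illustrative data only).
DOCS = [
    "the apple pie recipe needs one green apple",
    "the pear pie recipe needs one ripe pear",
    "apple stock rose after the apple event",
]


def neighbors(term, docs):
    """Count the words immediately to the left and right of `term`."""
    left, right = Counter(), Counter()
    for doc in docs:
        tokens = doc.split()
        for i, tok in enumerate(tokens):
            if tok != term:
                continue
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1
    return left, right


def associated(term, docs):
    """Count the words that share a document with `term`."""
    counts = Counter()
    for doc in docs:
        tokens = set(doc.split())
        if term in tokens:
            counts.update(tokens - {term})
    return counts


def context_overlap(a, b, docs):
    """Jaccard overlap of the left/right neighbor sets of two terms."""
    la, ra = neighbors(a, docs)
    lb, rb = neighbors(b, docs)
    ctx_a = {("L", w) for w in la} | {("R", w) for w in ra}
    ctx_b = {("L", w) for w in lb} | {("R", w) for w in rb}
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0
```

In this toy data, "pear" never shares a document with "apple", yet the two terms have overlapping neighbors ("the" on the left, "pie" on the right), so context_overlap("apple", "pear", DOCS) is above zero: they are used in a similar way. "stock", by contrast, shares a document with "apple" but has no neighbors in common with it, so it is associatively related without being similar under this measure.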

Both the semantically similar and the associatively related terms are automatically grouped into clusters of similar and related terms, respectively. The semantic groups are also labelled with their most common collocations. You can think of the labels as an explanation of why the terms are clustered together. As an example, try searching for "apple". You can see that the distributional semantic model has learned a number of different usages of "apple", including apple as an ingredient, apple as a product, apple as a stock, and apple as a fruit. Another example is a search for "suit", which demonstrates that the lexicon has learned both the garment sense and the legal sense.

The lexicon is currently available in Arabic, Danish, English, Estonian, Finnish, German, French, Latvian, Lithuanian, Norwegian, Portuguese, Russian, Spanish, and Swedish. More languages will be added continuously. The size of the vocabulary for each language depends on the amount of online data we listen to for that particular language. English is currently the largest language in the lexicon, with a vocabulary of more than 2,500,000 unique terms. The 200,000 most common of these terms have entries in the English lexicon.

If you are a developer and want to access the lexicon functionality directly through our API, simply sign up for a free developer account at:

Note that our developer APIs also offer multi-document summarization, tonality analysis, and keyword extraction.

We appreciate any feedback on the lexicon and our APIs.

Contact us at:

(Publications describing the algorithms behind the living lexicon are under preparation and will be added to the lexicon site once published.)

Linguistic Field(s): Computational Linguistics; Text/Corpus Linguistics

Page Updated: 04-May-2015