Confidence Calculation of Semantic Sense Equality for Multi-Language Interlinguals


Our recent participation in the TIAD workshop at the LDK-2017 conference in Galway led us to an algorithm for calculating the confidence of a chain of dictionary entries (language pivots). The term ‘confidence’ here refers to the confidence that a chain of n languages (n-1 dictionary entries) carries the same semantic meaning from end to end. The algorithm is very simple: it calculates the confidence as the ratio of the number of existing links (edges) between the n languages to the number of possible links between n languages.

First, let’s set some definitions:

  • A cluster of languages is a group of distinct languages that are linked through dictionary entries representing a semantic sense (edges).
  • A node in this cluster is a lexeme: a basic lexical unit of a language, in one syntactic role, referred to by all morphological variations belonging to that lexical unit.
  • An edge, connecting the nodes of a cluster, is effectively a dictionary entry. It conveys a single semantic sense, translated from one language to another.
  • We assume that dictionary entries are bi-directional: L1 -> L2 is the same as L2 -> L1.

Clusters of size 1 and 2 could in theory be considered, but we ignore them here because they contain little verifiable information.

The smallest meaningful cluster has 3 languages. Such a cluster can have either 2 edges or 3 edges. The maximum number of edges for a three-language cluster is 3: all languages are connected to each other by dictionary entries. If we have 2 edges, our confidence that the cluster represents the same semantic concept is 2/3, or about 67%, while a cluster with 3 edges has a confidence of 3/3, or 100%. This seems to reflect what we expect in reality. If a cluster is a closed loop of three distinct languages, we have high confidence that all language terms refer to the same semantic sense, while an open chain of 3 languages connected by 2 edges gives us a good idea of what the missing link could look like, but we are not 100% sure.
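The ratio described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `confidence`, the language codes, and the representation of a dictionary entry as a pair of language codes are hypothetical choices made for this example, not part of the original algorithm description.

```python
def confidence(languages, entries):
    """Ratio of existing links to possible links within one cluster.

    languages: the distinct language codes that form the cluster
    entries:   (lang_a, lang_b) pairs, one per dictionary entry
               (hypothetical representation chosen for this sketch)
    """
    langs = set(languages)
    n = len(langs)
    if n < 3:
        raise ValueError("clusters of fewer than 3 languages are ignored")
    # Entries are assumed bi-directional, so each pair is stored as an
    # unordered frozenset; reversed and duplicate pairs collapse into one edge.
    edges = {frozenset(pair) for pair in entries}
    # Keep only real links between two different languages of this cluster.
    edges = {e for e in edges if len(e) == 2 and e <= langs}
    possible = n * (n - 1) // 2
    return len(edges) / possible


langs = ["en", "de", "fr"]
open_chain = confidence(langs, [("en", "de"), ("de", "fr")])
closed_loop = confidence(langs, [("en", "de"), ("de", "fr"), ("fr", "en")])
print(f"open chain: {open_chain:.0%}, closed loop: {closed_loop:.0%}")
```

Storing each entry as an unordered pair mirrors the bi-directionality assumption from the definitions: an entry en -> de and an entry de -> en count as one edge.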

Larger clusters and chains also follow this intuition, but they show more variation in confidence. A cluster of 4 languages can have up to 6 edges, connecting each language node with all the others. A cluster of 5 languages can have up to 10 edges. There is a pattern here: 3, 6, 10, … This is the triangular number sequence: 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55 … We ignore the first two entries, for n=1 and n=2.

The number of possible edges for an n-language cluster is:

E(n) = ((n−1)n)/2.
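As a quick sanity check, the formula can be compared against the triangular numbers listed above and against a brute-force count of unordered language pairs. The function names below are hypothetical and chosen only for this sketch.

```python
from itertools import combinations


def possible_edges(n):
    """E(n) = ((n-1)n)/2, the number of possible edges among n languages."""
    return (n - 1) * n // 2


def count_pairs(n):
    """Brute-force count of unordered pairs among n distinct nodes."""
    return sum(1 for _ in combinations(range(n), 2))


# The sequence quoted in the text, taken here as n = 1 .. 11.
triangular = [0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
computed = [possible_edges(n) for n in range(1, 12)]
print(computed == triangular)
print(all(possible_edges(n) == count_pairs(n) for n in range(1, 12)))
```

Both checks agree because E(n) is simply the number of ways to choose 2 languages out of n.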

Once n gets larger than 3, a second number becomes relevant: the minimum number of links required to form a connected cluster of n distinct languages:

Emin(n) = n-1.

Obviously the number of existing edges can be anywhere between Emin(n) and the maximum number of edges E(n). The larger the cluster, the lower the confidence that a chain with only the minimum number of edges refers to the same semantic sense. Anyone remember how much fun they had playing “Stille Post” or the English version “Chinese whispers”? Picture this with players who each speak only 2 languages and who have to pass on a message they heard in language 1 to the next player in language 2.

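Fortunately, our confidence calculation follows reality: for a chain with only the minimum number of edges it is Emin(n)/E(n) = 2/n, which drops to 50% at n=4 and below 50% for every larger cluster.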

Here is a graph for confidence of the first 10, starting at Emin.
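The values behind such a graph can be tabulated with a short script. This is a sketch under assumptions: "the first 10" is read here as cluster sizes n = 3 to 12, and the helper names are hypothetical. It uses the definitions of Emin(n) and E(n) given above, for which the chain confidence simplifies to (n-1) / ((n-1)n/2) = 2/n.

```python
from fractions import Fraction


def min_edges(n):
    """Emin(n) = n-1, the fewest edges that still connect n languages."""
    return n - 1


def max_edges(n):
    """E(n) = ((n-1)n)/2, the number of possible edges."""
    return (n - 1) * n // 2


def confidence_range(n):
    """All confidence values for n languages, from Emin(n) up to E(n) edges."""
    return [Fraction(e, max_edges(n))
            for e in range(min_edges(n), max_edges(n) + 1)]


# Assumption: "the first 10" is interpreted as n = 3 .. 12.
chain_confidence = {n: Fraction(min_edges(n), max_edges(n)) for n in range(3, 13)}

for n, c in chain_confidence.items():
    print(f"n={n:2d}  Emin={min_edges(n):2d}  E={max_edges(n):2d}  "
          f"confidence at Emin = {c} ({float(c):.0%})")
```

Exact fractions are used instead of floats so that the 2/n pattern stays visible in the output.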

So, what is the purpose of all this? We want to come up with a formula that allows a machine to assess whether, and under what conditions, loops of 3 languages can be merged into an interlingual node representing 3 or more languages. Next, of course, come the conditions under which to merge that node with a 4th, a 5th … language. The first reason we want to merge into an interlingual node is simply storage requirements. A multi-language dictionary of e.g. 20 languages would require E(20)=190 edges to store, and parse, for each semantic sense! The Linguistic Society talks of 6909 distinct languages, not counting dialects and variations.
The second reason is that the semantic networks in the NeuroCollective need to be centered around interlinguals.
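To get a feeling for the storage argument, the pairwise edge count can be compared with a hub layout. Note the assumption: the text does not specify how an interlingual node is stored, so the sketch below supposes, purely for illustration, that each language keeps exactly one link to the interlingual node. The function names are hypothetical.

```python
def pairwise_edges(n):
    """E(n): edges needed when every language links to every other one."""
    return (n - 1) * n // 2


def hub_edges(n):
    """Links needed per semantic sense under the assumed hub layout.

    Assumption for illustration only: every language stores one link
    to a shared interlingual node, giving n links in total.
    """
    return n


for n in (3, 5, 20, 6909):
    print(f"{n:5d} languages: {pairwise_edges(n):9d} pairwise edges, "
          f"{hub_edges(n):5d} links via an interlingual node")
```

Under this assumption the pairwise count grows quadratically with the number of languages, while the hub count grows linearly.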
