A translation engine must be specific and not based on probability

Human translation is an interpretation. The translator ‘understands’ original source information and converts it into a target language. This obviously leads to problems that are typically handled by the cognitive abilities of the human brain.

Human communication is two-way communication (for the most part). An ambiguous expression can easily be cleared up by asking the other person to refine it.

In written communication, it is a bit trickier, especially when the author is not available to be asked. Translation of written text naturally opens up alternatives that can have lasting consequences, e.g. the somewhat misleading interpretation of Wittgenstein’s ‘Wortspiel’ as ‘word game’ instead of ‘wordplay’. Reading the German text suggests that Wittgenstein was talking about the actor role of words in a theater-like play, and how the same actors interact in different plays yet retain a recognizable meaning throughout all of them. A modern analogy would be how actors in today’s movies seem to fall into the same role category.

So, to put this observation in a nutshell: communication between humans needs to be confirmed, in a similar way to the one described here.

If an idea (simplified: a concept) is communicated from person A to person B, person B needs to confirm to A that the same concept is understood. This can happen directly, but it leaves a large uncertainty resulting from the individual contextual backgrounds and, of course, the complexity of the concept/idea.

What really needs to happen is that person B needs to explain the idea to person C and then person C needs to explain it to person A. This would form a more robust triangle of information that allows reasonable confidence that all three persons are talking roughly about the same thing.
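
This triangle can be read as a simple confirmation protocol. Here is a minimal sketch in Python, purely for illustration: restate() and agrees() are hypothetical stand-ins for a person re-explaining an idea and for A judging whether the returned version still means the same thing; in reality both steps are human judgement, not code.

```python
# A minimal sketch of the A -> B -> C -> A confirmation triangle.
# restate() and agrees() are hypothetical placeholders for human steps.

def restate(person: str, idea: str) -> str:
    """Placeholder: the person re-explains the idea in their own words."""
    return f"[{person}'s version of] {idea}"

def agrees(original: str, returned: str) -> bool:
    """Placeholder: A decides whether the returned version still means the same."""
    return original in returned

def triangle_confirmed(idea_from_a: str) -> bool:
    b_version = restate("B", idea_from_a)   # A explains the idea to B
    c_version = restate("C", b_version)     # B explains it to C
    return agrees(idea_from_a, c_version)   # C explains it back to A

print(triangle_confirmed("words behave like recurring roles in a play"))
```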

In modern software development, with its very complex systems, requirement documents, use cases, architecture documents, etc. are used as a medium to develop consensus among project participants. The triangle of person A writing a requirement, person B using that requirement to develop a feature, and person A then evaluating the feature is such a communication loop. All too often it is also a great example of failed communication.

If we incorporate machine translation into our lives, and this translation is based on ‘learning’ from the existing parallel corpus (questionably translated text from the world wide web), we are introducing an interpreter into the communication between person A and person B. Let’s call this interpreter MT. By definition, person A cannot confirm MT’s translation; otherwise person A would not need MT. Person A has to trust that MT translates correctly and precisely. Person B, receiving translated communication from MT, also cannot verify that MT translated correctly and has to take the translation at face value. Reflecting/echoing the communication back to person A through MT proves that MT is consistent, but not that the concept is understood. Similar to the language pivoting problem, a third medium is needed to confirm that the meaning is still the same.
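
To see why the echo test is not enough, consider this toy round-trip sketch (the translation tables are invented and hard-coded; no real MT system or API is implied). The round trip A → B → A is perfectly consistent even though the meaning handed to person B may already be off:

```python
# Hypothetical 'MT' tables used only to illustrate the point: the round
# trip A -> B -> A is consistent even though the meaning was lost.
DE_TO_EN = {"Wortspiel": "word game"}   # questionable translation
EN_TO_DE = {"word game": "Wortspiel"}   # perfectly consistent inverse

def translate(term: str, table: dict[str, str]) -> str:
    return table.get(term, term)

source = "Wortspiel"
echoed = translate(translate(source, DE_TO_EN), EN_TO_DE)

# The echo matches the source, so MT looks 'correct' to person A ...
assert echoed == source
# ... yet person B received 'word game', which may not be what A meant.
print(translate(source, DE_TO_EN))
```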

A similar problem is: how do we verify that two random humans perceive the same color ‘blue’? Obviously, both humans have been trained in a similar environment and have a somewhat similar genetic makeup, but the only way to confirm that both perceive the same spectral wavelengths as ‘blue’ is to have a precise, calibrated optical device measure what each person calls ‘blue’ and translate the color into a range of spectral wavelengths that can be compared.
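
As a toy illustration of that comparison (the wavelength ranges below are invented for the example; blue light is commonly cited as roughly 450–495 nm, but individual boundaries vary):

```python
# Toy comparison of what two people call 'blue', expressed as
# measured wavelength ranges in nanometres (values are illustrative).

def overlap(range_a: tuple[float, float], range_b: tuple[float, float]) -> float:
    """Return the width of the overlap between two wavelength ranges, in nm."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    return max(0.0, high - low)

person_a_blue = (450.0, 495.0)   # what the device measured for person A
person_b_blue = (460.0, 500.0)   # what the device measured for person B

shared = overlap(person_a_blue, person_b_blue)
print(f"Shared 'blue' band: {shared:.0f} nm")  # 35 nm of agreement
```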

Translating person A’s term ‘blue’ into a wavelength range and then into the equivalent in a different language is a precise communication that can be verified, but most natural-language terms have multiple senses depending on context and are not even precisely formulated by the individuals who want to express an idea/concept.

Lexicography has been developing standard mappings between languages for decades and has created a subset of word mappings for everyday communication. Specialized dictionaries have expanded this into domain-specific areas. This is the only approach that can work as a machine translation platform because it is curated and reviewed.

Its limitations are:

  • very specific, tedious, and therefore expensive work that is typically proprietary
  • the approaches to storing the data are very limited, particularly at a moderated crowdsourcing level

Today’s approaches, without naming the players, train awesome neural network engines on marginal translation data because we do not want to spend the energy and money to actually curate the data and store it, e.g. in an interlingual graph database.
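
What such an interlingual store could look like is sketched below: a minimal in-memory stand-in for a graph of language-neutral concept nodes connected to curated terms, not a real graph database. The concept identifiers and terms are invented for illustration.

```python
# Minimal in-memory sketch of an interlingual graph: language-neutral
# concept nodes connected to curated terms per language.
from collections import defaultdict

class InterlingualGraph:
    def __init__(self) -> None:
        # concept_id -> language -> set of curated terms
        self._terms: dict[str, dict[str, set[str]]] = defaultdict(lambda: defaultdict(set))

    def add_term(self, concept_id: str, language: str, term: str) -> None:
        """Attach a curated, reviewed term to a concept node."""
        self._terms[concept_id][language].add(term)

    def translate(self, term: str, source: str, target: str) -> set[str]:
        """Pivot through the concept node instead of mapping word to word."""
        results: set[str] = set()
        for concept_id, languages in self._terms.items():
            if term in languages.get(source, set()):
                results |= languages.get(target, set())
        return results

graph = InterlingualGraph()
graph.add_term("COLOR_BLUE", "en", "blue")
graph.add_term("COLOR_BLUE", "de", "blau")
graph.add_term("COLOR_BLUE", "fr", "bleu")

print(graph.translate("blau", source="de", target="fr"))  # {'bleu'}
```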

The NeuroCollective is a symbiotic environment in which neural engines are trained with curated interlingual internal data. The internal data is generic enough to perform translations, while the neural engine allows an ‘educated guess’ to be made when an out-of-dictionary term is encountered. This out-of-dictionary neural engine response is then cross-validated/curated and added to the dictionary, which is periodically used to re-train the neural engine.

The neural engine is also used to compare neural, learned-pattern-based responses against responses from the interlingual graph database.
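
A minimal sketch of that feedback loop is shown below, under the assumption of a curated dictionary plus some neural model with a guess-like interface. NeuralEngine and all function names are hypothetical placeholders, not an existing API:

```python
# Hypothetical sketch of the curate-then-retrain loop described above.

class NeuralEngine:
    def __init__(self, dictionary: dict[str, str]) -> None:
        self.dictionary = dict(dictionary)          # stand-in for training data

    def guess(self, term: str) -> str:
        """Educated guess for an out-of-dictionary term (placeholder logic)."""
        return self.dictionary.get(term, term + "?")

def translate(term: str, dictionary: dict[str, str], engine: NeuralEngine,
              review_queue: list[tuple[str, str]]) -> str:
    if term in dictionary:                          # curated path: precise, verified
        return dictionary[term]
    guess = engine.guess(term)                      # neural path: educated guess
    review_queue.append((term, guess))              # queued for human cross-validation
    return guess

def curate_and_retrain(dictionary: dict[str, str],
                       review_queue: list[tuple[str, str]],
                       approvals: dict[str, str]) -> NeuralEngine:
    # Reviewed guesses are promoted into the curated dictionary ...
    for term, _guess in review_queue:
        if term in approvals:
            dictionary[term] = approvals[term]
    review_queue.clear()
    # ... and the updated dictionary is used to re-train (here: rebuild) the engine.
    return NeuralEngine(dictionary)

dictionary = {"blau": "blue"}
engine = NeuralEngine(dictionary)
queue: list[tuple[str, str]] = []

print(translate("blau", dictionary, engine, queue))       # curated: 'blue'
print(translate("Wortspiel", dictionary, engine, queue))  # guess: 'Wortspiel?'
engine = curate_and_retrain(dictionary, queue, {"Wortspiel": "wordplay"})
print(translate("Wortspiel", dictionary, engine, queue))  # now curated: 'wordplay'
```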

If you are interested in this work, or have data to compare or contribute, please contact us at: sales@neurocollective.com
