The traditional approach to accessing data from a computer system is through some form of organized data storage, e.g. a database. The limitation so far has been that the structure of the data had to be very specific, and preparing the data and the software to access it was complicated; but once the data was there, access was fast.
Enter the world of non-structured data and statistical approaches to handle the vast amount of data that did not fit into the world of relational databases. Written text is non-structured data, but only in the sense that it, or its components, cannot be easily retrieved in a relational, structured world. A new approach was formed: generate statistical maps, quantify relationships between words, analyze the structures of sentences, and use all kinds of creative linguistic tricks to get a handle on the problem.
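A minimal sketch of that word-relationship idea: count how often word pairs appear near each other in a corpus. The function name, window size, and toy corpus below are my own illustration, not any particular tool's API.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often word pairs occur within `window` words of each other."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            # Look only forward; sort the pair so (a, b) == (b, a).
            for neighbor in words[i + 1:i + 1 + window]:
                counts[tuple(sorted((w, neighbor)))] += 1
    return counts

corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
]
counts = cooccurrence_counts(corpus)
print(counts[("cat", "the")])  # → 3
```

From counts like these, one can derive the "statistical maps" of which words tend to travel together; real systems scale the same idea to millions of sentences.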
The problem now is that we have entered the realm of probabilities, chasing the 100%-probable fact in our queries. Do we need that? I'm not sure! There are some compelling arguments that a spell checker or auto-completion is useful even when it is only 80% correct. Let's not forget there is supposed to be a human on the other end who approves the content.
The latest fad is neural networks. I actually played with one in the late '80s, but it left me with the feeling that we had a long way to go. The hardware and math processing have finally caught up with the idea. Neural networks now offer a much more elegant approach than the blunt word counting in a corpus. We can now train a system to recognize cats in a picture, and there are phone apps that can diagnose the malignancy of a skin mole better than the average dermatologist.
But here again we observe two things: 1) the system does not know everything — there is an error rate; and 2) a human makes the final decision. The technical cost of training a neural network is immense, depending on the complexity of the task to be learned. Accuracy has been going up since the statistical approach, but the last few percent needed to reach the "confirmed, I know this" state are incredibly expensive to achieve.
So please don't think that I am dissing the new technologies; I think they are awesome and a breath of fresh air to this old programmer. What I am saying is that we need to step back for a moment and stop being distracted by flashing lights.
In the realm of languages, both statistical and neural-network methods are cool gadgets when a human is guiding the result. Can a neural network suggest a translation into a foreign language to rent a car or explain a street sign? Yes. Not much is lost if it is wrong, and ultimately we are talking about a technology that supports human-to-human interaction. At worst, it could lead to a funny episode.
Would you have a neural network argue a legal case unsupervised, in a foreign language, on your behalf?
Would you have a neural network decide a legal case?
Where I see both statistical and neural networks as really useful, particularly in the linguistic realm, is in generating seed data for a relational database that stores the results. Yes, I wrote it: I am reverting back to structured data, accessed in a structured way. Languages and the mapping of their interlingual concepts are a massive problem. Having these databases seeded by either statistical or neural networks would save a huge amount of work. Of course we are importing the same error rates from the data sources, but it is relatively easy for humans to fix the detected errors, e.g. via crowdsourcing, discourse resolution, etc. Such a database can also serve as a reference against which to compare the results of other approaches to parsing the knowledge of the world.

There is no fear that artificial intelligence will take over the world. There is a fear that artificial intelligence will be given powers it shouldn't have, because it inherently cannot be trusted to do things it has not been trained for. AI will be impressive, but humans will have the last word.
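The seed-then-review workflow above can be sketched in a few lines. The schema, table name, and example rows are hypothetical: a model writes suggested mappings with a confidence score, and a human flips an `approved` flag after checking them, after which access is plain structured SQL again.

```python
import sqlite3

# Hypothetical schema: machine-suggested concept mappings land here with a
# confidence score; a human reviewer later sets `approved` to 1.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE concept_map (
        id INTEGER PRIMARY KEY,
        source_term TEXT NOT NULL,
        target_term TEXT NOT NULL,
        confidence REAL NOT NULL,
        approved INTEGER NOT NULL DEFAULT 0
    )
""")

# Seed rows as a statistical or neural model might emit them (made-up data).
suggestions = [
    ("house", "Haus", 0.97),
    ("bank (river)", "Ufer", 0.71),
    ("bank (money)", "Bank", 0.93),
]
conn.executemany(
    "INSERT INTO concept_map (source_term, target_term, confidence) VALUES (?, ?, ?)",
    suggestions,
)

# A human reviewer approves a mapping after checking it.
conn.execute("UPDATE concept_map SET approved = 1 WHERE source_term = 'house'")

# Structured, fast access: query only the human-confirmed mappings.
approved = conn.execute(
    "SELECT source_term, target_term FROM concept_map WHERE approved = 1"
).fetchall()
print(approved)  # → [('house', 'Haus')]
```

The point of the design is that the expensive model runs once to fill the table, while queries against the reviewed rows are cheap, deterministic, and carry no inherited error rate.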