“Rosetta Stone” — The most important text sample in history and the role of labeled data in machine learning
Imagine you open the morning paper and find yourself staring at a collection of symbols and characters, with no idea what it all might mean.
Fortunately this is not the case. We have all been trained to convert a sequence of characters into meaning. We were trained to read. This implicitly assumes that a) there are known rules for converting text into understandable messages and b) we are aware of these rules.
In a more analytical sense, we learnt these rules in the past and now use them in the text interpretation engine in our brain as a scoring logic: we apply them to fresh text samples and receive meanings and associations.
During a business trip to London in 2019 I visited the British Museum to see with my own eyes what is, in my opinion, the most important text sample in history: the Rosetta Stone.
It was discovered in 1799 by the French in Egypt, taken over by the British in 1801, and has been displayed in the British Museum since 1802.
It was the first artefact recovered in modern times on which the same text is written in three scripts: hieroglyphic, Demotic, and Ancient Greek. For centuries the script of the ancient Egyptians had been a mystery to researchers. They were not able to decipher the texts. The Rosetta Stone built the bridge between these scripts and allowed us to translate.
To express that in data science terms: for centuries a very large collection of hieroglyphic texts was available. However, there was no scoring logic to assess the content and to understand the meaning.
Data that allowed us to learn the underlying logic
The small data sample from the Rosetta Stone allowed us to learn the rules and the logic of the hieroglyphic script. Consequently, mankind had a logic that could be applied to large collections of hieroglyphic texts in order to read and interpret them. We benefit from this training sample of text (documents) and can apply what was learned from it to understand and interpret the many other texts that were found in the palaces and pyramids of the Ancient Egyptians.
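The idea of learning from one small labeled sample and applying the result to a large unlabeled collection can be sketched in a few lines of Python. This is a toy illustration only: the symbols, the "parallel text" and the function names learn_mapping and decode are invented for this example, and the one-symbol-per-letter assumption is a strong simplification that does not describe how hieroglyphs actually work.

```python
# Toy illustration: invented symbols, not real hieroglyphs.
# Assumption: every unknown symbol stands for exactly one known letter.

def learn_mapping(unknown_text, known_text):
    """Learn a symbol-to-letter mapping from one aligned (labeled) sample."""
    if len(unknown_text) != len(known_text):
        raise ValueError("the parallel texts must align symbol by symbol")
    mapping = {}
    for symbol, letter in zip(unknown_text, known_text):
        # Keep the first label seen for a symbol; reject contradictions.
        if mapping.setdefault(symbol, letter) != letter:
            raise ValueError(f"inconsistent label for symbol {symbol!r}")
    return mapping


def decode(text, mapping, unknown="?"):
    """Apply the learned mapping to a text that has no translation."""
    return "".join(mapping.get(symbol, unknown) for symbol in text)


# The small labeled sample: the same word in the unknown and in a known script.
labeled_unknown = "#@%&*!+"
labeled_known = "ptolemy"
mapping = learn_mapping(labeled_unknown, labeled_known)

# The large unlabeled collection (here: four short texts).
collection = ["&%@", "#*@", "!*&@", "!%="]
decoded = [decode(text, mapping) for text in collection]
print(decoded)  # ['lot', 'pet', 'melt', 'mo?']
```

The last text contains a symbol that never occurred in the labeled sample, so it stays unresolved. A labeled sample can only teach the rules it actually covers.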
The exhibition in the British Museum states that hieroglyphs were originally assumed to be a “picture language”. Based on the Rosetta Stone, researchers found that the script combines characters with picture signs.
Without the Rosetta Stone we might have guessed for decades or even centuries about the potential meaning of the hieroglyphs.
Maybe these days we would have applied machine learning methods, such as pattern recognition combined with computer vision. And after thousands of hours of CPU processing with advanced deep learning methods we might have generated similar findings.
However, this would have involved immense effort and highly complex calculations, whereas the problem was in fact solved by the availability of a stone that contained the relevant content, the relevant “labels”.
Labeled data is important
A similar but not identical process is in place in supervised machine learning.
- In computer vision/image classification we usually have large collections of unlabeled pictures and we need a certain amount of labeled pictures to learn whether we should assign a picture of that collection to category A or B.
- In fraud investigation, machine learning models benefit from a sample of labeled transactions that represent the outcome of the investigation. These cases can be used to train a supervised machine learning model that automatically scores future transactions in real time.
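The fraud example above can be sketched as follows. This is a minimal illustration under stated assumptions, not a production approach: the transactions, the two features (amount in thousands, time of day as a fraction of 24 hours) and the labels are invented, and a simple nearest-centroid classifier stands in for whatever supervised model would be used in practice.

```python
from statistics import mean


def train_centroids(labeled_rows):
    """Average the feature vectors per label (nearest-centroid training)."""
    by_label = {}
    for features, label in labeled_rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in by_label.items()
    }


def score(features, centroids):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: distance(centroids[label]))


# Hypothetical labeled sample: (amount in thousands, hour of day / 24)
# together with the outcome of the investigation.
labeled = [
    ((0.02, 0.50), "ok"),
    ((0.05, 0.60), "ok"),
    ((0.03, 0.40), "ok"),
    ((0.90, 0.10), "fraud"),
    ((1.20, 0.15), "fraud"),
    ((0.80, 0.05), "fraud"),
]
centroids = train_centroids(labeled)

# Fresh, unlabeled transactions are scored with the learned logic.
new_transactions = [(0.04, 0.55), (1.00, 0.12)]
predictions = [score(t, centroids) for t in new_transactions]
print(predictions)  # ['ok', 'fraud']
```

Six labeled cases are enough to score any number of new transactions here, which mirrors the role of the small labeled sample described above. Real data would of course need far more cases, careful feature scaling and proper validation.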
Text Analytic Applications
Over the last few years text analytics and natural language processing have made huge progress.
- We are able to classify text and derive the content of text collections to enrich our predictive models.
- Logic can be derived automatically to route emails and documents, and explicit and implicit searches for relevant text elements can be performed.
- Text analytics is applied in speech recognition, natural language interpretation and text generation, e.g. by training models to automatically generate titles for a review on a website.
- In SAS Model Studio, natural language generation is used to explain and to interpret the content and relationships of machine learning models.
Being Impressed
Standing in front of the Rosetta Stone in the British Museum in London was an impressive experience for me. In that moment I was able to connect many challenges and milestones of my career in data science with the exhibited artefact behind the glass. It was like touching base and experiencing the importance of data availability first-hand.
Links
https://en.wikipedia.org/wiki/Rosetta_Stone
Chapters 3 and 9 in “Data Quality for Analytics Using SAS” discuss the importance of data availability for data science.
The two pictures in this article were taken in the British Museum in 2019.