News Article
Published: 2015-10-30 in Technology
The Xerox reference tagger has been the benchmark for Part-of-Speech tagging for quite a while now. Today we are proud to announce that our POS-tagger has surpassed this benchmark.
The most important fact, however, is not that we surpassed the reference benchmark, but the remarkably large difference in the amount of data used by Xerox and by us. The Xerox tagger scores just over 96% for correctly tagged words in a sentence. Our tagger currently sits just a few tenths of a percentage point above this. However, the Xerox tagger was initially trained on a human-tagged corpus of around 300 million words in sentences to attain its score. Our tagger, in sharp contrast, is an untrained tagger, meaning that no existing corpus was used to train the system.
Our tagger learns from the sentences that are actually fed to the system for tagging, and we use supervised training to tell the tagger where it went wrong. The MIND|CONSTRUCT POS-tagger reached the benchmarked state after only roughly 2200 words in about 800 sentences. To prevent the system from developing a bias towards certain sentence constructions, we trained it with sentences picked at random from English literature. That also means that the system needed to handle 'old' sentence forms that are no longer used in modern conversation. The system proved able to cope with those sentences without problems.
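To make the idea of learning from corrected sentences concrete, here is a minimal Python sketch. It is not the MIND|CONSTRUCT implementation, which this article does not describe; the class `OnlineTagger`, the function `accuracy`, the default tag, and the tag names are all hypothetical. The sketch starts with no corpus, tags each sentence with what it has learned so far, and updates itself only from the corrections it receives. Accuracy is computed as the fraction of correctly tagged words.

```python
from collections import Counter, defaultdict


class OnlineTagger:
    """Toy tagger that starts with no training corpus (hypothetical example)."""

    def __init__(self, default_tag="NOUN"):
        # Assumption: unknown words fall back to a fixed default tag.
        self.default_tag = default_tag
        # word -> Counter of tags confirmed by a human reviewer
        self.counts = defaultdict(Counter)

    def tag(self, words):
        """Tag each word with its most frequently confirmed tag so far."""
        tags = []
        for word in words:
            seen = self.counts.get(word.lower())
            tags.append(seen.most_common(1)[0][0] if seen else self.default_tag)
        return tags

    def correct(self, words, gold_tags):
        """Supervised feedback: record the correct tag for every word."""
        for word, gold in zip(words, gold_tags):
            self.counts[word.lower()][gold] += 1


def accuracy(predicted, gold):
    """Fraction of words whose predicted tag matches the correct tag."""
    if not gold:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


tagger = OnlineTagger()
words = ["The", "dog", "barks"]
gold = ["DET", "NOUN", "VERB"]

first = tagger.tag(words)    # nothing learned yet: default tag everywhere
tagger.correct(words, gold)  # the reviewer tells the tagger where it went wrong
second = tagger.tag(words)   # the same sentence after feedback

print(accuracy(first, gold), accuracy(second, gold))
```

A per-word lookup like this ignores context entirely, so it only illustrates the feedback loop; a tagger that handles unseen words and ambiguous forms would need considerably more than this.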
NOTE: This news-article is presented here for historical perspective only. This article is more than two years old. Therefore, information in this article might have changed, become incomplete, or even completely invalid since its publication date. Included weblinks (if present in this article) might point to pages that no longer exist, have been moved over time, or now contain unrelated or insufficient information. No expectations or conclusions should be derived from this article or any forward-looking statements therein.
© 2024 MIND|CONSTRUCT