News Article
Beyond the XEROX benchmark POS-tagger

Published: 2015-10-30 in Technology

The Xerox reference tagger has been the benchmark for Part-of-Speech tagging for quite a while now. Today we are proud to announce that our POS-tagger has surpassed this benchmark.

 

The most important fact, however, is not that we surpassed the reference benchmark, but the remarkably large difference in the amount of data used by Xerox and by us. The Xerox tagger scores just over 96% for correctly tagged words in a sentence. Our tagger currently sits just a few tenths of a percentage point above this.
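To make the metric concrete, here is a minimal sketch of how a per-word tagging score like the ones quoted above is usually computed: the share of words whose predicted tag matches a human-assigned reference tag. The function name, tag labels, and example sentence are illustrative assumptions, not part of either tagger.

```python
def token_accuracy(predicted, gold):
    """Return the fraction of words whose predicted tag equals the reference tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical five-word sentence with one tagging error.
gold = ["DET", "NOUN", "VERB", "DET", "NOUN"]
predicted = ["DET", "NOUN", "VERB", "DET", "VERB"]

print(f"{token_accuracy(predicted, gold):.1%}")  # prints 80.0%
```

On this scale, a score of 96% means roughly one mis-tagged word in every 25.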

However, the Xerox tagger was initially trained on a human-tagged corpus of around 300 million words in sentences to attain that score. Our tagger, in sharp contrast, is an untrained tagger, meaning that no existing corpus was used to train the system.

 

Our tagger learns from the sentences that are actually fed to the system for tagging, and we use supervised training to tell the tagger where it went wrong. The MIND|CONSTRUCT POS-tagger reached the benchmark level after only roughly 2200 words in about 800 sentences.
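The general idea of learning from corrections, rather than from a pre-tagged corpus, can be illustrated with a toy example. The sketch below is not the MIND|CONSTRUCT algorithm, which this article does not describe. It swaps in a simple most-frequent-tag lookup, and the class name, method names, and default tag are all hypothetical.

```python
from collections import Counter, defaultdict


class CorrectionDrivenTagger:
    """Toy tagger that starts empty and learns only from corrected sentences."""

    def __init__(self, default_tag="NOUN"):
        self.default_tag = default_tag      # assumed fallback for unseen words
        self.counts = defaultdict(Counter)  # word -> counts of confirmed tags

    def tag(self, words):
        """Tag each word with its most frequently confirmed tag so far."""
        tags = []
        for word in words:
            seen = self.counts.get(word.lower())
            tags.append(seen.most_common(1)[0][0] if seen else self.default_tag)
        return tags

    def correct(self, words, gold_tags):
        """Tag a sentence, count the errors, then learn from the supervisor's tags."""
        predicted = self.tag(words)
        errors = sum(p != g for p, g in zip(predicted, gold_tags))
        for word, gold in zip(words, gold_tags):
            self.counts[word.lower()][gold] += 1
        return errors


tagger = CorrectionDrivenTagger()
sentence = ["The", "dog", "barks"]
gold = ["DET", "NOUN", "VERB"]

first = tagger.correct(sentence, gold)   # nothing learned yet: two wrong guesses
second = tagger.correct(sentence, gold)  # same sentence again: no errors
print(first, second)  # prints 2 0
```

A lookup like this cannot resolve words whose tag depends on context, so it only shows the feedback loop itself: tag, get corrected, update.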

To prevent the system from building a bias towards certain sentence constructions, we trained it with sentences that were randomly picked from English literature. That also means that the system needed to be able to handle 'old' sentence forms that are no longer in use in modern conversation. The system proved able to cope with those sentences without problems.

 

NOTE: This news-article is presented here for historical perspective only.

This article is more than two years old. Therefore, information in this article might have changed, become incomplete, or even completely invalid since its publication date. Included weblinks (if present in this article) might point to pages that no longer exist, have been moved over time, or now contain unrelated or insufficient information. No expectations or conclusions should be derived from this article or any forward-looking statements therein.

© 2024 MIND|CONSTRUCT  