Development: Beyond the XEROX benchmark POS-tagger
Posted: 2015-10-30

The Xerox reference tagger has been the benchmark for Part-of-Speech tagging for quite a while now. Today we are proud to announce that our POS-tagger has surpassed this benchmark. The most important point, however, is not that we surpassed the reference benchmark, but the enormous difference in the amount of training data used by Xerox and by us.

The Xerox tagger tags just over 96% of the words in a sentence correctly. Our own tagger currently sits just a few tenths of a percentage point above this. However, the Xerox tagger was initially trained on a human-tagged corpus of around 300 million words in sentences to attain that score. Our tagger, in sharp contrast, is an untrained tagger, meaning that no existing corpus was used to train the system. Instead, it learns from the sentences that are actually fed to it for tagging, and we use supervised training to tell the tagger where it went wrong. The MIND|CONSTRUCT POS-tagger reached the benchmark level after only roughly 2200 words in about 800 sentences.
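To make the idea of learning from corrections concrete, here is a minimal sketch in Python. It is not the MIND|CONSTRUCT implementation, which is not described in this post. The class `OnlineTagger`, the function `accuracy`, the default tag and the tag names are all assumptions chosen for illustration. The sketch only shows the general pattern: a tagger that starts without any corpus, is scored per word, and is updated with the correct tags after each sentence.

```python
from collections import Counter, defaultdict


class OnlineTagger:
    """Toy tagger (hypothetical) that starts with no training corpus."""

    def __init__(self, default_tag="NOUN"):
        # Assumption: unknown words fall back to a single default tag.
        self.default_tag = default_tag
        # word -> Counter of tags supplied through supervised feedback
        self.counts = defaultdict(Counter)

    def tag(self, words):
        """Return one tag per word, using the most frequent tag seen so far."""
        tags = []
        for word in words:
            seen = self.counts.get(word.lower())
            tags.append(seen.most_common(1)[0][0] if seen else self.default_tag)
        return tags

    def correct(self, words, gold_tags):
        """Supervised feedback: record the right tag for every word."""
        for word, gold in zip(words, gold_tags):
            self.counts[word.lower()][gold] += 1


def accuracy(predicted, gold):
    """Fraction of words whose predicted tag matches the gold tag."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Usage: tag a sentence, feed back the right tags, then tag it again.
tagger = OnlineTagger()
words = ["The", "dog", "barks"]
gold = ["DET", "NOUN", "VERB"]

before = accuracy(tagger.tag(words), gold)  # only the default guess is right
tagger.correct(words, gold)
after = accuracy(tagger.tag(words), gold)
```

A real tagger would also use the surrounding words as context. This sketch looks at each word in isolation, so it cannot resolve words that take different tags in different sentences.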

To prevent the system from building a bias towards certain sentence constructions, we trained it with sentences that were randomly picked from English literature. That also means that the system needed to be able to handle 'old' sentence forms that are no longer in use in modern conversation. The system proved able to cope with those sentences without problems.

Our POS-tagger is available for demonstration to interested parties and (prospective) investors.


©2010-2015 MIND|CONSTRUCT - All rights reserved