Preprocessing

Raw text in the wild rarely comes in a ready-to-understand format; cleaning it up is a tedious but necessary first step in any extraction process. TextRazor takes care of all of this for you.

When processing web pages, TextRazor can automatically strip HTML tags and remove excess boilerplate from your text before analysis, determining the "main" body of content for you.

TextRazor uses fast, language-specific, rule-based tokenizers and sentence segmenters to convert your documents into a sequence of words and sentences, in a style similar to the industry-standard Penn Treebank.
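As a rough illustration of the Penn Treebank style, punctuation and contractions are split into separate tokens. The sentence and token list below are made-up examples rather than actual TextRazor output:

    # Illustrative sketch of Penn Treebank-style tokenization conventions.
    # The sentence and its token list are hypothetical; TextRazor's own
    # language-specific tokenizers produce output in a similar style.
    sentence = "It doesn't take long to tokenize raw text."
    tokens = ["It", "does", "n't", "take", "long", "to", "tokenize", "raw", "text", "."]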

Our extraction algorithms are built on a custom set of high-performance Natural Language Processing tools, including our CRF Part-of-Speech Tagger, Noun Phrase detector, and morphological English lemmatization and stemming algorithms.

Language Support

HTML extraction, tokenization, stemming and sentence detection are supported in all TextRazor languages. Part-of-speech tagging, phrase detection and lemmatization are currently only supported in English.

API Calls

TextRazor will strip the HTML from your content when the "cleanupHTML" property of the request is set to "true". When this option is enabled, the "cleanedText" field of the response will be populated with the HTML-free content that TextRazor analyzed. All offsets returned from the API refer to this content.
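As a minimal sketch, an HTML cleanup request over the REST API might look like the following. The endpoint URL, the "x-textrazor-key" header and the top-level "response" key are assumptions based on the public REST documentation; YOUR_API_KEY is a placeholder.

    import requests

    # Hedged sketch of a REST request with HTML cleanup enabled.
    # Endpoint and header names are assumptions; check the REST documentation.
    api_response = requests.post(
        "https://api.textrazor.com/",
        headers={"x-textrazor-key": "YOUR_API_KEY"},
        data={
            "text": "<html><body><p>Some raw page content.</p></body></html>",
            "extractors": "words",
            "cleanupHTML": "true",
        },
    )

    result = api_response.json()["response"]
    # "cleanedText" holds the HTML-free text that TextRazor analyzed;
    # all offsets elsewhere in the response refer to this string.
    print(result.get("cleanedText"))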

To retrieve the processed tokens and sentences from your text, simply add the "words" extractor to your requests.
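Continuing the sketch above (which already includes the "words" extractor), the sentences and their tokens can then be read from the response. The "sentences", "words", "token", "partOfSpeech", "lemma" and "stem" field names are assumptions about the response layout and should be checked against the REST documentation:

    # Hedged walk over the tokenized output from the request above;
    # field names are assumptions, not confirmed by this page.
    for sentence in result.get("sentences", []):
        for word in sentence.get("words", []):
            print(word.get("token"), word.get("partOfSpeech"),
                  word.get("lemma"), word.get("stem"))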

Read more in our Python Client or REST Documentation.