TextRazor's API helps you rapidly build state-of-the-art language processing technology into your application.
Our main analysis endpoint offers a simple combined call that allows you to perform several different analyses on the same document, for example extracting both the entities mentioned in the text and relations between them. The API allows callers to specify a number of extractors, which control the range of TextRazor's language analysis features.
The Python SDK builds on top of our REST API, adding Pythonic wrappers around TextRazor annotations, automatically adding links between them where possible.
On this page you can find a full API reference for the TextRazor Python API. If you're looking to get started quickly, have a look at the tutorials for some simple examples.
If you have any queries please contact us at support@textrazor.com and we will get back to you promptly. We'd also love to hear from you if you have any ideas for improving the API or documentation.
We offer official Client SDKs in Python, Java, and PHP. Our REST API is easy to integrate in any other language.
Get started with just a few lines of Python:
import textrazor

textrazor.api_key = "API_KEY_GOES_HERE"

client = textrazor.TextRazor(extractors=["entities", "topics"])
response = client.analyze_url("http://www.bbc.co.uk/news/uk-politics-18640916")

for entity in response.entities():
    print(entity.id, entity.relevance_score, entity.confidence_score, entity.freebase_types)
The TextRazor Python SDK is self-contained in a single file with no external dependencies. You can pick up the latest source from GitHub.
The easiest way to get started is with pip:
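pip install textrazor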
Alternatively you can simply copy textrazor.py into your own project.
The TextRazor API identifies each of your requests by your unique API Key, which you can find in the console.
To set the TextRazor API key globally for all requests, set textrazor.api_key = 'API_KEY_GOES_HERE' at the start of your application.
Alternatively, the API key and other connection options can be set when creating each service object.
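For example (the API key and extractor list here are placeholders):

import textrazor

client = textrazor.TextRazor(api_key="API_KEY_GOES_HERE", extractors=["entities"])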
By default the Python client uses SSL connections to encrypt all communication with the TextRazor server.
The TextRazor Python SDK throws a TextRazorAnalysisException whenever it is unable to process your request for any reason. This comes with a descriptive message explaining the problem.
TextRazor errors are reported with their corresponding HTTP error codes and a descriptive message:
200: OK
400: Bad Request
401: Unauthorized
413: Request too large
500: Internal Server Error
Please check the TextRazor status page for updates on any system-wide issues.
To handle any intermittent connection issues we recommend that you design your application to gracefully retry sending failed requests several times before stopping and logging the error. TextRazor operates a resilient high-availability infrastructure, so any potential disruption should be kept to a minimum.
try:
    response = client.analyze_url("http://www.bbc.co.uk/news/uk-politics-18640916")
except textrazor.TextRazorAnalysisException as ex:
    print("Failed to analyze with error:", ex)
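A minimal retry sketch along those lines (the attempt count and backoff delays are illustrative choices, not SDK defaults):

import time
import textrazor

def analyze_with_retries(client, url, max_attempts=3):
    # Retry transient failures a few times before giving up and logging the error.
    for attempt in range(1, max_attempts + 1):
        try:
            return client.analyze_url(url)
        except textrazor.TextRazorAnalysisException as ex:
            if attempt == max_attempts:
                print("Giving up after", attempt, "attempts:", ex)
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff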
TextRazor was designed to work out of the box with a wide range of different types of content. However, there are several steps you can take to help improve the system further for your specific application.
Please do not hesitate to contact us for help with getting the most out of the system for your use case.
All TextRazor analysis functionality is exposed in the textrazor.TextRazor class. To process a document, create a textrazor.TextRazor instance with your API key and the extractors you are interested in. Calls to analyze() and analyze_url() will then process a string or URL.
This class is threadsafe once initialized with the request options. You should create a new instance for each request if you are likely to be changing the request options in a multithreaded environment.
Note that for performance reasons some of the fields of each annotation are Python properties, and some are methods. You can spot the difference in the documentation - the entries with parentheses and params are methods.
The Python SDK expects all strings to be Python unicode strings. If you are using Python 2.x and reading from a UTF-8 encoded source you must first decode it using text.decode('utf-8').
analyze(text)
Calls the TextRazor API with the provided unicode text.
Returns a TextRazorResponse with the parsed data on success. Raises a TextRazorAnalysisException on failure.
analyze_url(url)
Calls the TextRazor API with the provided url.
TextRazor will first download the contents of this URL, and then process the resulting text.
TextRazor will only attempt to analyze text documents. Any invalid UTF-8 characters will be replaced with a space character and ignored. TextRazor limits the total download size to approximately 1MB. Any larger documents will be truncated to that size, and a warning will be returned in the response.
By default, TextRazor will clean all HTML prior to processing. For more control of the cleanup process, see the set_cleanup_mode option.
Returns a TextRazorResponse with the parsed data on success. Raises a TextRazorAnalysisException on failure.
TextRazor's response contains a "raw" Python object. This can be serialized and reconstructed later on, making it easy to save TextRazor objects:
response = client.analyze_url("http://www.bbc.co.uk/news/uk-politics-18640916")

json_content = response.json
new_response = textrazor.TextRazorResponse(json_content)
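Building on this, a minimal sketch of saving the raw response to disk and rebuilding it later (the file name is just an illustration):

import json
import textrazor

# Persist the raw response for later use.
with open("response.json", "w") as out_file:
    json.dump(response.json, out_file)

# Reconstruct the response object from the saved JSON.
with open("response.json") as in_file:
    new_response = textrazor.TextRazorResponse(json.load(in_file))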
set_extractors(extractors)
Sets a list of “Extractors”, which tells TextRazor which analysis functions to perform on your text. For optimal performance, only select the extractors that are explicitly required by your application.
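For example, to request just entities and topics on an existing client:

client.set_extractors(["entities", "topics"])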
set_rules(rules)
set_do_cleanup_HTML(cleanup_html)
set_cleanup_mode(cleanup_mode)
Controls the preprocessing cleanup mode that TextRazor will apply to your content before analysis. For all options aside from "raw" any position offsets returned will apply to the final cleaned text, not the raw HTML. If the cleaned text is required please see the cleanup_return_cleaned option.
set_cleanup_return_cleaned(return_cleaned)
set_cleanup_return_raw(return_raw)
set_cleanup_use_metadata(use_metadata)
When use_metadata is True, TextRazor will use metadata extracted from your document to help in the disambiguation/extraction process. This includes HTML titles and metadata, and can significantly improve results for shorter documents without much other content.
This option has no effect when cleanup_mode is 'raw'. Defaults to True.
set_download_user_agent(user_agent)
Sets the User-Agent header to be used when downloading over HTTP. This should be a descriptive string identifying your application, or an end user's browser user agent if you are performing live requests from a given user.
Defaults to "TextRazor Downloader (https://www.textrazor.com)"
set_language_override(language_override)
set_do_compression(do_compression)
set_do_encryption(do_encryption)
set_entity_dictionaries(entity_dictionaries)
Sets a list of the custom entity dictionaries to match against your content. Each item should be a string ID corresponding to dictionaries you have previously configured through the DictionaryManager interface.
set_entity_dbpedia_type_filters(filters)
set_entity_freebase_type_filters(filters)
set_entity_allow_overlap(allow_overlap)
set_classifiers(classifiers)
Sets a list of classifiers to evaluate against your document. Each entry should be a string ID corresponding to either one of TextRazor's default classifiers, or one you have previously configured through the ClassifierManager interface.
If you aren't tied to a particular taxonomy version, the current textrazor_mediatopics_2023Q1 is a sound starting point for many classification projects.
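For example, a minimal sketch requesting that classifier and printing the matched categories (text is a placeholder for your document):

client = textrazor.TextRazor(extractors=["entities"])
client.set_classifiers(["textrazor_mediatopics_2023Q1"])

response = client.analyze(text)

for category in response.categories():
    print(category.category_id, category.label, category.score)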
ok
error
message
custom_annotation_output()
cleaned_text
raw_text
entailments()
List of all Entailment across all sentences in the response.
entities()
List of all Entity across all sentences in the response.
topics()
List of all Topic in the response.
categories()
List of all ScoredCategory in the response.
noun_phrases()
List of all NounPhrase in the response.
properties()
List of all Property across all sentences in the response.
relations()
List of all Relation across all sentences in the response.
sentences
List of all Sentence in the response.
matching_rules()
words()
List of all Word across all sentences in the response.
language
language_is_reliable
id
english_id
custom_entity_id
confidence_score
dbpedia_types
freebase_types
freebase_id
wikidata_id
matched_positions
matched_words
List of all Word that make up this entity.
matched_text
data
relevance_score
wikipedia_link
Represents a single “Named Entity” extracted from text.
Each entity is disambiguated to Wikipedia and Freebase concepts wherever possible. Where the entity could not be linked the relevant properties will return None.
Entities are returned with both Confidence and Relevance scores when possible. These measure slightly different things. The confidence score is a measure of the engine's confidence that the entity is a valid entity given the document context, whereas the relevance score measures how on-topic or important that entity is to the document. As an example, a news story mentioning "Barack Obama" in passing would assign high confidence to the "President" entity. If the story isn't about politics, however, the same entity might have a low relevance score.
Scores can vary if the same entity is mentioned more than once. As an entity is mentioned in different contexts the engine will report different scores.
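A minimal filtering sketch along those lines (the 0.5 thresholds are purely illustrative):

for entity in response.entities():
    if entity.confidence_score > 0.5 and entity.relevance_score > 0.5:
        print(entity.id, entity.matched_text)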
label
score
wikipedia_link
wikidata_id
Represents a single “Topic” extracted from text.
category_id
label
score
The score TextRazor has assigned to this category, between 0 and 1.
To avoid false positives you might want to ignore categories below a certain score - a good starting point would be 0.5. The best way to find an appropriate threshold is to run a sample set of your documents through the system and manually inspect the results.
classifier_id
Represents a single “Category” that matches your document.
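For example, applying the suggested 0.5 starting threshold when reading categories from a response:

for category in response.categories():
    if category.score >= 0.5:
        print(category.label, category.score)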
context_score
entailed_word
matched_positions
matched_words
List of all Word in the current sentence that generated this entailment.
prior_score
score
Represents a single “entailment” derived from the source text.
Please note - If you need the source word for each Entailment you must request the "words" extractor.
entities()
List of all Entity mentioned in this param.
param_positions
param_words
List of all Word that make up this param.
relation
relation_parent
The Relation that owns this param.
word_positions
words
List of all Word that make up this phrase.
Represents a multi-word phrase extracted from a sentence.
To extract the full text of the noun phrase from the original content you must add the "words" extractor, and use the word offsets to recreate the original string.
client = TextRazor(api_key="YOUR_API_KEY_HERE", extractors=["words", "phrases"])
response = client.analyze(to_analyze)

for np in response.noun_phrases():
    print(to_analyze[np.words[0].input_start_offset:np.words[-1].input_end_offset])
predicate_positions
predicate_words
property_positions
property_words
List of all Word that make up the property that targets the focus words.
Represents a property relation extracted from raw text. A property implies an “is-a” or “has-a” relationship between the predicate (or focus) and its property.
params
List of all RelationParam of this relation.
predicate_positions
predicate_words
List of all Word in this relation.
Represents a grammatical relation between words. Typically owns a number of RelationParam, representing the SUBJECT and OBJECT of the relation.
To extract the full text of the relation predicate or param from the original content you must add the "words" extractor, and use the word offsets to recreate the original string.
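A minimal reconstruction sketch in the same spirit as the noun phrase example above (to_analyze is a placeholder for the analyzed text):

client = TextRazor(api_key="YOUR_API_KEY_HERE", extractors=["relations", "words"])
response = client.analyze(to_analyze)

for relation in response.relations():
    # Rebuild the predicate text from the original string using word offsets.
    predicate = to_analyze[relation.predicate_words[0].input_start_offset:
                           relation.predicate_words[-1].input_end_offset]
    for param in relation.params:
        param_text = to_analyze[param.param_words[0].input_start_offset:
                                param.param_words[-1].input_end_offset]
        print(param.relation, param_text, "->", predicate)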
children
List of all Word that make up the children of this word. Returns an empty list for leaf words, or if the “dependency-trees” extractor was not requested.
entailments
List of all Entailment that this word entails.
entities
List of all Entity that this word is a part of.
input_end_offset
input_start_offset
lemma
noun_phrases
List of all NounPhrase that this word is a member of.
parent
parent_position
part_of_speech
senses
spelling_suggestions
position
property_predicates
List of all Property that this word is a predicate (or focus) member of.
relation_params
List of all RelationParam that this word is a member of.
relation_to_parent
relations
List of all Relation that this word is a predicate of.
stem
token
Represents a single Word (token) extracted by TextRazor.
For convenience the Python SDK automatically creates helper functions to retrieve annotations extracted from that word.
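A minimal sketch printing a few per-word annotations (assuming the "words" and "dependency-trees" extractors were requested; text is a placeholder):

client = textrazor.TextRazor(extractors=["words", "dependency-trees"])
response = client.analyze(text)

for word in response.words():
    print(word.token, word.lemma, word.part_of_speech, word.parent_position)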
root_word
words
List of all Word in this sentence.
Represents a single sentence extracted by TextRazor.
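For example, rebuilding each sentence from its tokens (note that sentences is a property, not a method):

for sentence in response.sentences:
    print(" ".join(word.token for word in sentence.words))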
TextRazor Entity Dictionaries allow you to augment the TextRazor entity extraction system with custom entities that are relevant to your application.
Entity Dictionaries are useful for identifying domain specific entities that may not be common enough for TextRazor to know about out of the box - examples might be Product names, Drug names, and specific person names.
TextRazor supports flexible, high performance matching of dictionaries up to several million entries, limited only by your account plan. Entries are automatically indexed and distributed across our analysis infrastructure to ensure they scale seamlessly with your application.
Once you have created a dictionary, add its ID to your analysis requests with set_entity_dictionaries. TextRazor will look for any DictionaryEntry in the dictionary that can be matched to your document, and return it as part of the standard Entity response.
create_dictionary(self, dictionary_properties)
Creates a new dictionary.
See the properties of class Dictionary for valid options.
import textrazor

manager = textrazor.DictionaryManager('YOUR_API_KEY')
manager.create_dictionary({'id': 'UNIQUE_ID'})
all_dictionaries(self)
Returns a list of all Dictionary in your account.
for dictionary in manager.all_dictionaries():
    print(dictionary.id)
get_dictionary(self, id)
Returns a Dictionary object by id.
print(manager.get_dictionary('UNIQUE_ID').language)
delete_dictionary(self, id)
Deletes a dictionary and all its entries by id.
manager.delete_dictionary('UNIQUE_ID')
all_entries(self, dictionary_id, limit=None, offset=None)
Returns an AllDictionaryEntriesResponse containing all DictionaryEntry for a dictionary, along with paging information.
Larger dictionaries can be too large to download all at once. Where possible it is recommended that you use the limit and offset parameters to control the TextRazor response, rather than filtering client side.

entry_response = manager.all_entries('UNIQUE_ID', limit=10, offset=0)

for entry in entry_response.entries:
    print(entry.text)
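A minimal paging sketch building on that call (it assumes a page returning fewer entries than the requested limit is the last one):

page_size = 100
offset = 0

while True:
    entry_response = manager.all_entries('UNIQUE_ID', limit=page_size, offset=offset)
    for entry in entry_response.entries:
        print(entry.text)
    if len(entry_response.entries) < page_size:
        break  # last page reached
    offset += page_size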
add_entries(self, dictionary_id, entries)
Adds entries to a dictionary.
Entries must be a list of dicts, each giving the properties of a new DictionaryEntry object. At a minimum this would be [{'text': 'test text to match'}].

manager.add_entries('UNIQUE_ID', [
    {'text': 'test text to match'},
    {'text': 'more text to match', 'id': 'UNIQUE_ENTRY_ID'}
])
get_entry(self, dictionary_id, entry_id)
Retrieves a specific DictionaryEntry by dictionary id and entry id.
print(manager.get_entry('UNIQUE_ID', 'UNIQUE_ENTRY_ID').text)
delete_entry(self, dictionary_id, entry_id)
Deletes a specific DictionaryEntry by dictionary id and entry id.
For performance reasons it's always faster to perform major changes to dictionaries by deleting and recreating the whole dictionary rather than removing many individual entries.
manager.delete_entry('UNIQUE_ID', 'UNIQUE_ENTRY_ID')
Users on any of our paid plans can create up to 10 dictionaries, with a total of 10,000 entries. TextRazor supports custom dictionaries of millions of entries; please contact us to discuss increasing this limit for your account.
Free account holders are able to create 1 Dictionary with a total of 50 Entries.
match_type
Controls any pre-processing done on your dictionary before matching.
Defaults to ''.
case_insensitive
When True, this dictionary will match both uppercase and lowercase characters.
Defaults to ''.
id
The unique identifier for this dictionary.
language
When set to an ISO-639-2 language code, this dictionary will only match documents of the corresponding language.
When set to 'any', this dictionary will match any document.
Defaults to 'any'.
Represents a single Dictionary, uniquely identified by an id. Each Dictionary owns a set of DictionaryEntry.
Dictionary and DictionaryEntry can only be manipulated through the DictionaryManager object.
import textrazor

textrazor.api_key = 'YOUR_API_KEY_HERE'

manager = textrazor.DictionaryManager()
manager.create_dictionary({'id': 'developers'})

new_entity_type_data = {'type': ['cpp_developer']}
manager.add_entries('developers', [
    {'id': 'DEV1', 'text': 'Andrei Alexandrescu', 'data': new_entity_type_data},
    {'id': 'DEV2', 'text': 'Bjarne Stroustrup', 'data': new_entity_type_data}
])

print(manager.all_entries('developers'))

manager.delete_entry('developers', 'DEV1')

client = textrazor.TextRazor()
client.set_entity_dictionaries(['developers'])
client.set_extractors(['entities'])

response = client.analyze('Although it is very early in the process, higher-level parallelism is slated to be a key theme of the next version of C++, says Bjarne Stroustrup')

for entity in response.entities():
    print(entity.custom_entity_id)
Represents a single dictionary entry, belonging to a Dictionary object.
id
Unique ID for this entry, used to identify and manipulate specific entries.
Defaults to an automatically generated unique id.
text
String representing the text to match to this DictionaryEntry.
data
A dictionary mapping string keys to lists of string data values. Where TextRazor matches this entry to your content in analysis, it will return the dictionary as part of the entity response.
This is useful for adding application-specific metadata to each entry. Dictionary data is limited to a maximum of 10 keys, and a total of 1000 characters across all the mapped values.
{'type':['people', 'person', 'politician']}
TextRazor can classify your documents according to the IPTC Media Topics, IPTC Newscode or IAB QAG taxonomies using our predefined models.
Sometimes the categories you might be interested in aren't well represented by off-the-shelf classifiers. TextRazor gives you the flexibility to create a customized model for your particular project.
TextRazor uses "concept queries" to define new categories. These are similar to the sort of boolean query that you might type into a search engine, except they query the semantic meaning of the document you are analyzing. Each concept query uses a word or two in English to define your category.
For an example of how to create a custom classifier please see our tutorials. If you aren't getting the results you need, please contact us, we'd be happy to help.
The ClassifierManager class offers a simple interface for creating and managing your classifiers. Classifiers only need to be uploaded once; they are safely stored on our servers to use with future analyze requests. Simply add the classifier name to your request's "classifiers" list.
create_classifier(self, classifier_id, categories)
Creates a new classifier using the provided list of Category.
See the properties of class Category for valid options.
import textrazor

textrazor.api_key = 'YOUR_API_KEY_HERE'

manager = textrazor.ClassifierManager()

test_classifier_id = "my_test_classifier"
manager.create_classifier(test_classifier_id, [
    {"category_id": "ID1", "query": "or(concept('soccer'),concept('association football'))"}
])
delete_classifier(self, classifier_id)
Deletes a Classifier and all its Categories by id.
manager.delete_classifier(test_classifier_id)
all_categories(self, classifier_id, limit=None, offset=None)
Returns an AllCategoriesResponse containing all Category for a classifier, along with paging information.
Larger classifiers can be too large to download all at once. Where possible it is recommended that you use the limit and offset parameters to control the TextRazor response, rather than filtering client side.

for category in manager.all_categories(test_classifier_id).categories:
    print(category.query)
delete_category(self, classifier_id, category_id)
Deletes a Category object by id.
For performance reasons it's always better to delete and recreate a whole classifier rather than its individual categories one at a time.
manager.delete_category(test_classifier_id, "ID1")
get_category(self, classifier_id, category_id)
Returns a Category object by id.
print(manager.get_category(test_classifier_id, "ID1").category_id)
Users on any of our paid plans can create up to 10 Classifiers, with a total of 1000 categories. Please contact us to discuss increasing this limit for your account.
Free account holders are able to create 1 Classifier with a total of 50 Categories.
There are no restrictions on the use of classifiers that have been pre-defined by TextRazor.
category_id
label
query
Represents a single Category that belongs to a Classifier. Each category consists of a unique ID, and a query at a minimum.
{ "categoryId" : "100", "label" : "Golf", "query" : "concept('sport>golf')" }
Allows you to retrieve data about your TextRazor account, designed to help manage and control your usage.
get_account()
Returns a complete Account object.
The account endpoint is read only. Calls to this endpoint do not count towards your daily quota.
plan
concurrent_request_limit
concurrent_requests_used
plan_daily_included_requests
requests_used_today
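A minimal usage sketch (the AccountManager class name is an assumption, mirroring the DictionaryManager and ClassifierManager patterns above):

import textrazor

textrazor.api_key = 'YOUR_API_KEY_HERE'

# Assumed manager class; get_account() and the property names below are from this reference.
manager = textrazor.AccountManager()
account = manager.get_account()

print(account.plan)
print(account.requests_used_today, "/", account.plan_daily_included_requests)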