TextRazor's API helps you rapidly build state-of-the-art language processing technology into your application.
Our main analysis endpoint offers a simple combined call that allows you to perform several different analyses on the same document, for example extracting both the entities mentioned in the text and relations between them. The API allows callers to specify a number of extractors, which control the range of TextRazor's language analysis features.
The Python SDK builds on top of our REST API, adding Pythonic wrappers around TextRazor annotations, automatically adding links between them where possible.
On this page you can find a full API reference for the TextRazor Python API. If you're looking to get started quickly, have a look at the tutorials for some simple examples.
If you have any queries please contact us at support@textrazor.com and we will get back to you promptly. We'd also love to hear from you if you have any ideas for improving the API or documentation.
We offer official Client SDKs in Python, Java, and PHP. Our REST API is easy to integrate in any other language.
Get started with just a few lines of Python:
import textrazor

textrazor.api_key = "API_KEY_GOES_HERE"

client = textrazor.TextRazor(extractors=["entities", "topics"])
response = client.analyze_url("http://www.bbc.co.uk/news/uk-politics-18640916")

for entity in response.entities():
    print(entity.id, entity.relevance_score, entity.confidence_score, entity.freebase_types)
The TextRazor Python SDK is self-contained in a single file with no external dependencies. You can pick up the latest source from GitHub.
The easiest way to get started is with pip:
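pip install textrazor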
Alternatively you can simply copy textrazor.py into your own project.
The TextRazor API identifies each of your requests by your unique API Key, which you can find in the console.
To set the TextRazor API key globally for all requests, set textrazor.api_key = 'API_KEY_GOES_HERE' at the start of your application.
Alternatively, the API key and other connection options can be set when creating each service object.
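For example (the API key and extractor list here are placeholders):

import textrazor

client = textrazor.TextRazor(api_key="API_KEY_GOES_HERE", extractors=["entities"])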
By default the Python client uses SSL connections to encrypt all communication with the TextRazor server.
The TextRazor Python SDK throws a TextRazorAnalysisException whenever it is unable to process your request for any reason. This comes with a descriptive message explaining the problem.
TextRazor errors are reported with their corresponding HTTP error codes and a descriptive message:
200: OK
400: Bad Request
401: Unauthorized
413: Request too large
500: Internal Server Error
Please check the TextRazor status page for updates on any system-wide issues.
To handle any intermittent connection issues we recommend that you design your application to gracefully retry sending failed requests several times before stopping and logging the error. TextRazor operates a resilient high-availability infrastructure, so any potential disruption should be kept to a minimum.
try:
    response = client.analyze_url("http://www.bbc.co.uk/news/uk-politics-18640916")
except textrazor.TextRazorAnalysisException as ex:
    print("Failed to analyze with error:", ex)
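A minimal retry sketch along those lines (the attempt count and backoff delays are illustrative choices, not SDK defaults):

import time
import textrazor

def analyze_with_retries(client, url, max_attempts=3):
    # Retry transient failures a few times before giving up and logging the error.
    for attempt in range(1, max_attempts + 1):
        try:
            return client.analyze_url(url)
        except textrazor.TextRazorAnalysisException as ex:
            if attempt == max_attempts:
                print("Giving up after", attempt, "attempts:", ex)
                raise
            time.sleep(2 ** attempt)  # simple exponential backoff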
TextRazor was designed to work out of the box with a wide range of different types of content. However, there are several steps you can take to help improve the system further for your specific application.
Please do not hesitate to contact us for help with getting the most out of the system for your use case.
All TextRazor analysis functionality is exposed in the textrazor.TextRazor class. To process a document, create a textrazor.TextRazor instance with your API key and the extractors you are interested in. Calls to analyze() and analyze_url() will then process a string or URL.
This class is threadsafe once initialized with the request options. You should create a new instance for each request if you are likely to be changing the request options in a multithreaded environment.
Note that for performance reasons some of the fields of each annotation are Python properties, and some are methods. You can spot the difference in the documentation - the entries with parentheses and params are methods.
The Python SDK expects all strings to be Python unicode strings. If you are using Python 2.x and reading from a UTF-8 encoded source you must first decode it using text.decode('utf-8').
analyze(text)
Calls the TextRazor API with the provided unicode text.
Returns a TextRazorResponse with the parsed data on success. Raises a TextRazorAnalysisException on failure.
analyze_url(url)
Calls the TextRazor API with the provided url.
TextRazor will first download the contents of this URL, and then process the resulting text.
TextRazor will only attempt to analyze text documents. Any invalid UTF-8 characters will be replaced with a space character and ignored. TextRazor limits the total download size to approximately 1MB. Any larger documents will be truncated to that size, and a warning will be returned in the response.
By default, TextRazor will clean all HTML prior to processing. For more control of the cleanup process, see the set_cleanup_mode option.
Returns a TextRazorResponse with the parsed data on success. Raises a TextRazorAnalysisException on failure.
TextRazor's response contains a "raw" Python object. This can be serialized and reconstructed later on, making it easy to save TextRazor objects:
response = client.analyze_url("http://www.bbc.co.uk/news/uk-politics-18640916")

json_content = response.json
new_response = textrazor.TextRazorResponse(json_content)
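Building on this, a minimal sketch of saving the raw response to disk and rebuilding it later (the file name is just an illustration):

import json
import textrazor

# Persist the raw response for later use.
with open("response.json", "w") as out_file:
    json.dump(response.json, out_file)

# Reconstruct the response object from the saved JSON.
with open("response.json") as in_file:
    new_response = textrazor.TextRazorResponse(json.load(in_file))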
set_extractors(extractors)
Sets a list of “Extractors”, which tells TextRazor which analysis functions to perform on your text. For optimal performance, only select the extractors that are explicitly required by your application.
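For example, to request just entities and topics on an existing client:

client.set_extractors(["entities", "topics"])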
set_rules(rules)
set_do_cleanup_HTML(cleanup_html)
set_cleanup_mode(cleanup_mode)
Controls the preprocessing cleanup mode that TextRazor will apply to your content before analysis. For all options aside from "raw" any position offsets returned will apply to the final cleaned text, not the raw HTML. If the cleaned text is required please see the cleanup_return_cleaned option.
set_cleanup_return_cleaned(return_cleaned)
set_cleanup_return_raw(return_raw)
set_cleanup_use_metadata(use_metadata)
When use_metadata is True, TextRazor will use metadata extracted from your document to help in the disambiguation/extraction process. This includes HTML titles and metadata, and can significantly improve results for shorter documents without much other content.
This option has no effect when cleanup_mode is 'raw'. Defaults to True.
set_download_user_agent(user_agent)
Sets the User-Agent header to be used when downloading over HTTP. This should be a descriptive string identifying your application, or an end user's browser user agent if you are performing live requests from a given user.
Defaults to "TextRazor Downloader (https://www.textrazor.com)"
set_language_override(language_override)
set_do_compression(do_compression)
set_do_encryption(do_encryption)
set_entity_dictionaries(entity_dictionaries)
Sets a list of the custom entity dictionaries to match against your content. Each item should be a string ID corresponding to dictionaries you have previously configured through the DictionaryManager interface.
set_entity_dbpedia_type_filters(filters)
set_entity_freebase_type_filters(filters)
set_entity_allow_overlap(allow_overlap)
set_classifiers(classifiers)
Sets a list of classifiers to evaluate against your document. Each entry should be a string ID corresponding to either one of TextRazor's default classifiers, or one you have previously configured through the ClassifierManager interface.
If you aren't tied to a particular taxonomy version, the current textrazor_mediatopics_2023Q1 is a sound starting point for many classification projects.
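For example, a minimal sketch requesting that classifier and printing the matched categories (text is a placeholder for your document):

client = textrazor.TextRazor(extractors=["entities"])
client.set_classifiers(["textrazor_mediatopics_2023Q1"])

response = client.analyze(text)

for category in response.categories():
    print(category.category_id, category.label, category.score)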
ok
error
message
custom_annotation_output()
cleaned_text
raw_text
entailments()
List of all Entailment across all sentences in the response.
entities()
List of all Entity across all sentences in the response.
topics()
List of all Topic in the response.
categories()
List of all ScoredCategory in the response.
noun_phrases()
List of all NounPhrase in the response.
properties()
List of all Property across all sentences in the response.
relations()
List of all Relation across all sentences in the response.
sentences
List of all Sentence in the response.
matching_rules()
words()
List of all Word across all sentences in the response.
language
language_is_reliable
id
english_id
custom_entity_id
confidence_score
dbpedia_types
freebase_types
freebase_id
wikidata_id
matched_positions
matched_words
List of all Word that make up this entity.
matched_text
data
relevance_score
wikipedia_link
Represents a single “Named Entity” extracted from text.
Each entity is disambiguated to Wikipedia and Freebase concepts wherever possible. Where the entity could not be linked the relevant properties will return None.
Entities are returned with both Confidence and Relevance scores when possible. These measure slightly different things. The confidence score is a measure of the engine's confidence that the entity is a valid entity given the document context, whereas the relevance score measures how on-topic or important that entity is to the document. As an example, a news story mentioning "Barack Obama" in passing would assign high confidence to the "President" entity. If the story isn't about politics, however, the same entity might have a low relevance score.
Scores can vary if the same entity is mentioned more than once. As an entity is mentioned in different contexts the engine will report different scores.
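A minimal filtering sketch along those lines (the 0.5 thresholds are purely illustrative):

for entity in response.entities():
    if entity.confidence_score > 0.5 and entity.relevance_score > 0.5:
        print(entity.id, entity.matched_text)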
label
score
wikipedia_link
wikidata_id
Represents a single “Topic” extracted from text.
category_id
label
score
The score TextRazor has assigned to this category, between 0 and 1.
To avoid false positives you might want to ignore categories below a certain score - a good starting point would be 0.5. The best way to find an appropriate threshold is to run a sample set of your documents through the system and manually inspect the results.
classifier_id
Represents a single “Category” that matches your document.
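For example, applying the suggested 0.5 starting threshold when reading categories from a response:

for category in response.categories():
    if category.score >= 0.5:
        print(category.label, category.score)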
context_score
entailed_word
matched_positions
matched_words
List of all Word in the current sentence that generated this entailment.
prior_score
score
Represents a single “entailment” derived from the source text.
Please note - If you need the source word for each Entailment you must request the "words" extractor.
entities()
List of all Entity mentioned in this param.
param_positions
param_words
List of all Word that make up this param.
relation
relation_parent
The Relation that owns this param.
word_positions
words
List of all Word that make up this phrase.
Represents a multi-word phrase extracted from a sentence.
To extract the full text of the noun phrase from the original content you must add the "words" extractor, and use the word offsets to recreate the original string.
client = TextRazor(api_key="YOUR_API_KEY_HERE", extractors=["words", "phrases"])
response = client.analyze(to_analyze)

for np in response.noun_phrases():
    print(to_analyze[np.words[0].input_start_offset:np.words[-1].input_end_offset])
predicate_positions
predicate_words
property_positions
property_words
List of all Word that make up the property that targets the focus words.
Represents a property relation extracted from raw text. A property implies an “is-a” or “has-a” relationship between the predicate (or focus) and its property.
params
List of all RelationParam of this relation.
predicate_positions
predicate_words
List of all Word in this relation.
Represents a grammatical relation between words. Typically owns a number of RelationParam, representing the SUBJECT and OBJECT of the relation.
To extract the full text of the relation predicate or param from the original content you must add the "words" extractor, and use the word offsets to recreate the original string.
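A minimal reconstruction sketch in the same spirit as the noun phrase example above (to_analyze is a placeholder for the analyzed text):

client = TextRazor(api_key="YOUR_API_KEY_HERE", extractors=["relations", "words"])
response = client.analyze(to_analyze)

for relation in response.relations():
    # Rebuild the predicate text from the original string using word offsets.
    predicate = to_analyze[relation.predicate_words[0].input_start_offset:
                           relation.predicate_words[-1].input_end_offset]
    for param in relation.params:
        param_text = to_analyze[param.param_words[0].input_start_offset:
                                param.param_words[-1].input_end_offset]
        print(param.relation, param_text, "->", predicate)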
children
List of all Word that make up the children of this word. Returns an empty list for leaf words, or if the “dependency-trees” extractor was not requested.
entailments
List of all Entailment that this word entails.
entities
List of all Entity that this word is a part of.
input_end_offset
input_start_offset
lemma
noun_phrases
List of all NounPhrase that this word is a member of.
parent
parent_position
part_of_speech
senses
spelling_suggestions
position
property_predicates
List of all Property that this word is a predicate (or focus) member of.
relation_params
List of all RelationParam that this word is a member of.
relation_to_parent
relations
List of all Relation that this word is a predicate of.
stem
token
Represents a single Word (token) extracted by TextRazor.
For convenience the Python SDK automatically creates helper functions to retrieve annotations extracted from that word.
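A minimal sketch printing a few per-word annotations (assuming the "words" and "dependency-trees" extractors were requested; text is a placeholder):

client = textrazor.TextRazor(extractors=["words", "dependency-trees"])
response = client.analyze(text)

for word in response.words():
    print(word.token, word.lemma, word.part_of_speech, word.parent_position)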
root_word
words
List of all Word in this sentence.
Represents a single sentence extracted by TextRazor.
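For example, rebuilding each sentence from its tokens (note that sentences is a property, not a method):

for sentence in response.sentences:
    print(" ".join(word.token for word in sentence.words))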
TextRazor Entity Dictionaries allow you to augment the TextRazor entity extraction system with custom entities that are relevant to your application.
Entity Dictionaries are useful for identifying domain specific entities that may not be common enough for TextRazor to know about out of the box - examples might be Product names, Drug names, and specific person names.
TextRazor supports flexible, high performance matching of dictionaries up to several million entries, limited only by your account plan. Entries are automatically indexed and distributed across our analysis infrastructure to ensure they scale seamlessly with your application.
Once you have created a dictionary, add its ID to your analysis requests with set_entity_dictionaries. TextRazor will look for any DictionaryEntry in the dictionary that can be matched to your document, and return it as part of the standard Entity response.
create_dictionary(self, dictionary_properties)
Creates a new dictionary.
See the properties of class Dictionary for valid options.
import textrazor

manager = textrazor.DictionaryManager('YOUR_API_KEY')
manager.create_dictionary({'id': 'UNIQUE_ID'})
all_dictionaries(self)
Returns a list of all Dictionary in your account.
for dictionary in manager.all_dictionaries():
    print(dictionary.id)
get_dictionary(self, id)
Returns a Dictionary object by id.
print(manager.get_dictionary('UNIQUE_ID').language)
delete_dictionary(self, id)
Deletes a dictionary and all its entries by id.
manager.delete_dictionary('UNIQUE_ID')
all_entries(self, dictionary_id, limit=None, offset=None)
Returns an AllDictionaryEntriesResponse containing all DictionaryEntry for a dictionary, along with paging information.
Larger dictionaries can be too large to download all at once. Where possible it is recommended that you use the limit and offset parameters to control the TextRazor response, rather than filtering client side.

entry_response = manager.all_entries('UNIQUE_ID', limit=10, offset=0)

for entry in entry_response.entries:
    print(entry.text)
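A minimal paging sketch building on that call (it assumes a page returning fewer entries than the requested limit is the last one):

page_size = 100
offset = 0

while True:
    entry_response = manager.all_entries('UNIQUE_ID', limit=page_size, offset=offset)
    for entry in entry_response.entries:
        print(entry.text)
    if len(entry_response.entries) < page_size:
        break  # last page reached
    offset += page_size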
add_entries(self, dictionary_id, entries)
Adds entries to a dictionary.
Entries must be a list of dicts, each giving the properties of a new DictionaryEntry object. At a minimum this would be [{'text': 'test text to match'}].

manager.add_entries('UNIQUE_ID', [
    {'text': 'test text to match'},
    {'text': 'more text to match', 'id': 'UNIQUE_ENTRY_ID'}
])
get_entry(self, dictionary_id, entry_id)
Retrieves a specific DictionaryEntry by dictionary id and entry id.
print(manager.get_entry('UNIQUE_ID', 'UNIQUE_ENTRY_ID').text)
delete_entry(self, dictionary_id, entry_id)
Deletes a specific DictionaryEntry by dictionary id and entry id.
For performance reasons it's always faster to perform major changes to dictionaries by deleting and recreating the whole dictionary rather than removing many individual entries.
manager.delete_entry('UNIQUE_ID', 'UNIQUE_ENTRY_ID')
Users on any of our paid plans can create up to 10 dictionaries, with a total of 10,000 entries. TextRazor supports custom dictionaries of millions of entries; please contact us to discuss increasing this limit for your account.
Free account holders are able to create 1 Dictionary with a total of 50 Entries.
match_type
Controls any pre-processing done on your dictionary before matching.
Defaults to ''.
case_insensitive
When True, this dictionary will match both uppercase and lowercase characters.
Defaults to ''.
id
The unique identifier for this dictionary.
language
When set to an ISO-639-2 language code, this dictionary will only match documents of the corresponding language.
When set to 'any', this dictionary will match any document.
Defaults to 'any'.
Represents a single Dictionary, uniquely identified by an id. Each Dictionary owns a set of DictionaryEntry.
Dictionary and DictionaryEntry can only be manipulated through the DictionaryManager object.
import textrazor

textrazor.api_key = 'YOUR_API_KEY_HERE'

manager = textrazor.DictionaryManager()
manager.create_dictionary({'id': 'developers'})

new_entity_type_data = {'type': ['cpp_developer']}
manager.add_entries('developers', [
    {'id': 'DEV1', 'text': 'Andrei Alexandrescu', 'data': new_entity_type_data},
    {'id': 'DEV2', 'text': 'Bjarne Stroustrup', 'data': new_entity_type_data}
])

print(manager.all_entries('developers'))

manager.delete_entry('developers', 'DEV1')

client = textrazor.TextRazor()
client.set_entity_dictionaries(['developers'])
client.set_extractors(['entities'])

response = client.analyze('Although it is very early in the process, higher-level parallelism is slated to be a key theme of the next version of C++, says Bjarne Stroustrup')

for entity in response.entities():
    print(entity.custom_entity_id)
Represents a single dictionary entry, belonging to a Dictionary object.
id
Unique ID for this entry, used to identify and manipulate specific entries.
Defaults to an automatically generated unique id.
text
String representing the text to match to this DictionaryEntry.
data
A dictionary mapping string keys to lists of string data values. Where TextRazor matches this entry to your content in analysis, it will return the dictionary as part of the entity response.
This is useful for adding application-specific metadata to each entry. Dictionary data is limited to a maximum of 10 keys, and a total of 1000 characters across all the mapped values.
{'type':['people', 'person', 'politician']}
TextRazor can classify your documents according to the IPTC Media Topics, IPTC Newscode or IAB QAG taxonomies using our predefined models.
Sometimes the categories you might be interested in aren't well represented by off-the-shelf classifiers. TextRazor gives you the flexibility to create a customized model for your particular project.
TextRazor uses "concept queries" to define new categories. These are similar to the sort of boolean query that you might type into a search engine, except they query the semantic meaning of the document you are analyzing. Each concept query uses a word or two in English to define your category.
For an example of how to create a custom classifier please see our tutorials. If you aren't getting the results you need, please contact us, we'd be happy to help.
The ClassifierManager class offers a simple interface for creating and managing your classifiers. Classifiers only need to be uploaded once; they are safely stored on our servers to use with future analyze requests. Simply add the classifier name to your request's "classifiers" list.
create_classifier(self, classifier_id, categories)
Creates a new classifier using the provided list of Category.
See the properties of class Category for valid options.
import textrazor

textrazor.api_key = 'YOUR_API_KEY_HERE'

manager = textrazor.ClassifierManager()

test_classifier_id = "my_test_classifier"
manager.create_classifier(test_classifier_id, [
    {"category_id": "ID1", "query": "or(concept('soccer'),concept('association football'))"}
])
delete_classifier(self, classifier_id)
Deletes a Classifier and all its Categories by id.
manager.delete_classifier(test_classifier_id)
all_categories(self, classifier_id, limit=None, offset=None)
Returns an AllCategoriesResponse containing all Category for a classifier, along with paging information.
Larger classifiers can be too large to download all at once. Where possible it is recommended that you use the limit and offset parameters to control the TextRazor response, rather than filtering client side.

for category in manager.all_categories(test_classifier_id).categories:
    print(category.query)
delete_category(self, classifier_id, category_id)
Deletes a Category object by id.
For performance reasons it's always better to delete and recreate a whole classifier rather than its individual categories one at a time.
manager.delete_category(test_classifier_id, "ID1")
get_category(self, classifier_id, category_id)
Returns a Category object by id.
print(manager.get_category(test_classifier_id, "ID1").category_id)
Users on any of our paid plans can create up to 10 Classifiers, with a total of 1000 categories. Please contact us to discuss increasing this limit for your account.
Free account holders are able to create 1 Classifier with a total of 50 Categories.
There are no restrictions on the use of classifiers that have been pre-defined by TextRazor.
category_id
label
query
Represents a single Category that belongs to a Classifier. Each category consists of a unique ID, and a query at a minimum.
{ "categoryId" : "100", "label" : "Golf", "query" : "concept('sport>golf')" }
Allows you to retrieve data about your TextRazor account, designed to help manage and control your usage.
get_account()
Returns a complete Account object.
The account endpoint is read only. Calls to this endpoint do not count towards your daily quota.
plan
concurrent_request_limit
concurrent_requests_used
plan_daily_included_requests
requests_used_today
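A minimal usage sketch (the AccountManager class name is an assumption, mirroring the DictionaryManager and ClassifierManager patterns above):

import textrazor

textrazor.api_key = 'YOUR_API_KEY_HERE'

# Assumed manager class; get_account() and the property names below are from this reference.
manager = textrazor.AccountManager()
account = manager.get_account()

print(account.plan)
print(account.requests_used_today, "/", account.plan_daily_included_requests)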