REST Reference

TextRazor's API helps you rapidly build state-of-the-art language processing technology into your application.

Our main analysis endpoint offers a simple combined call that allows you to perform several different analyses on the same document, for example extracting both the entities mentioned in the text and relations between them. The API allows callers to specify a number of extractors, which control the range of TextRazor's language analysis features.

If you have any queries please contact us at support@textrazor.com and we will get back to you promptly. We'd also love to hear from you if you have any ideas for improving the API or documentation.

We offer official Client SDKs in Python, Java, and PHP. Our REST API is easy to integrate in any other language.

Endpoint

http://api.textrazor.com

Secure Endpoint

https://api.textrazor.com

Calling the API

The TextRazor API identifies each of your requests by your unique API Key, which you can find in the console.

All endpoints expect the API Key to be passed as an HTTP header - "X-TextRazor-Key: YOUR_API_KEY".

When calling the public TextRazor API you should always use the secure HTTPS endpoint to protect your credentials.

To enable compression of the TextRazor response JSON, add the "Accept-Encoding: gzip" header to your request. This can significantly reduce the size of TextRazor responses and is recommended for most requests.

curl -X POST \
    -H "x-textrazor-key: YOUR_API_KEY" \
    -d "extractors=entities,entailments" \
    -d "text=Spain's stricken Bankia expects to sell off its vast portfolio of industrial holdings that includes a stake in the parent company of British Airways and Iberia." \
    https://api.textrazor.com/
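The same call can be sketched in Python using only the standard library. The helper names `build_analysis_request` and `decode_body` are illustrative assumptions and are not part of an official SDK. The request is only constructed here; nothing is sent unless you uncomment the `urlopen` lines.

```python
import gzip
import urllib.parse
import urllib.request

API_URL = "https://api.textrazor.com/"  # secure endpoint listed on this page


def build_analysis_request(api_key, text, extractors):
    """Build (but do not send) a form-encoded POST for the analysis endpoint."""
    body = urllib.parse.urlencode({
        "extractors": ",".join(extractors),  # comma-separated, as in the curl example
        "text": text,
    }).encode("utf-8")
    headers = {
        "X-TextRazor-Key": api_key,
        "Accept-Encoding": "gzip",  # ask for a compressed response
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def decode_body(raw, content_encoding):
    """Return the response body as text, decompressing it if gzip was used."""
    if content_encoding and content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode("utf-8")


request = build_analysis_request(
    "YOUR_API_KEY",
    "Bankia expects to sell off its industrial holdings.",
    ["entities", "entailments"],
)
# To send the request:
# with urllib.request.urlopen(request) as resp:
#     print(decode_body(resp.read(), resp.headers.get("Content-Encoding")))
```

Checking the Content-Encoding response header before decompressing keeps the helper safe when the server chooses not to compress.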

Errors

TextRazor will return an HTTP error when it encounters a problem processing your text. A more detailed message is provided in the response body.

200: OK
Request processed successfully.
400: Bad Request
Request was in an invalid format. Check that your params have been properly encoded in the POST body and that your content is UTF-8 encoded.
401: Unauthorized
The API key was invalid or has fully used up its quota.
413: Request too large
The request was too large (Up to 200kb may be processed per request).
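As a rough sketch, a client might turn these status codes into readable messages, preferring the detailed message from the response body when one is present. The function name `explain_status` and the message layout are assumptions made for this example.

```python
# Status codes documented above; anything else is reported as unexpected.
STATUS_MEANINGS = {
    400: "Bad Request",
    401: "Unauthorized",
    413: "Request too large",
}


def explain_status(status, body_text=""):
    """Return None on success, otherwise a short message for logging."""
    if status == 200:
        return None
    summary = STATUS_MEANINGS.get(status, "Unexpected HTTP status")
    detail = body_text.strip()  # the response body carries the detailed message
    return f"{status} {summary}: {detail}" if detail else f"{status} {summary}"


print(explain_status(401))
print(explain_status(413, "document exceeds the size limit"))
```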

Best Practices

TextRazor was designed to work out of the box with a wide range of content types. However, there are several steps you can take to improve results for your specific application:

  • Experiment with different confidence score thresholds. Where possible, TextRazor will return scores for each of its annotations representing the amount of confidence the engine has in the result. If you prefer to avoid false-positives in your application you may want to ignore results below a certain threshold. The best way to find an appropriate threshold is to run a sample set of your documents through the system and manually inspect the results.
  • TextRazor's algorithms use the whole context of your document to understand its contents and disambiguate its analysis. Overall accuracy of the engine may be improved if long documents with multiple different themes are split up.
  • On the other hand, if you have numerous small pieces of content that are likely related, you may get better results by concatenating them before calling the API. This may be the case where you are analyzing multiple Tweets from one user, for example, or if you are separately analyzing the headline and body of a news story.

Please do not hesitate to contact us for help with getting the most out of the system for your use case.
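One way to experiment with confidence thresholds is to run the same sample of entities through several cut-offs and inspect what survives. The sketch below assumes entities are plain dictionaries shaped like the Entity Object described later on this page; the sample values are made up for illustration.

```python
def filter_by_confidence(entities, threshold):
    """Keep entities whose confidenceScore is at least `threshold`.

    Entities without a score are treated as 0.0, a choice made for this sketch.
    """
    return [e for e in entities if e.get("confidenceScore", 0.0) >= threshold]


# Made-up entities; real scores come from the API response.
sample_entities = [
    {"matchedText": "Bankia", "confidenceScore": 8.2},
    {"matchedText": "Iberia", "confidenceScore": 4.1},
    {"matchedText": "holdings", "confidenceScore": 0.9},
]

for threshold in (1.0, 5.0):
    kept = filter_by_confidence(sample_entities, threshold)
    print(threshold, [e["matchedText"] for e in kept])
```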

API Reference

Analysis

POST /

Main endpoint for all analysis functions.

TextRazor expects all analysis POST requests to use "application/x-www-form-urlencoded" encoding, as used by forms on the web. Requests should include at least a "url" or "text" parameter, as well as one or more "extractors", which control the operations that TextRazor performs on your document.

Requests should also always include the "X-TextRazor-Key: YOUR_API_KEY" HTTP header to identify your account.

Analysis Options

text
Up to 200kb of UTF-8 encoded raw text to be analyzed. Either "text" or "url" must be part of your request.
url
BETA
The publicly accessible URL to fetch content from. Either "text" or "url" must be part of your request.
extractors

Sets a list of “Extractors”, which tells TextRazor which analysis functions to perform on your text. For optimal performance, only select the extractors that are explicitly required by your application.

Valid Options: entities, topics, words, phrases, dependency-trees, relations, entailments, senses
rules
String containing Prolog logic. All rules matching an extractor name listed in the request will be evaluated and all matching param combinations linked in the response.
cleanupHTML
Deprecated - Please see cleanup.mode
When True, input text is treated as raw HTML and will be cleaned of tags, comments, scripts, and boilerplate content. When this option is enabled, the cleanedText property is returned with the text content, providing access to the raw filtered text. When enabled, position offsets returned in individual words apply to the clean text, not the provided HTML.
cleanup.mode

Controls the preprocessing cleanup mode that TextRazor will apply to your content before analysis. For all options aside from "raw", any position offsets returned will apply to the final cleaned text, not the raw HTML. If the cleaned text is required, please see the cleanup.returnCleaned option.

Valid Options:
raw
Content is analyzed "as-is", with no preprocessing.
stripTags
All tags are removed from the document prior to analysis. This removes all HTML and XML tags, but the content of headings and menus remains. This is a good option for analysis of HTML pages that aren't long-form documents.
cleanHTML
Boilerplate HTML is removed prior to analysis, including tags, comments, menus, leaving only the body of the article.
cleanup.returnCleaned
When True, the TextRazor response will contain the cleanedText property, the text it analyzed after preprocessing. To save bandwidth, only set this to True if you need it in your application. Defaults to False.
cleanup.returnRaw
When True, the TextRazor response will contain the rawText property, the original text TextRazor received or downloaded before cleaning. To save bandwidth, only set this to True if you need it in your application. Defaults to False.
cleanup.useMetadata
BETA

When True, TextRazor will use metadata extracted from your document to help in the disambiguation/extraction process. This includes HTML titles and metadata, and can significantly improve results for shorter documents without much other content.

This option has no effect when cleanup.mode is 'raw'. Defaults to True.

download.userAgent
BETA

Sets the User-Agent header to be used when downloading over HTTP. This should be a descriptive string identifying your application, or an end user's browser user agent if you are performing live requests from a given user.

Defaults to "TextRazor Downloader (https://www.textrazor.com)"

languageOverride
When set to an ISO-639-2 language code, forces TextRazor to analyze content with this language. If not set, TextRazor will use the automatically identified language.
entities.dictionaries
BETA

Sets a list of the custom entity dictionaries to match against your content. Each item should be a string ID corresponding to dictionaries you have previously configured through the DictionaryManager interface.

entities.filterDbpediaTypes
List of DBPedia types. All returned entities must match at least one of these types. For more information on TextRazor's type filtering, see http://www.textrazor.com/types. To account for inconsistencies in DBPedia and Freebase type information we recommend you filter on multiple types across both sources where possible.
entities.filterFreebaseTypes
List of Freebase types. All returned entities must match at least one of these types. For more information on TextRazor's type filtering, see http://www.textrazor.com/types. To account for inconsistencies in DBPedia and Freebase type information we recommend you filter on multiple types across both sources where possible.
entities.allowOverlap
When True, entities in the response may overlap. When False, the "best" entity is found such that none overlap. Defaults to True.
entities.enrichmentQueries
Set a list of "Enrichment Queries", used to enrich the entity response with structured linked data. The syntax for these queries is documented at https://www.textrazor.com/enrichment
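The options above can be combined into a single form-encoded body. The following is a minimal sketch, not an official client: `build_form_body` is a hypothetical helper, "200kb" is assumed to mean 200 * 1024 bytes, and list-valued and boolean options are assumed to be sent as comma-separated strings and "true"/"false" respectively (only the comma-separated form for extractors is shown on this page).

```python
import urllib.parse

MAX_TEXT_BYTES = 200 * 1024  # assumption: "200kb" means 200 * 1024 bytes


def build_form_body(text=None, url=None, extractors=(), **options):
    """Encode analysis options as an application/x-www-form-urlencoded string."""
    if text is None and url is None:
        raise ValueError("either 'text' or 'url' must be provided")
    if not extractors:
        raise ValueError("at least one extractor is required")
    if text is not None and len(text.encode("utf-8")) > MAX_TEXT_BYTES:
        raise ValueError("text exceeds the 200kb request limit")

    fields = {"extractors": ",".join(extractors)}
    if text is not None:
        fields["text"] = text
    if url is not None:
        fields["url"] = url
    for name, value in options.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumed boolean spelling
        elif isinstance(value, (list, tuple)):
            value = ",".join(value)  # assumed list encoding
        fields[name] = value
    return urllib.parse.urlencode(fields)


body = build_form_body(
    text="<p>Bankia expects to sell its stake.</p>",
    extractors=["entities", "topics"],
    # Option names contain dots, so they are passed through a dict.
    **{"cleanup.mode": "stripTags", "entities.allowOverlap": False},
)
print(body)
```

Validating the size and required fields locally avoids spending a request on input that would be rejected with a 400 or 413.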

Response Object

time
Total time in seconds TextRazor took to process this request. This does not include any time spent sending or receiving the request/response.
response
The output of the requested operation.
ok
True if TextRazor successfully analyzed your document, False if there was some error.
error
Descriptive error message of any problems that may have occurred during analysis, or an empty string if there was no error.
message
Any warning or informational messages returned from the server, or an empty string if there was no message.
customAnnotationOutput
Any output generated while running the embedded Prolog engine on your custom rules.
cleanedText
The processed raw text, only when cleanup.returnCleaned is enabled. When enabled, position offsets returned in individual words apply to the clean text, not the provided HTML.
rawText
The raw text, only when cleanup.returnRaw is enabled.
entailments
List of all Entailment objects across all sentences in the response.
entities
List of all Entity objects across all sentences in the response.
coarseTopics
List of all the coarse Topic objects in the response.
topics
List of all Topic objects in the response.
nounPhrases
List of all NounPhrase objects in the response.
properties
List of all Property objects across all sentences in the response.
relations
List of all Relation objects across all sentences in the response.
sentences
List of all Sentence objects in the response.
matchingRules
List of all the rules that matched the request text.
language
The ISO-639-2 language used to analyze this document, either explicitly provided as the languageOverride, or as detected by the language detector.
languageIsReliable
Boolean indicating whether the language detector was confident of its classification. This may be false for shorter or ambiguous content.

Example

{
    "time":"TIME",
    "response":{
        "cleanedText":"CLEANED_TEXT",
        "sentences":[SENTENCE_JSON, ...],
        "entities":[ENTITY_JSON, ...],
        "topics":[TOPIC_JSON, ...],
        "coarseTopics":[TOPIC_JSON, ...],
        "entailments":[ENTAILMENT_JSON, ...],
        "relations":[RELATION_JSON, ...],
        "properties":[PROPERTY_JSON, ...],
        "nounPhrases":[NOUNPHRASE_JSON, ...]
    }
}
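A response of this shape could be unpacked along the following lines. This is a sketch under two assumptions: that "ok" and "error" sit at the top level beside "time" and "response", as the field list suggests, and that a missing "ok" field is treated as success. The sample payload is invented for the example.

```python
import json


def parse_response(body_text):
    """Return the inner "response" object, raising if TextRazor reported a failure."""
    payload = json.loads(body_text)
    if not payload.get("ok", True):  # assumption: a missing "ok" counts as success
        raise RuntimeError(payload.get("error") or "TextRazor reported an error")
    return payload.get("response", {})


# Invented payload for illustration only.
sample = (
    '{"time": 0.01, "ok": true, "response": {"entities": '
    '[{"entityId": "Bankia", "matchedText": "Bankia"}]}}'
)

response = parse_response(sample)
for entity in response.get("entities", []):
    print(entity["entityId"], entity["matchedText"])
```

Using `.get` with an empty default for each list means extractors that were not requested simply yield no items.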

Entity Object

entityId
The disambiguated ID for this entity, or None if this entity could not be disambiguated. This ID is from the localized Wikipedia for this document's language.
entityEnglishId
The disambiguated entityId in the English Wikipedia, where a link between localized and English ID could be found. None if either the entity could not be linked, or where a language link did not exist.
customEntityId
The custom entity DictionaryEntry ID that matched this Entity, if this entity was matched in a custom dictionary.
confidenceScore
The confidence that TextRazor is correct that this is a valid entity. TextRazor uses an ever-increasing number of signals to help spot valid entities, all of which contribute to this score. These include the semantic agreement between the context in the source text and our knowledgebase, compatibility between other entities in the text, compatibility between the expected entity type and context, and prior probabilities of having seen this entity across Wikipedia and other web datasets. The score ranges from 0.5 to 10, with 10 representing the highest confidence that this is a valid entity.
type
List of DBPedia types for this entity, or an empty list if there are none.
freebaseTypes
List of Freebase types for this entity, or an empty list if there are none.
freebaseId
The disambiguated Freebase ID for this entity, or None if either this entity could not be disambiguated, or a Freebase link doesn’t exist.
matchingTokens
List of the token positions in the current sentence that make up this entity.
matchedText
Source text string that matched this entity.
data
Dictionary containing enriched data found for this entity.
relevanceScore
Relevance this entity has to the source text. This is a float on a scale of 0 to 1, with 1 being the most relevant. Relevance is determined by the contextual similarity between the entity's context and facts in the TextRazor knowledgebase.
wikiLink
Link to Wikipedia for this entity, or None if either this entity could not be disambiguated or a Wikipedia link doesn’t exist.

Represents a single “Named Entity” extracted from text.

Each entity is disambiguated to Wikipedia and Freebase concepts wherever possible. Where the entity could not be linked the relevant properties will return None.

Request the "entities" extractor for this object.

Scores

Entities are returned with both Confidence and Relevance scores when possible. These measure slightly different things. The confidence score is a measure of the engine's confidence that the entity is a valid entity given the document context, whereas the relevance score measures how on-topic or important that entity is to the document. As an example, in a news story that mentions "Barack Obama" in passing, the engine would assign high confidence to the "President" entity. If the story isn't about politics, however, the same entity might have a low relevance score.

Scores can vary if the same entity is mentioned more than once. As an entity is mentioned in different contexts the engine will report different scores.
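Because each mention carries its own scores, an application that wants one row per entity has to collapse the mentions itself. The sketch below keeps the highest confidence and relevance seen for each entityId; taking the maximum is a design choice of this example rather than something the API prescribes, and the sample values are made up.

```python
def summarize_mentions(entities):
    """Collapse repeated mentions into one summary row per entityId."""
    summary = {}
    for entity in entities:
        entity_id = entity.get("entityId")
        if entity_id is None:
            continue  # mentions that were not disambiguated cannot be grouped
        row = summary.setdefault(
            entity_id,
            {"mentions": 0, "confidenceScore": 0.0, "relevanceScore": 0.0},
        )
        row["mentions"] += 1
        row["confidenceScore"] = max(row["confidenceScore"], entity.get("confidenceScore", 0.0))
        row["relevanceScore"] = max(row["relevanceScore"], entity.get("relevanceScore", 0.0))
    return summary


# Made-up mentions for illustration.
mentions = [
    {"entityId": "Barack Obama", "confidenceScore": 6.0, "relevanceScore": 0.1},
    {"entityId": "Barack Obama", "confidenceScore": 7.5, "relevanceScore": 0.2},
    {"entityId": None, "confidenceScore": 1.0, "relevanceScore": 0.0},
]
print(summarize_mentions(mentions))
```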

Topic Object

id
The unique id of this annotation within its annotation set.
label
Label for this topic.
score
The relevance of this topic to the processed document. This score ranges from 0 to 1, with 1 representing the highest relevance of the topic to the processed document.
wikiLink
Link to Wikipedia for this topic, or None if this topic couldn't be linked to a Wikipedia page.

Represents a single “Topic” extracted from text.

Request the "topics" extractor for this object.

Entailment Object

contextScore
Score representing agreement between the source word’s usage in this sentence and the entailed word's usage in our knowledgebase.
entailedTree
Tree containing the entailed word structure. Note - currently TextRazor only returns a single entailed word, so this tree will only contain one leaf.
wordPositions
The token positions in the current sentence that generated this entailment.
priorScore
The score of this entailment, independent of the context in which it is used in this sentence.
score
The overall confidence that TextRazor is correct that this is a valid entailment, a combination of the prior and context score.

Represents a single “entailment” derived from the source text.

Please note - If you need the source word for each Entailment you must request the "words" extractor.

Request the "entailments" extractor for this object.

RelationParam Object

wordPositions
List of the positions of the words in this param within their sentence.
relation
Relation of this param to the predicate.
Valid Options: SUBJECT, OBJECT, OTHER

Represents a Param to a specific Relation.

Request the "relations" extractor for this object.

NounPhrase Object

wordPositions
List of the positions of the words in this phrase within their sentence.

Represents a multi-word phrase extracted from a sentence.

Request the "phrases" extractor for this object.

Word Links

To extract the full text of the noun phrase from the original content you must add the "words" extractor, and use the word offsets to recreate the original string.
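A minimal sketch of that reconstruction follows. It makes two assumptions that are not stated on this page: that word "position" values are unique across all sentences in the response, and that endingPos is exclusive, so it can be used directly as a Python slice bound. The helper names and sample data are illustrative.

```python
def index_words(sentences):
    """Map each word's position to its Word object across all sentences."""
    return {w["position"]: w for s in sentences for w in s["words"]}


def phrase_text(original_text, word_positions, words_by_position):
    """Slice the original text from the first word's start to the last word's end."""
    words = [words_by_position[p] for p in word_positions]
    start = min(w["startingPos"] for w in words)
    end = max(w["endingPos"] for w in words)  # assumed to be exclusive
    return original_text[start:end]


# Hand-built sample mirroring the Word Object fields.
text = "stricken Bankia expects"
sentences = [{"words": [
    {"position": 0, "token": "stricken", "startingPos": 0, "endingPos": 8},
    {"position": 1, "token": "Bankia", "startingPos": 9, "endingPos": 15},
    {"position": 2, "token": "expects", "startingPos": 16, "endingPos": 23},
]}]
noun_phrase = {"wordPositions": [0, 1]}

lookup = index_words(sentences)
print(phrase_text(text, noun_phrase["wordPositions"], lookup))
```

Slicing the original text preserves the exact spacing and punctuation between tokens, which joining tokens with spaces would not.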

Property Object

wordPositions
List of the positions of the words in the predicate (or focus) of this property.
propertyPositions
List of the positions of the words that make up the property itself.

Represents a property relation extracted from raw text. A property implies an “is-a” or “has-a” relationship between the predicate (or focus) and its property.

Request the "relations" extractor for this object.

Relation Object

params
List of the RelationParam objects of this relation.
wordPositions
List of the positions of the predicate words in this relation within their sentence.

Represents a grammatical relation between words. Typically owns a number of RelationParam objects, representing the SUBJECT and OBJECT of the relation.

Request the "relations" extractor for this object.

Word Links

To extract the full text of the relation predicate or param from the original content you must add the "words" extractor, and use the word offsets to recreate the original string.
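For relations, a simpler variant is to join the matching tokens rather than slice the source text. The sketch below assumes a lookup from word position to Word object has already been built from the "words" extractor output; `relation_parts` is a hypothetical helper and the sample data is invented.

```python
def relation_parts(relation, words_by_position):
    """Return the predicate and its params as readable strings."""
    def join(positions):
        return " ".join(words_by_position[p]["token"] for p in sorted(positions))

    parts = {"SUBJECT": [], "OBJECT": [], "OTHER": []}
    for param in relation["params"]:
        parts[param["relation"]].append(join(param["wordPositions"]))
    return {
        "predicate": join(relation["wordPositions"]),
        "subject": parts["SUBJECT"],
        "object": parts["OBJECT"],
        "other": parts["OTHER"],
    }


# Invented words and relation for illustration.
words_by_position = {
    0: {"token": "Bankia"},
    1: {"token": "expects"},
    2: {"token": "losses"},
}
relation = {
    "wordPositions": [1],
    "params": [
        {"relation": "SUBJECT", "wordPositions": [0]},
        {"relation": "OBJECT", "wordPositions": [2]},
    ],
}
print(relation_parts(relation, words_by_position))
```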

Word Object

endingPos
End offset in the input text for this token. Note that this offset applies to the original Unicode string passed in to the API; TextRazor treats multi-byte UTF-8 characters as a single position.
startingPos
Start offset in the input text for this token. Note that this offset applies to the original Unicode string passed in to the API; TextRazor treats multi-byte UTF-8 characters as a single position.
lemma
Morphological root of this word, see http://en.wikipedia.org/wiki/Lemma_(morphology) for details.
parentPosition
Position of the grammatical parent of this word, or None if this word is either at the root of the sentence or the “dependency-trees” extractor was not requested.
partOfSpeech
Part of Speech that applies to this word. We use the Penn treebank tagset, as detailed here: http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
senses
List of (synset, score) tuples representing scores of each WordNet sense that this word may be a part of.
position
Position of this word in its sentence.
relationToParent
Grammatical relation between this word and its parent, or None if this word is either at the root of the sentence or the “dependency-trees” extractor was not requested. TextRazor parses into the Stanford uncollapsed dependencies, as detailed at: http://nlp.stanford.edu/software/dependencies_manual.pdf
stem
Stem of this word.
token
Raw token string that matched this word in the source text.

Represents a single Word (token) extracted by TextRazor.

Request the "words" extractor for this object.
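The parentPosition and relationToParent fields are enough to rebuild a sentence's dependency tree. The following sketch groups words under their parents and renders an indented outline; it assumes a root word has a missing or null parentPosition, and the sample words and relation labels are hand-written for illustration.

```python
def children_by_parent(words):
    """Group word positions under their parentPosition; roots are stored under None."""
    tree = {}
    for word in words:
        tree.setdefault(word.get("parentPosition"), []).append(word["position"])
    return tree


def render(words, tree, parent=None, depth=0):
    """Return indented lines, one per word, walking down from the root."""
    by_position = {w["position"]: w for w in words}
    lines = []
    for position in tree.get(parent, []):
        word = by_position[position]
        label = word.get("relationToParent") or "root"
        lines.append("  " * depth + f'{word["token"]} ({label})')
        lines.extend(render(words, tree, position, depth + 1))
    return lines


# Hand-written sample for one sentence.
words = [
    {"position": 0, "token": "Bankia", "parentPosition": 1, "relationToParent": "nsubj"},
    {"position": 1, "token": "expects", "parentPosition": None, "relationToParent": None},
    {"position": 2, "token": "losses", "parentPosition": 1, "relationToParent": "dobj"},
]
tree = children_by_parent(words)
print("\n".join(render(words, tree)))
```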

Sentence Object

words
List of all the Word objects in this sentence.

Represents a single sentence extracted by TextRazor.

DictionaryManager Object

TextRazor Entity Dictionaries allow you to augment the TextRazor entity extraction system with custom entities that are relevant to your application.

Entity Dictionaries are useful for identifying domain specific entities that may not be common enough for TextRazor to know about out of the box - examples might be Product names, Drug names, and specific person names.

TextRazor supports flexible, high performance matching of dictionaries up to several million entries, limited only by your account plan. Entries are automatically indexed and distributed across our analysis infrastructure to ensure they scale with your application.

Once you have created a dictionary, add its ID to your analysis requests with the 'entities.dictionaries' option. TextRazor will look for any DictionaryEntry in the dictionary that can be matched to your document, and return it as part of the standard Entity response.

All match text, and other properties, are expected to be valid UTF-8. Dictionary Entries can match any UTF-8 character.

Actions

PUT BETA /entities/{ DICTIONARY_ID }

Creates a new dictionary using properties provided in the JSON body.

See the properties of class Dictionary for valid options.

curl -XPUT -H 'x-textrazor-key: YOUR_API_KEY' \
    -d '{"matchType":"token", "caseInsensitive":true, "language":"eng"}' \
    https://api.textrazor.com/entities/test_ents

GET BETA /entities/

Returns a list of all Dictionary objects in your account.

curl -H 'x-textrazor-key: YOUR_API_KEY' \
    https://api.textrazor.com/entities/

GET BETA /entities/{ DICTIONARY_ID }

Returns a Dictionary object by id.

curl -H 'x-textrazor-key: YOUR_API_KEY' \
    https://api.textrazor.com/entities/test_ents

DELETE BETA /entities/{ DICTIONARY_ID }

Deletes a dictionary and all its entries by id.

curl -XDELETE -H 'x-textrazor-key: YOUR_API_KEY' \
    https://api.textrazor.com/entities/test_ents

GET BETA /entities/{ DICTIONARY_ID }/_all?limit=20&offset=0

Returns all entities in this dictionary, along with paging information.

Larger dictionaries can be too large to download all at once. Where possible, it is recommended that you use the limit and offset parameters to control the TextRazor response, rather than filtering client side.
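The limit and offset parameters lend themselves to a simple paging loop. The sketch below is transport-agnostic: `fetch_page` is a caller-supplied, hypothetical function that performs the GET request and returns the list of entries for one page. Stopping when a page comes back shorter than `limit` is an assumption of this example, since the shape of the paging information in the response is not described here.

```python
def iter_entries(fetch_page, limit=20):
    """Yield every entry by requesting successive pages of `limit` items."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:  # a short (or empty) page means nothing is left
            break
        offset += limit


# Stand-in for the real HTTP call, backed by an in-memory list.
fake_entries = [{"id": f"E{i}"} for i in range(45)]


def fake_fetch(limit, offset):
    return fake_entries[offset:offset + limit]


print(sum(1 for _ in iter_entries(fake_fetch, limit=20)))
```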

curl -H 'x-textrazor-key: YOUR_API_KEY' \
    "https://api.textrazor.com/entities/test_ents/_all?limit=20&offset=0"

POST BETA /entities/{ DICTIONARY_ID }/

Adds entries to this dictionary.

Entries must be a JSON array of objects corresponding to properties of the new DictionaryEntry objects. At a minimum this would be [{"text":"test text to match"}].

Each entry assigned to a Dictionary is uniquely identified by the 'id' property. Where this is not provided a unique id is automatically generated by the server. When an entity is added with an id that matches an existing entity, the old entity is overwritten. This behaviour allows you to keep your entity dictionaries up-to-date without any downtime from deleting and recreating dictionaries.
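A request body for this endpoint could be assembled as follows. This is only a sketch: `entries_payload` is a hypothetical helper, and the check for a non-empty 'text' reflects the minimum entry described above rather than any documented server-side validation.

```python
import json


def entries_payload(entries):
    """Serialize DictionaryEntry dicts to the JSON array sent in the POST body."""
    for entry in entries:
        if not entry.get("text"):
            raise ValueError("each entry needs a non-empty 'text' property")
    return json.dumps(entries)


payload = entries_payload([
    # Re-posting with the same id would overwrite the earlier entry.
    {"id": "DEV2", "text": "Bjarne Stroustrup", "data": {"type": ["person"]}},
    # No id here, so the server would generate one.
    {"text": "Guido van Rossum"},
])
print(payload)
```

Supplying your own ids makes later updates idempotent, since posting the same id again replaces the entry instead of adding a duplicate.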

curl -XPOST -H 'x-textrazor-key: YOUR_API_KEY' \
    -d "[{\"text\":\"Bjarne Stroustrup\",\"id\":\"DEV2\"}]" \
    https://api.textrazor.com/entities/test_ents/

GET BETA /entities/{ DICTIONARY_ID }/{ ENTITY_ID }

Gets a specific entity by id.

curl -H 'x-textrazor-key: YOUR_API_KEY' \
    https://api.textrazor.com/entities/test_ents/DEV2

DELETE BETA /entities/{ DICTIONARY_ID }/{ ENTITY_ID }

Deletes a specific entity by id.

For performance reasons it's always better to delete and recreate a whole dictionary rather than its individual entries one at a time.

curl -XDELETE -H 'x-textrazor-key: YOUR_API_KEY' \
    https://api.textrazor.com/entities/test_ents/DEV2

Limits

Entity Dictionaries are currently in Beta. During the beta period, users on any of our paid plans can create up to 10 dictionaries, with a total of 10,000 entries. TextRazor supports custom dictionaries of millions of entries; please contact us to discuss increasing this limit for your account.

Free account holders are able to create 1 Dictionary with a total of 50 Entries.

Dictionary Object

matchType
BETA

Controls any pre-processing done on your dictionary before matching.

Defaults to 'token'.

Valid Options:
stem
Words are split and "stemmed" before matching, resulting in a more relaxed match. This is an easy way to match plurals - love, loved, loves will all match the same dictionary entry. This implicitly sets "caseInsensitive" to True.
token
Words are split and matched literally.
caseInsensitive
BETA

When True, this dictionary will match both uppercase and lowercase characters.

Defaults to 'True'

id
BETA

The unique identifier for this dictionary.

language
BETA

When set to an ISO-639-2 language code, this dictionary will only match documents of the corresponding language.

When set to 'any', this dictionary will match any document.

Defaults to 'any'

Represents a single Dictionary, uniquely identified by an id. Each Dictionary owns a set of DictionaryEntry objects.

Dictionary and DictionaryEntry can only be manipulated through the DictionaryManager object.

DictionaryEntry Object

Represents a single dictionary entry, belonging to a Dictionary object.

id
BETA

Unique ID for this entry, used to identify and manipulate specific entries.

Defaults to an automatically generated unique id.

text
BETA

String representing the text to match to this DictionaryEntry.

data
BETA

A dictionary mapping string keys to lists of string data values. Where TextRazor matches this entry to your content in analysis, it will return the dictionary as part of the entity response.

This is useful for adding application-specific metadata to each entry.

{"type":["people", "person", "politician"]}