TextRazor/HTTP

TextRazor is a REST API and can be called from any language that can send an HTTP request. We also provide several language-specific SDKs. HTTP POST requests should be sent to the following endpoint:

http://api.textrazor.com/

We also offer a secure SSL endpoint at:

https://api.textrazor.com/

All responses are provided in JSON.

Each request must provide the "text" param, the "apiKey" param to identify your application, and the "extractors" param to tell TextRazor which extraction operations you'd like to apply to your text. Each extractor then has further options to help customize the extraction process.

Compression

To enable compression of the TextRazor response JSON, add the Accept-Encoding: gzip header to your request. This can significantly reduce the size of TextRazor responses and is recommended for most requests.
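
As a rough client-side sketch, the Python below builds the header and decompresses a body only when the response is marked as gzip-encoded. The helper names (gzip_request_headers, decode_body) are hypothetical, and the sketch assumes the server announces compression through the standard Content-Encoding response header.

```python
import gzip
import json


def gzip_request_headers():
    """Headers asking the server for a gzip-compressed response."""
    return {"Accept-Encoding": "gzip"}


def decode_body(raw_body, content_encoding=None):
    """Turn raw response bytes into a dict, unzipping first if needed.

    Assumption: compression is signalled by a Content-Encoding header
    whose value is "gzip".
    """
    if content_encoding and content_encoding.strip().lower() == "gzip":
        raw_body = gzip.decompress(raw_body)
    return json.loads(raw_body.decode("utf-8"))
```

Many HTTP client libraries handle this step transparently, in which case only the request header is needed.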

Errors

TextRazor returns HTTP errors if there is a problem processing your text. A more detailed message is provided in the response body.

200: OK

Input was OK.

400: Bad Request

The request was in an invalid format.

401: Unauthorized

The API key was invalid or has no requests left.

413: Request Too Large

The request was too large (up to 200kb may be processed per request).
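
One way a client might act on these codes is sketched below in Python. The exception class and function names are hypothetical, not part of any TextRazor SDK; only the status codes and their meanings come from the list above.

```python
# Status codes and meanings as listed in this section.
STATUS_MEANINGS = {
    200: "Input was OK.",
    400: "The request was in an invalid format.",
    401: "The API key was invalid or has no requests left.",
    413: "The request was too large.",
}


class TextRazorError(Exception):
    """Hypothetical exception type for non-200 responses."""

    def __init__(self, status, detail=""):
        self.status = status
        self.detail = detail
        summary = STATUS_MEANINGS.get(status, "Unexpected status.")
        super().__init__(f"HTTP {status}: {summary} {detail}".strip())


def check_status(status, body_text=""):
    """Raise TextRazorError unless the status is 200.

    body_text is the response body, which carries the detailed message.
    """
    if status != 200:
        raise TextRazorError(status, body_text)
```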

Example

curl -X POST \
	-d "apiKey=DEMO" \
	-d "extractors=entities,entailments" \
	-d "text=Spain's stricken Bankia expects to sell off its vast portfolio of industrial holdings that includes a stake in the parent company of British Airways and Iberia." \
	http://api.textrazor.com/
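
For comparison, a roughly equivalent request can be assembled with the Python standard library. This is a sketch rather than an official client: build_request is a hypothetical helper, and it only builds the request so that nothing is sent until you choose to.

```python
import urllib.parse
import urllib.request

ENDPOINT = "https://api.textrazor.com/"  # the SSL endpoint listed above


def build_request(api_key, text, extractors):
    """Build (but do not send) a POST request with the three required params."""
    body = urllib.parse.urlencode({
        "apiKey": api_key,
        "extractors": ",".join(extractors),  # comma separated, as in the curl call
        "text": text,
    }).encode("utf-8")
    return urllib.request.Request(ENDPOINT, data=body, method="POST")


request = build_request(
    "DEMO",
    "Spain's stricken Bankia expects to sell off its vast portfolio of industrial holdings.",
    ["entities", "entailments"],
)
# To actually send it: urllib.request.urlopen(request).read()
```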
				

Request Options

text

Up to 200kb of UTF-8 encoded raw text to be analyzed.
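
Because the limit is expressed in bytes of UTF-8 rather than characters, non-ASCII text reaches it sooner than its character count suggests. A minimal Python check, assuming "200kb" means 200 * 1024 bytes and that the limit applies to the text alone (neither is confirmed here), might look like this:

```python
MAX_BYTES = 200 * 1024  # assumption: "200kb" read as 200 * 1024 bytes


def utf8_size(text):
    """Number of bytes the text occupies once UTF-8 encoded."""
    return len(text.encode("utf-8"))


def fits_in_one_request(text, limit=MAX_BYTES):
    """True if the encoded text is within the assumed size limit."""
    return utf8_size(text) <= limit
```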

apiKey

Your TextRazor API key, get one here.

extractors

Valid Options: entities, topics, words, phrases, dependency-trees, relations, entailments, senses

A single request can perform multiple extraction operations. To reduce bandwidth and processing time you should only request the operations explicitly required by your application.

languageOverride

When set to an ISO-639-2 language code, forces TextRazor to analyze content in this language. If not set, TextRazor uses the automatically identified language.

cleanupHTML

Valid Options: True, False

When True, input text is treated as raw HTML and is cleaned of tags, comments, scripts, and boilerplate content. When this option is enabled, the cleanedText property is returned with the filtered text content, and position offsets returned in individual words apply to the cleaned text, not the provided HTML.

entities.filterDbpediaTypes

A comma-separated list of DBPedia types to filter entity extraction on. All returned entities must match at least one of these types. See the Type Dictionary for more details on supported types.

entities.filterFreebaseTypes

A comma-separated list of Freebase types to filter entity extraction on. All returned entities must match at least one of these types. See the Type Dictionary for more details on supported types.
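
The options above can be gathered into a single set of form parameters. The Python sketch below shows one way to do that. build_params and its keyword arguments are hypothetical names; only the parameter names written into the dictionary come from this page.

```python
def build_params(api_key, text, extractors, language_override=None,
                 cleanup_html=None, dbpedia_types=None, freebase_types=None):
    """Assemble request params, leaving out options that were not set."""
    params = {
        "apiKey": api_key,
        "text": text,
        "extractors": ",".join(extractors),
    }
    if language_override is not None:
        params["languageOverride"] = language_override
    if cleanup_html is not None:
        # The documented values are the literal strings True and False.
        params["cleanupHTML"] = "True" if cleanup_html else "False"
    if dbpedia_types:
        params["entities.filterDbpediaTypes"] = ",".join(dbpedia_types)
    if freebase_types:
        params["entities.filterFreebaseTypes"] = ",".join(freebase_types)
    return params
```

Leaving unset options out of the request, rather than sending empty values, lets the documented defaults apply, such as automatic language detection.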

Response Format

TextRazor returns JSON for all queries. The response contains the extracted data as requested by the extractors of the request, as well as an array of sentences.

{
    "time":"TIME",
    "response":{
        "cleanedText":"CLEANED_TEXT",
        "sentences":[SENTENCE_JSON, ...],
        "entities":[ENTITY_JSON, ...],
        "topics":[TOPIC_JSON, ...],
        "coarseTopics":[TOPIC_JSON, ...],
        "words":[WORD_JSON, ...],
        "entailments":[ENTAILMENT_JSON, ...],
        "relations":[RELATION_JSON, ...],
        "properties":[PROPERTY_JSON, ...],
        "nounPhrases":[NOUNPHRASE_JSON, ...]
    }
}

time

Total time in seconds TextRazor took to process this request. This does not include any time spent sending or receiving the request/response.

response

The main response.

cleanedText

When cleanupHTML is requested, returns the input text after filtering.

language

The ISO-639-2 language used to analyze this document, either explicitly provided as the languageOverride, or as detected by the language detector.

languageIsReliable

Boolean indicating whether the language detector was confident of its classification. This may be false for shorter or ambiguous content.
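
A short Python sketch of reading these top-level fields follows. It rests on two assumptions that this page does not spell out: that "time" can be parsed as a number, and that language and languageIsReliable sit inside the "response" object. summarize_response is a hypothetical helper name.

```python
import json


def summarize_response(raw_json):
    """Pull the top-level fields out of a TextRazor response string."""
    data = json.loads(raw_json)
    response = data.get("response", {})
    return {
        "seconds": float(data["time"]),  # assumption: numeric or numeric string
        "language": response.get("language"),
        "language_is_reliable": response.get("languageIsReliable"),
        # Only present when cleanupHTML was requested.
        "cleaned_text": response.get("cleanedText"),
        "sentence_count": len(response.get("sentences", [])),
    }
```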

Entities

Requires the "entities" extractor to be added to the TextRazor request.

entityId

The disambiguated ID for this entity, or None if this entity could not be disambiguated. This ID is from the localized Wikipedia for this document's language.

entityEnglishId

The disambiguated entityId in the English Wikipedia, where a link between localized and English ID could be found. None if either the entity could not be linked, or where a language link did not exist.

freebaseId

The disambiguated Freebase ID for this entity, or None if either this entity could not be disambiguated, or a Freebase link doesn't exist.

wikiLink

Link to Wikipedia for this entity, or None if either this entity could not be disambiguated or a Wikipedia link doesn't exist.

matchedText

The source text string that matched this entity.

matchingTokens

Array of the token positions in the current sentence that make up this entity.

freebaseTypes

Array of Freebase types for this entity, or an empty array if there are none.

type

Array of DBPedia types for this entity, or an empty array if there are none.

relevanceScore

The relevance this entity has to the source text. This is a float on a scale of 0 to 1, with 1 being the most relevant. Relevance is determined by the contextual similarity between the entity's context and facts in the TextRazor knowledgebase.

confidenceScore

The confidence that TextRazor is correct that this is a valid entity. TextRazor uses an ever-increasing number of signals to help spot valid entities, all of which contribute to this score. These include the contextual agreement between the words in the source text and our knowledgebase, agreement between other entities in the text, agreement between the expected entity type and context, and prior probabilities of having seen this entity across Wikipedia and other web datasets. The score ranges from 0.5 to 10, with 10 representing the highest confidence that this is a valid entity.
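
The two scores answer different questions: confidenceScore says how likely it is that the entity is valid, and relevanceScore says how much it matters to the text. A common pattern, sketched below in Python, is to filter on the first and sort on the second. The function name and the default threshold of 5.0 are illustrative assumptions, not recommended values.

```python
def confident_entities(entities, min_confidence=5.0):
    """Keep entities at or above a confidence threshold, most relevant first.

    entities is a list of entity dicts as returned in the response.
    """
    kept = [e for e in entities if e.get("confidenceScore", 0.0) >= min_confidence]
    return sorted(kept, key=lambda e: e.get("relevanceScore", 0.0), reverse=True)
```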

Topics

Requires the "topics" extractor to be added to the TextRazor request.

label

The textual label for this topic.

score

The relevance of this topic to the processed document. This score ranges from 0 to 1, with 1 representing the highest relevance of the topic to the processed document.

wikiLink

Link to the Wikipedia page for this topic, or empty if there is no relevant link for this topic.

Coarse Topics

Requires the "topics" extractor to be added to the TextRazor request.

label

The textual label for this topic.

score

The relevance of this topic to the processed document. This score ranges from 0 to 1, with 1 representing the highest relevance of the topic to the processed document.

wikiLink

Link to the Wikipedia page for this topic, or empty if there is no relevant link for this topic.
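
Since topics and coarse topics share the same three fields, one helper can rank either list. The Python sketch below uses a hypothetical function name and arbitrary defaults.

```python
def top_topics(topics, limit=5, min_score=0.0):
    """Return (label, score) pairs for the highest-scoring topics.

    Works for entries from either the "topics" or "coarseTopics" array.
    """
    kept = [t for t in topics if t.get("score", 0.0) >= min_score]
    kept.sort(key=lambda t: t.get("score", 0.0), reverse=True)
    return [(t["label"], t.get("score", 0.0)) for t in kept[:limit]]
```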

Entailment

Requires the "entailments" extractor to be added to the TextRazor request.

wordPositions

The token positions in the current sentence that generated this entailment.

priorScore

The score of this entailment, independent of the context in which it is used in this sentence.

contextScore

The score of agreement between the source word's usage in this sentence and the entailed word's usage in our knowledgebase.

score

The overall confidence that TextRazor is correct that this is a valid entailment, a combination of the prior and context scores.

entailedTree

The words that are entailed by the source words.

Word

Requires the "words" extractor to be added to the TextRazor request.

parentPosition

Returns the position of the grammatical parent of this word, or None if this word is either at the root of the sentence or the "dependency-trees" extractor was not requested.

relationToParent

Returns the grammatical relation between this word and its parent, or None if this word is either at the root of the sentence or the "dependency-trees" extractor was not requested. TextRazor parses into Stanford uncollapsed dependencies.

position

The position of this word in its sentence.
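
Together, position and parentPosition are enough to rebuild the dependency tree of a sentence. The Python sketch below assumes it is given the word dicts of a single sentence and that a root word has parentPosition missing or set to None. children_by_parent is a hypothetical helper name.

```python
def children_by_parent(sentence_words):
    """Map each parent position to the positions of its child words.

    Words with no parent (the sentence root, or every word when
    dependency-trees was not requested) are listed under the key None.
    """
    tree = {}
    for word in sentence_words:
        parent = word.get("parentPosition")
        tree.setdefault(parent, []).append(word["position"])
    return tree
```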

senses

Returns an array of Sense objects, each giving the score that this word belongs to a specific WordNet 3.1 sense. Each object has a "synset" and a "score" property. Requires the "senses" extractor to be added to the request.

stem

The stem of this word.

lemma

Returns the morphological root of this word.

token

The raw token string that matched this word in the source text.

partOfSpeech

The part of speech that applies to this word. We use the Penn Treebank tagset.

startingPos

The start offset in the input text for this token. Note that TextRazor treats multi-byte UTF-8 characters as a single position.

endingPos

The end offset in the input text for this token. Note that TextRazor treats multi-byte UTF-8 characters as a single position.
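
Because each multi-byte character counts as one position, these offsets are character offsets, not byte offsets. In Python, which indexes strings by code point, that makes slicing straightforward, and a byte offset can be derived when needed. The sketch below assumes endingPos is exclusive (one past the last character), which this page does not state. Both function names are hypothetical.

```python
def token_text(text, word):
    """Slice a token out of the analyzed text using its character offsets.

    Assumption: endingPos points one past the token's last character.
    """
    return text[word["startingPos"]:word["endingPos"]]


def byte_offset(text, position):
    """Convert a character position into an offset in the UTF-8 bytes."""
    return len(text[:position].encode("utf-8"))
```

The byte conversion matters when the same text is handled as raw bytes elsewhere, where every non-ASCII character before a token shifts its position.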

Relation

Requires the "relations" extractor to be added to the TextRazor request.

wordPositions

An array of the positions of the predicate words in this relation within their sentence.

params

An array of RelationParam, representing the parameters to this relation.

RelationParam

Requires the "relations" extractor to be added to the TextRazor request. Child of the Relation object.

relation

The relation of this param to the predicate. Possible values: SUBJECT, OBJECT, OTHER.

wordPositions

An array of the positions of the param words in this relation within their sentence.
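
To turn a Relation and its RelationParams back into readable text, the word positions can be looked up among the words of the same sentence. The Python sketch below assumes it receives that sentence's word dicts, each with "position" and "token" as described under Word. The function names are hypothetical.

```python
def words_at(sentence_words, positions):
    """Join the tokens found at the given positions within one sentence."""
    by_position = {w["position"]: w["token"] for w in sentence_words}
    return " ".join(by_position[p] for p in positions if p in by_position)


def describe_relation(sentence_words, relation):
    """Resolve a relation's predicate and params into token strings."""
    return {
        "predicate": words_at(sentence_words, relation["wordPositions"]),
        "params": [
            (param["relation"], words_at(sentence_words, param["wordPositions"]))
            for param in relation.get("params", [])
        ],
    }
```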

NounPhrase

Requires the "relations" extractor to be added to the TextRazor request.

wordPositions

An array of the positions of the words in this phrase within their sentence.

Property

Requires the "relations" extractor to be added to the TextRazor request.

wordPositions

Returns a list of the positions of the words in the predicate (or focus) of this property.

propertyPositions

Returns a list of word positions that make up the modifier of the predicate of this property.