Tailor Made Text Mining

Customization and domain adaptation are often crucial to the development of accurate text analytics applications, but building domain-specific logic directly into an application often leads to complex and unmaintainable code.

TextRazor allows both developers and domain experts to rapidly add problem-specific logic, integrating seamlessly with the TextRazor API and the rest of your application.

Traditional rule-based approaches to text mining are often brittle and overly complex. TextRazor is different. Our statistical machine learning algorithms create a powerful abstraction over which to add your custom logic, making even simple rules robust to changes in sentence structure or language style.

Named entities, topic tags, synonyms, grammatical syntax - TextRazor's extensive set of algorithms can all be combined and filtered with your own keywords and logic into your own customized system.

Prolog

TextRazor uses Prolog as its rules engine. Our system was designed to make common patterns easy to build without technical expertise, while keeping the full power and expressiveness of the Prolog language for more complex tasks. We've extended the standard Prolog interpreter with an extensive library of helper rules.

  • Get started writing powerful extraction rules without any programming knowledge.
  • Intuitive and expressive syntax for pattern matching.
  • Separate text extraction logic from your application, making maintenance and enhancement easy. Allow domain experts complete control over the text analysis aspect of your system.
  • Prune and filter results before they are returned to you, resulting in significant savings in bandwidth and latency.
  • Open source core engine, industry standard rules language.

Custom rules run on our infrastructure in a managed environment, automatically scaled up and down with your requests.

Concepts

Integration with the rest of your project is easy. TextRazor hides all the complex interfaces with the Prolog interpreter, and intelligently binds the links between your rules and other words, entities etc, minimizing boilerplate code.

  • TextRazor processes your document, extracting words, sentences, entities, synonyms and other semantic metadata as required by your rules.
  • These are exposed as Prolog facts.
  • Your rules are parsed, looking for those named in the "extractors" list you've requested. The names of any arguments to the rules are derived automatically where possible.
  • These rules are applied to the facts extracted from your document.
  • Every match is collected, and the type of each parameter is deduced automatically.
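
The steps above can be pictured with a small, self-contained sketch. The facts below are hand-written stand-ins for what TextRazor would derive from a document; the token IDs, the sentence and the rule name mentions_kicking are invented for illustration. Only the final rule is the kind of thing you would write yourself.

```prolog
% Hypothetical facts standing in for TextRazor's output for the
% sentence "He kicked the ball". IDs and values are made up.
token(1, 'He').
token(2, 'kicked').
token(3, 'the').
token(4, 'ball').

lemma(1, 'he').
lemma(2, 'kick').
lemma(3, 'the').
lemma(4, 'ball').

% Stand-in for the shorthand form: true if any token has this lemma.
lemma(Lemma) :-
	lemma(_TokenId, Lemma).

% A user rule. If "mentions_kicking" were requested as an extractor,
% a match would be reported when this goal succeeds against the facts.
mentions_kicking :-
	lemma('kick'),
	lemma('ball').
```

Against these facts the query mentions_kicking succeeds. In a real request the facts and the shorthand lemma/1 come from TextRazor, so only the last rule would be submitted.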

Sample Applications

Classification

TextRazor's statistical topic classifier and entity extractor provide a powerful general purpose tagging solution for search, monitoring, semantic tagging and content creation applications. You may, however, have an existing category ontology or a specific set of topics you're interested in identifying.

In the following rules we're aiming to classify web articles as being about soccer or Olympic soccer. You'll see how we combine multiple TextRazor annotations to decide whether a document is about football or not. You'll also see how simple rules can be easily combined into concise but powerful extraction algorithms.

% Any document with an explicit 'Association football' topic tag is about_football.
about_football :-
	topic('Association football').

% Or a document that simply mentions the word 'soccer'.
about_football :-
	token('soccer').

% Or a document that has been given the 'Leisure' category, and mentions some general
% ball game words, but which doesn't mention other ball games we're not interested in.
about_football :-
	coarse_topic('Leisure'),
	or(
		(lemma('kick'), lemma('ball')),
		(lemma('pass'))
	),
	not lemma('quarterback'),
	not lemma('rugby').

% Or any document that explicitly mentions a soccer player (any will do!)
about_football :-
	entity_freebase_type(EntityId, '/people/person/soccer/football_player/').

% Any document that has been tagged as about_football, and which also mentions
% the Olympics, is about_olympic_football.
about_olympic_football :-
	about_football,
	or(
		sequence('London', '2012'),
		lemma('olympic')
	).
				

Custom Entities/Product Names

If you're interested in extracting facts about a specific company or its products, our general purpose knowledgebase may not have enough information to disambiguate them automatically. We can add product names, codes and disambiguating information to TextRazor with custom rules. Here we're adding the 'Product' parameter to our rule, which is matched in the rule body and returned to your application. Note that Prolog treats any word starting with a capital letter as a variable. Whenever the rule you have requested binds a variable, all matching combinations are returned in the response.

% Match a product of interest as any proper noun followed by the word 'Meal',
% contained in a document that we think is about McDonald's.
interesting_product(Product) :-
	sequence(part_of_speech(Product, 'NNP'), 'Meal'),
	about_mcdonalds.

about_mcdonalds :-
	entity_id('Mcdonalds').

about_mcdonalds :-
	sequence('Big', 'Mac').

about_mcdonalds :-
	lemma('mcdonalds').

			

Debugging/Diagnostics

Error and warning information from TextRazor and the Prolog interpreter is returned to the caller in the 'customAnnotationOutput' field of the main response.

This field also contains any printed output from your program. To add to the output you can call write(+ToWrite) from your program; nl writes a newline.
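
As a rough sketch of how printed output can help with debugging, the rule below writes every lemma it sees. The lemma/2 facts and the rule name debug_lemmas are hypothetical stand-ins so the example runs on its own; write/1 and nl/0 are standard Prolog built-ins.

```prolog
% Hypothetical stand-in facts; in a real request these come from TextRazor.
lemma(1, 'olympic').
lemma(2, 'football').

% Print every lemma with its token ID. The first clause fails after each
% line so Prolog backtracks to the next fact; the second clause makes the
% rule succeed once all facts have been printed.
debug_lemmas :-
	lemma(TokenId, Lemma),
	write(TokenId), write(': '), write(Lemma), nl,
	fail.
debug_lemmas.
```

Run against the two facts above, this prints "1: olympic" and "2: football" on separate lines. Inside TextRazor the same text would be expected to appear in the 'customAnnotationOutput' field described above.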

Dependencies

TextRazor automatically computes the processing dependencies of your program, so they do not need to be explicitly specified as extractors unless you want their output in the response. For example, if your script references named entities, TextRazor will automatically run its entity recognition logic without you needing to manually add the "entities" extractor to your request.

Bindings and the client libraries

In many applications we're interested in identifying more than whether a certain rule matched or not. TextRazor allows you to return additional context from your rules through their arguments. In the above example interesting_product has a "Product" argument, which is matched in the rule. When this rule is triggered in your document, the matching parameter value is returned to your application. This has a number of uses. For classification projects you can return multiple tags from the same rule. For extraction projects you can precisely identify words, entities or other annotations of interest.

The name of the parameter is returned in the response. Where an explicit name can't be automatically derived, the parameter is named 'Anonymous'. All unique sets of parameters that match your rules will be returned to you in the TextRazor response. TextRazor automatically identifies the type of the parameter and returns the equivalent type in the response.

Where a TextRazor internal ID is matched (for example a Word ID or Entity ID), a link is generated to the relevant annotation in the response.

Lists in the output are automatically flattened to a single list, so for example president([['Barack'], ['Obama']]) will be interpreted the same as president(['Barack', 'Obama']). This makes it easy to combine the output of several rules without worrying about the format of the output.
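
The flattening behaviour can be illustrated with the standard flatten/2 predicate from library(lists). This is only a sketch of the described effect, not TextRazor's own implementation, and the helper name same_after_flattening is made up for the example.

```prolog
:- use_module(library(lists)).

% Two list structures are treated as equivalent here if they
% flatten to the same single-level list.
same_after_flattening(A, B) :-
	flatten(A, Flat),
	flatten(B, Flat).
```

With this definition, the query same_after_flattening([['Barack'], ['Obama']], ['Barack', 'Obama']) succeeds, since both sides flatten to ['Barack', 'Obama'].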

Restrictions

In order to ensure the scalability of your requests, and to preserve the integrity of the TextRazor backend, we place limits on certain functionality from your rules.

  • No filesystem access beyond the YAP standard library.
  • 500 MB memory cap per request.
  • 3 second timeout per request.
  • Limited access to system calls and machine hardware.

We find that the vast majority of applications fit within this system, but if you have more specific requirements please get in touch at support@textrazor.com.

Reference

TextRazor's full set of extracted metadata instances is exposed through a range of Prolog predicates and built in helper rules. Each instance has an ID, which can be used to join with others. When an instance ID is returned in a parameter it's returned to the caller and linked to the relevant type in the client API.

For convenience each rule has two forms: one that links the fact to its ID, and a shorthand form that just checks for the existence of the fact without matching the ID.
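
As a minimal sketch of the difference between the two forms, the rules below join token/2 and part_of_speech/2 on a shared token ID, then use the shorthand form for a simple existence check. The facts are hypothetical stand-ins (invented IDs and tags for "Obama visited London"), as are the rule names proper_noun and has_proper_noun.

```prolog
% Hypothetical facts; in a real request these are supplied by TextRazor.
token(1, 'Obama').
token(2, 'visited').
token(3, 'London').

part_of_speech(1, 'NNP').
part_of_speech(2, 'VBD').
part_of_speech(3, 'NNP').

% Stand-in for the shorthand form, which ignores the ID.
part_of_speech(Tag) :-
	part_of_speech(_TokenId, Tag).

% ID form: the shared TokenId joins the two facts, so Token is the
% text of a token that is tagged as a proper noun.
proper_noun(TokenId, Token) :-
	part_of_speech(TokenId, 'NNP'),
	token(TokenId, Token).

% Shorthand form: only asks whether any proper noun exists.
has_proper_noun :-
	part_of_speech('NNP').
```

Here proper_noun(Id, Token) yields the pairs (1, 'Obama') and (3, 'London') on backtracking, while has_proper_noun simply succeeds.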

TextRazor runs standard ISO Prolog using the YAP interpreter. This makes the whole YAP standard library available for use in your rules where needed; see the YAP documentation for details.


or(Expression,...).

Must match at least one of Expression. This is strictly equivalent to the Prolog (ExpressionA ; ExpressionB ; ExpressionC ; ...). Note that any rules with the same name are implicitly treated as an 'or', so it may be cleaner to separate complex or statements into multiple rules.

and(Expression,...).

Must match all of Expression. This is strictly equivalent to the Prolog (ExpressionA, ExpressionB, ExpressionC,...). Multiple terms in the same rule separated by a ',' are implicitly treated as an 'and'.
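
To make the equivalences concrete, here is one condition written three ways: with the or helper, with a plain Prolog disjunction, and as separate clauses. The lemma facts, the two-argument or/2 definition and the ball_game_* rule names are hypothetical stand-ins so the sketch runs outside TextRazor, where or(...) is already provided.

```prolog
% Hypothetical stand-ins for facts and helpers that TextRazor provides.
lemma(1, 'pass').
lemma(2, 'ball').
lemma(Lemma) :- lemma(_TokenId, Lemma).
or(A, B) :- ( call(A) ; call(B) ).

% 1. Helper form.
ball_game_a :-
	or(
		(lemma('kick'), lemma('ball')),
		lemma('pass')
	).

% 2. Plain Prolog: ',' is 'and', ';' is 'or'.
ball_game_b :-
	(   lemma('kick'), lemma('ball')
	;   lemma('pass')
	).

% 3. Separate clauses with the same name are implicitly an 'or'.
ball_game_c :- lemma('kick'), lemma('ball').
ball_game_c :- lemma('pass').
```

All three rules succeed on these facts because 'pass' is present, even though 'kick' is not. The third style is often the easiest to read and extend.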


token(TokenId,Token).
token(Token).

Matches a single raw token string.

part_of_speech(TokenId,PartOfSpeech).
part_of_speech_overlap(OtherAnnotationId,PartOfSpeech).
part_of_speech(PartOfSpeech).

Matches a token's part of speech. part_of_speech_overlap matches other TextRazor annotations that have this PartOfSpeech.

position(TokenId,Position).
position_overlap(OtherAnnotationId,Position).

Matches a token's position in the document. position_overlap matches other TextRazor annotations' position.

stem(TokenId,Stem).
stem_overlap(OtherAnnotationId,Stem).
stem(Stem).

Matches a token's stemmed form. stem_overlap matches the stems used in words that make up other TextRazor annotations.

lemma(TokenId,Lemma).
lemma_overlap(OtherAnnotationId,Lemma).
lemma(Lemma).

Matches a token's morphological root. lemma_overlap matches the lemmas used in words that make up other TextRazor annotations.

parent(TokenId,ParentTokenId).
parent_overlap(OtherAnnotationId,ParentTokenId).
parent(TokenId,ParentTokenId,RelationToParentStr).

Matches a token's parent ID in the dependency tree. Optionally matches the grammatical relation to the parent. parent_overlap matches the parents of words used in other TextRazor annotations.
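
A small sketch of walking the dependency tree with parent/3: the rule finds tokens that are the grammatical subject of a given verb. The facts describe the invented sentence "Dogs chase cats", and the relation labels 'nsubj' and 'dobj' are assumptions about how the relation string is spelled; check the labels returned for your own documents.

```prolog
% Hypothetical facts; IDs, lemmas and relation labels are made up.
lemma(1, 'dog').
lemma(2, 'chase').
lemma(3, 'cat').

parent(1, 2, 'nsubj').
parent(3, 2, 'dobj').

% SubjectId is a token whose parent is the verb 'chase' and whose
% relation to that parent is (assumed to be) 'nsubj'.
chaser(SubjectId) :-
	lemma(VerbId, 'chase'),
	parent(SubjectId, VerbId, 'nsubj').
```

On these facts chaser(SubjectId) binds SubjectId to 1, the token for "Dogs".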


sequence(Term,...).

Matches a sequence of tokens in their order in the document. Useful for matching specific product names, people etc. Parameters can be any of the token rules above, with or without the matching Token ID.


% Matches the literal 'Barack Hussein Obama'.  Doesn't return a link to the matched text, but useful
% in classification problems.
president :-
	sequence('Barack', 'Hussein', 'Obama').

president :-
	sequence(lemma('president'), 'Obama').

% Matches a likely president name, matching two proper nouns occurring after the lemma 'president', and returning
% links to the matching tokens.
president(PresidentName) :-
	sequence(lemma('president'), part_of_speech(PresidentFirstName, 'NNP'), part_of_speech(PresidentSurname, 'NNP')),
	PresidentName = [PresidentFirstName, PresidentSurname].

% Matches all part of speech tags that occur after the lemma 'president', eg 'NNP'.
president_tag(Tag) :-
	sequence(lemma('president'), part_of_speech(Tag)).


sentence(SentenceId,[WordId,...]).
sentence_overlap(SentenceId,[OtherAnnotationId]).

Matches a sentence ID with the list of words that are part of it. Optionally matches one of the other TextRazor annotations that are from this sentence.

sentence_position(SentenceId,SentencePositionInt).

Matches a sentence ID with its position in the processed document.

sentence_root(SentenceId,WordId).

Matches a sentence ID with the word at the root of the sentence.


relation(RelationId,[WordId,...]).

Matches a relation with a list of words that make up its predicate.

relation(RelationId,RelationPredicateWords,RelationTypeA,RelationWordsA,RelationTypeB,RelationWordsB).
relation(RelationPredicateWords,RelationTypeA,RelationWordsA,RelationTypeB,RelationWordsB).
relation(RelationId,RelationPredicateWords,RelationWordsA,RelationWordsB).
relation(RelationPredicateWords,RelationWordsA,RelationWordsB).
relation_overlap(RelationId,[OtherAnnotationId],RelationTypeA,[OtherAnnotationId],RelationTypeB,[OtherAnnotationId]).
relation_overlap([OtherAnnotationId],RelationTypeA,[OtherAnnotationId],RelationTypeB,[OtherAnnotationId]).
relation_overlap(RelationId,RelationPredicateWords,[OtherAnnotationId],[OtherAnnotationId]).
relation_overlap([OtherAnnotationId],[OtherAnnotationId],[OtherAnnotationId]).

Matches a relation with a list of words that make up its predicate, and various combinations of params and param types where needed. The relation_overlap rules allow more flexible matching of relations between different annotation types.

relation_param(RelationId,RelationType,[WordId,...]).
relation_param(RelationId,[WordId,...]).
relation_param_overlap(RelationId,RelationType,[OtherAnnotationId]).

Matches a relation with one of its parameters. Matches the type of the param (SUBJECT or OBJECT), and a list of Word IDs that make up the parameter. relation_param_overlap matches params containing words that make up other TextRazor annotations.
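
As a hedged sketch of combining relation/2 with relation_param/3, the rule below returns the subject words of any relation whose predicate contains the lemma 'acquire'. The facts model the invented sentence "Google acquired YouTube"; the IDs and the rule name acquirer are made up, and representing the param type as the atom 'SUBJECT' is an assumption based on the description above.

```prolog
:- use_module(library(lists)).

% Hypothetical facts; in a real request these come from TextRazor.
lemma(1, 'google').
lemma(2, 'acquire').
lemma(3, 'youtube').

relation(100, [2]).
relation_param(100, 'SUBJECT', [1]).
relation_param(100, 'OBJECT', [3]).

% SubjectWords is the list of word IDs forming the subject of a
% relation whose predicate words include the lemma 'acquire'.
acquirer(SubjectWords) :-
	relation(RelationId, PredicateWords),
	member(WordId, PredicateWords),
	lemma(WordId, 'acquire'),
	relation_param(RelationId, 'SUBJECT', SubjectWords).
```

On these facts acquirer(SubjectWords) binds SubjectWords to [1]. Because the result contains word IDs, TextRazor would be expected to link it back to the matching words in the response.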


property(PropertyId,[TargetWordId,...],[SourceWordId,...]).
property_overlap(PropertyId,[OtherAnnotationTargetId],[OtherAnnotationSourceId]).

Matches a property with a list of words that make up the target and source words. property_overlap matches properties containing words that make up other TextRazor annotations.


noun_phrase(NounPhraseId,[WordId,...]).
noun_phrase_overlap(NounPhraseId,[OtherAnnotationId]).

Matches a noun phrase with a list of words. noun_phrase_overlap matches noun phrases that contain words used in other annotations.


entailment(EntailmentId,SourceWordId,,EntailedWordStr).
entailment_overlap(EntailmentId,[OtherAnnotationSourceId],,EntailedWordStr).
entailment(SourceWordId,,EntailedWordStr).

Matches a contextual entailment with a source and entailed word. entailment_overlap matches entailments whose source words are linked to other TextRazor annotations.

entailment_prior_score(EntailmentId,EntailmentPriorScore).

Matches a contextual entailment with its prior score.

entailment_context_score(EntailmentId,EntailmentContextualScore).

Matches a contextual entailment with its contextual score.

entailment_score(EntailmentId,EntailmentScore).

Matches a contextual entailment with its combined score.


topic(TopicId,TopicLabel).
topic(TopicLabel).

Matches a topic tag, optionally with its ID.

topic_score(TopicId,TopicScore).

Matches a topic's relevance score.

Matches a topic's Wikipedia link (if available).


coarse_topic(TopicId,TopicLabel).
coarse_topic(TopicLabel).

Matches a coarse topic tag, optionally with its ID.

coarse_topic_score(TopicId,TopicScore).

Matches a coarse topic's relevance score.

Matches a coarse topic's Wikipedia link (if available).


entity_id(EntityId,EntityIdLabel).
entity_id(EntityIdLabel).

Matches an entity ID label, optionally with its ID.

entity_tokens(EntityId,[WordId,...]).
entity_tokens_overlap(EntityId,[OtherAnnotationId]).

Matches an entity with its tokens. entity_tokens_overlap matches entities whose tokens are used in other annotations.

entity_type(EntityId,EntityType).
entity_type(EntityType).

Matches an entity's DBpedia type, optionally with its ID.

entity_freebase_type(EntityId,EntityFreebaseType).
entity_freebase_type(EntityFreebaseType).

Matches an entity's Freebase type, optionally with its ID.

entity_confidence(EntityId,EntityConfidence).

Matches an entity's confidence score.
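
Scores can be used to prune results before they are returned. The sketch below keeps only entities of a given type whose confidence exceeds a threshold. Everything in the facts is hypothetical: the IDs, the labels, the 'Person' and 'Place' type names, the score values, and the 1.0 cut-off are illustrative, so tune them against real responses.

```prolog
% Hypothetical facts; in a real request these come from TextRazor.
entity_id(10, 'Barack Obama').
entity_id(11, 'Paris').

entity_type(10, 'Person').
entity_type(11, 'Place').

entity_confidence(10, 4.2).
entity_confidence(11, 0.3).

% EntityId is a person entity with confidence above the cut-off.
% The shared EntityId joins the type and confidence facts.
confident_person(EntityId) :-
	entity_type(EntityId, 'Person'),
	entity_confidence(EntityId, Confidence),
	Confidence > 1.0.
```

On these facts confident_person(EntityId) binds EntityId to 10 only.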

Matches an entity's Wikipedia link, if it has one.

entity_matchedtext(EntityId,EntityMatchedText).

Matches the raw text string that identified this entity.

entity_freebaseid(EntityId,EntityFreebaseId).

Matches the disambiguated Freebase ID for this entity.

entity_relevancescore(EntityId,EntityRelevanceScore).

Matches the relevance of this entity to the processed page.

entity_englishid(EntityId,EntityEnglishId).

Matches the English language Wikipedia article title for this entity.