Tutorials

TextRazor was designed to make any text classification or extraction project easy. Here we'll go over a few simple use cases to give you a starting point. Examples are given using our Python SDK for convenience, but the same concepts apply equally to other languages and the REST API.

Tutorial 1: Classification: Tag a Document

To get started we're going to use TextRazor to classify a web page and tag it with Entity links.

First we globally set our API key, which will be used by all future requests to uniquely identify our TextRazor account.

>>> import textrazor
>>> textrazor.api_key = "API_KEY_GOES_HERE"
>>>
>>> client = textrazor.TextRazor(extractors=["entities", "topics"])

Next we need to tell TextRazor what we'd like to extract from this text. We're only interested in the Entities and Topics and won't be needing any of the other TextRazor features, so the "entities" and "topics" extractors passed to the client above are all we need. Since we're fetching a raw web page we'll also want to strip all the boilerplate and HTML tags from the document.

>>> client.set_cleanup_mode("cleanHTML")

We're also interested in classifying the document into higher-level categories. We do this by adding the classifier name to the request:

>>> client.set_classifiers(["textrazor_newscodes"])

Now it's time to process the text. Since this story is a public web page we can just pass in its URL. We could also pass in raw text if we had it locally.

>>> response = client.analyze_url("http://www.bbc.co.uk/news/uk-politics-18640916")
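If we did have the article text locally, the equivalent call would be:

>>> response = client.analyze("Raw article text goes here...")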

Now that we have the response, we'll sort the entities to show the most relevant first, then print them out. Each entity can be mentioned more than once, so we check for duplicates as we go.

>>> entities = list(response.entities())
>>> entities.sort(key=lambda x: x.relevance_score, reverse=True)
>>> seen = set()
>>> for entity in entities:
>>>     if entity.id not in seen:
>>>         print(entity.id, entity.relevance_score, entity.confidence_score, entity.freebase_types)
>>>         seen.add(entity.id)

Debt 0.720049 1.53685 ['/media_common/quotation_subject', '/book/book_subject']
Barclays 0.705099 1.77573 []
Bank 0.635486 1.35011 ['/book/book_subject', '/architecture/building_function']
Mervyn King (economist) 0.619704 2.26693 ['/people/person', '/government/politician', '/business/board_member']
Bill (proposed law) 0.619034 1.03724 ['/tv/tv_subject']
Chancellor of the Exchequer 0.607768 0.952038 ['/government/government_office_or_title']
David Cameron 0.5875 4.18796 ['/people/person', '/film/actor', '/government/politician', '/tv/tv_actor', '/film/person_or_entity_appearing_in_film']
Risk 0.563535 0.98141 ['/media_common/quotation_subject', '/book/book_subject']
Citigroup 0.444129 3.39143 ['/business/employer', '/venture_capital/venture_investor', '/business/sponsor', '/award/ranked_item', '/business/issuer', '/business/business_operation', '/organization/organization', '/business/board_member', '/organization/organization_partnership']
Libor 0.436299 2.08194 []

Since we added the topics extractor, the topics property will also be a part of the response. These are similar to the Entity tags, but contain more general themes that aren't explicitly mentioned in the document:

>>> for topic in response.topics():
>>>     if topic.score > 0.3:
>>>         print(topic.label)

Bank
Libor
Barclays
Politics
Economics
Government
Banking
Business
Bank of England
Financial services
.....

Finally, since we added a classifier to the request we can also expect categories in the response. These are high-level categories organized into a fixed, more formal taxonomy:

>>> for category in response.categories():
>>>     print(category.category_id, category.label, category.score)

11000000 politics 0.936274
11006000 politics>government 0.899296
11024000 politics>politics (general) 0.856208
04017000 economy, business and finance>economy (general) 0.772154
08003000 human interest>people 0.661979
12006001 religion and belief>values>ethics 0.650902
04018000 economy, business and finance>business (general) 0.650722
14025000 social issue>social issues (general) 0.637389
04006000 economy, business and finance>financial and business service 0.637093
14000000 social issue 0.5628
04006002 economy, business and finance>financial and business service>banking 0.561675
.....

Entities, Topic Tags and Categories provide three different levels of abstraction to suit most classification requirements.

Advanced: Customization

One of our most powerful features is the ability to quickly tailor results to your specific project. In this example we build a custom classifier to support a hypothetical sports portal's faceted browsing feature.

Suppose TextRazor's default classifiers don't provide the granularity you need; let's create a new classifier:

>>> import textrazor
>>> textrazor.api_key = "API_KEY_GOES_HERE"

>>> manager = textrazor.ClassifierManager()
>>> csv_contents = open("sports_classifier.csv").read()

>>> manager.create_classifier_with_csv("my_sports", csv_contents)

This little bit of boilerplate uploads the contents of a CSV file in one go. The classifier definition lives in the file "sports_classifier.csv", which has the following format:

Category ID,Category Label,Category Query

The Category Query field controls our classification logic. The file sports_classifier.csv looks a bit like this:

1,Golf,concept('golf')
2,Tennis,concept('tennis')
3,Squash,concept('squash')
4,Cricket,concept('cricket')
5,Soccer,"or(concept('soccer'),concept('association football'))"
6,American Football,"or(concept('super bowl'),concept('NFL'),concept('american football'))"

You can augment your concept queries with standard boolean operators to help refine your categories. In the example above we've noticed that there is often only a subtle difference between 'American Football' and 'Soccer', so we have combined several American football concept queries with or().
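A query can also exclude a confusable concept outright. This is a hypothetical refinement, assuming the query language supports and() and not() operators alongside the or() and concept() shown above:

6,American Football,"and(or(concept('super bowl'),concept('NFL'),concept('american football')),not(concept('association football')))"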

To run our new classifier on a document, simply add the classifier name to our request:

>>> client = textrazor.TextRazor()

>>> client.set_classifiers(["my_sports"])
>>> client.set_cleanup_mode("cleanHTML")

Let's try it out with some web pages:

>>> response = client.analyze_url("http://us.cnn.com/2016/02/01/us/carolina-panthers-haters/index.html")
>>> for category in response.categories():
>>>     print(category.category_id, category.label, category.score)

6 American Football 0.757698
5 Soccer 0.418123

>>> response = client.analyze_url("http://www.bbc.co.uk/sport/football/35490461")
>>> for category in response.categories():
>>>     print(category.category_id, category.label, category.score)

5 Soccer 0.650844
6 American Football 0.364633
4 Cricket 0.230308

It's a good idea to filter the results with a minimum score threshold that works well on your data; here 0.5 would be a good choice.
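For example, keeping only the categories above that threshold:

>>> threshold = 0.5
>>> for category in response.categories():
>>>     if category.score > threshold:
>>>         print(category.category_id, category.label, category.score)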

>>> response = client.analyze_url("http://edition.cnn.com/2016/01/31/tennis/australian-open-tennis-djokovic-murray/index.html")
>>> for category in response.categories():
>>>     print(category.category_id, category.label, category.score)

2 Tennis 0.650973
3 Squash 0.405858
1 Golf 0.294451
5 Soccer 0.258401
4 Cricket 0.236229

Once more, TextRazor correctly classifies this article as "Tennis" related. There is slight confusion with similar sports, but those results are filtered out by our threshold.

Behind the scenes TextRazor uses its powerful knowledge graph and machine learning algorithms to match your queries to the meaning of your document. In this case TextRazor already knows about different sports players, teams, and places, and can use this knowledge to help match very simple queries. It also knows about the types of words associated with each concept, so the words "racquet", "net", and "doubles" boost the engine's confidence that the document is about a racquet sport like Tennis or Squash.

If you need any help getting started with more complex taxonomies, please contact us; we'd be happy to help.

Tutorial 2: Information Extraction: Find Merger and Acquisition Activity

In this example we're looking for mentions of companies being sold. Specifically, we'd like to find all relations involving a "buying" word and two companies.

First let's create the TextRazor client as before, but this time we're looking for relations as well as entities. We also need the "words" extractor to return the words each relation is linked to, and the "entailments" extractor, which we'll use in the customization section below.

>>> client = textrazor.TextRazor(extractors=["words", "entities", "entailments", "relations"])

We're only interested in companies. Fortunately, TextRazor allows us to filter the returned entities by their Freebase or DBpedia ID. Freebase and DBpedia are large, comprehensive knowledge bases of real-world data available in an open, structured format. Each entity is tagged with a type taxonomy, which TextRazor can use to filter the entities returned. Here we'll filter to entities that match either the Freebase "organization" type or the DBpedia "Company" type. Occasionally Freebase or DBpedia miscategorizes an entity; by adding filters for both we can make sure we don't miss anything.

>>> client.set_entity_freebase_type_filters(["/organization/organization"])
>>> client.set_entity_dbpedia_type_filters(["Company"])

Now we're ready to make the request.

>>> response = client.analyze("Spain's stricken Bankia expects to sell off its vast portfolio of industrial holdings that includes a stake in the parent company of British Airways and Iberia.")

With the response ready to parse, we loop over the relations and find all those with a purchase word as a lemma in the predicate. The lemma is the morphological root of a word, a normalized form that in this case will match "sold" as well as "sell".

>>> buy_relations = []
>>> for relation in response.relations():
>>>     for word in relation.predicate_words:
>>>         if word.lemma in ("sell", "buy", "acquire"):
>>>             buy_relations.append(relation)
>>>             break

Now we have a set of sell relationships. However, we're only interested in those between two companies, so we'll filter them further:

>>> for relation in buy_relations:
>>>     entity_params = []
>>>     for param in relation.params:
>>>         all_entities = list(param.entities())
>>>
>>>         if all_entities:
>>>             entity_params.append(all_entities[0])
>>>
>>>     if len(entity_params) > 1:
>>>         print("Found valid sell relationship between: ", entity_params)

Advanced: Customization

Often you'll want to use several patterns like the one above, but expressing that logic in your code can quickly get messy. Instead, you can build the logic directly into TextRazor. Our query is the same as above, but we'll extend it to check for words that also entail 'buy'.

>>> rules_str = """
% Match two companies in a 'buy' relation.
acquisition_rumor(CompanyA, CompanyB, EntailedWord) :-
    entity_type(CompanyA, 'Company'),
    entity_type(CompanyB, 'Company'),
    relation_overlap(BuyRelation, 'SUBJECT', CompanyA, 'OBJECT', CompanyB),
    entailment_overlap(_, BuyRelation, EntailedWord),
    member(EntailedWord, ['buy', 'sell', 'acquire']).
"""

>>> import textrazor
>>> client = textrazor.TextRazor(extractors=["acquisition_rumor"])
>>> client.set_rules(rules_str)
>>> response = client.analyze("Spain's stricken Bankia expects to sell off its vast portfolio of industrial holdings that includes a stake in the parent company of British Airways and Iberia.")
>>> print(response.matching_rules())

['acquisition_rumor']

>>> for rumor in response.acquisition_rumor:
>>>     print(rumor)

{'CompanyA': entity('Bankia'), 'CompanyB': entity('British Airways')}
{'CompanyA': entity('Bankia'), 'CompanyB': entity('Iberia')}


Tutorial 3: Review Opinion Extraction

There's a vast amount of useful feedback hidden away in product reviews, but separating the useful insight from the noise can be tricky. Here we use TextRazor to parse a page of Amazon reviews and extract opinions about the sound quality of a new TV. In this example we'll look at a single page, but in practice we'd want to look at a number of different pages to reduce the noise and identify the most common comments on this TV's sound quality; a sketch of that aggregation appears at the end of this tutorial.

Create the TextRazor client and set the extractors. Here we are passing in raw HTML, so we'll also tell TextRazor to strip all the boilerplate and tags.

>>> client = textrazor.TextRazor(extractors=["words", "relations"])
>>> client.set_cleanup_mode("cleanHTML")
>>>
>>> url = "http://www.amazon.com/LG-42LM6200-42-Inch-LED-LCD-Glasses/product-reviews/B006ZH0JW6/ref=cm_cr_dp_see_all_btm?ie=UTF8&showViewpoints=1&sortBy=bySubmissionDateDescending"
>>> response = client.analyze_url(url)

Now we process the response. First we'll look for "sound" mentioned as a "property" in the response. This part of the response gives us the "has a" and "is a" relationships in the text. Since we're processing user reviews, we'll assume these relationships contain the user's opinions of the specific product feature. We might later want to filter out relations describing known facts about this TV's sound ("10w speakers", for example); a simple filter is sketched after the code below.

Individual words don't mean much on their own, so we'll extract the phrases linked to each property.

>>> for prop in response.properties():
>>>     for word in prop.predicate_words:
>>>         if word.lemma == "sound":
>>>             for property_word in prop.property_words:
>>>                 for phrase in property_word.noun_phrases:
>>>                     print(phrase)
>>>             break
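As noted above, some of these phrases will describe known facts rather than opinions. Here's a minimal sketch of one way to drop them, assuming phrases that mention numbers ("10w speakers") usually state specifications; looks_like_spec is our own helper, not part of the SDK:

>>> import re
>>>
>>> def looks_like_spec(phrase_text):
>>>     # Rough heuristic: digits usually signal a spec ("10w speakers"),
>>>     # not an opinion. Tune this to your own data.
>>>     return bool(re.search(r"\d", phrase_text))
>>>
>>> for prop in response.properties():
>>>     if any(word.lemma == "sound" for word in prop.predicate_words):
>>>         for property_word in prop.property_words:
>>>             for phrase in property_word.noun_phrases:
>>>                 if not looks_like_spec(str(phrase)):
>>>                     print(phrase)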

Extracting properties is a good start, but we've missed a couple of expressions where the sound quality is mentioned as a verb ("The TV sounds like...", for example). To capture these we'll loop over the relations and find those with "sound" in the predicate.

>>> for relation in response.relations():
>>>     for word in relation.predicate_words:
>>>         if word.lemma == "sound":
>>>             print(relation)
>>>             break
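Finally, as mentioned at the start of this tutorial, in practice we'd aggregate these phrases over many review pages and surface the most common ones. A minimal sketch, assuming review_urls holds the review page URLs you've collected (here just the single url from above as a placeholder):

>>> from collections import Counter
>>>
>>> review_urls = [url]  # add further review page URLs here
>>> phrase_counts = Counter()
>>> for review_url in review_urls:
>>>     page_response = client.analyze_url(review_url)
>>>     for prop in page_response.properties():
>>>         if any(word.lemma == "sound" for word in prop.predicate_words):
>>>             for property_word in prop.property_words:
>>>                 for phrase in property_word.noun_phrases:
>>>                     phrase_counts[str(phrase)] += 1
>>>
>>> for phrase_text, count in phrase_counts.most_common(10):
>>>     print(count, phrase_text)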

Contact

Let us know what you're working on; we'd be happy to help. Ready to get started? Sign up now to get an API key.