The Natural Language Processing (NLP) Group at Stanford University maintains an open suite of language analysis tools that are available for the public to use. Most of the tools only support English, but some also have models for Chinese, Spanish, German, and Arabic. This tutorial focuses on the English tool sets, specifically the Named Entity Recognizer and the Parts of Speech Tagger. These tools are helpful if you want to pinpoint and extract specific locations or organizations from a text, examine the complexity of sentence structure, or look for hesitations in transcripts of English as a second language learners to see where they pause the longest. This technology has many applications in research and learning.
Named Entity Recognizer
The Named Entity Recognizer (or NER) will label words in the text that are names of things, such as a person, organization, location, and even gene and protein names. The output once your text is run through NER will look something like the image below with the NER output on the left and the Terminal output on the right:
Parts of Speech Tagger
The Parts of Speech Tagger lets you paste in large quantities of text and assigns a part of speech, such as noun, verb, or adjective, to each word. The English model used in this tutorial is reported to tag with 96.97% accuracy. The output will look something like what you see below:
Let’s get started with these tools!
You’ll need Java version 1.8 or later installed on your computer to run the Stanford NLP software. If you are not sure whether Java is already installed, open a terminal and run java -version. To install Java, go to Oracle’s website, accept the license terms, and then choose the download that matches your operating system.
Here are some additional instructions on how to install Java if you run into difficulties.
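If you would rather check the Java requirement from a script, here is a minimal Python sketch. It assumes the usual shape of the text printed by java -version (for example a line containing version "1.8.0_281" or version "17.0.1"); the function names are made up for this tutorial and are not part of the Stanford tools.

```python
import re
import shutil
import subprocess


def java_major_version(version_output):
    """Return the major Java version found in `java -version` text, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    first = int(match.group(1))
    second = match.group(2)
    # Older releases use the "1.x" scheme, so "1.8.0_281" means Java 8.
    if first == 1 and second is not None:
        return int(second)
    return first


def installed_java_major():
    """Ask the local `java` executable for its version (None if not found)."""
    if shutil.which("java") is None:
        return None
    result = subprocess.run(["java", "-version"], capture_output=True, text=True)
    # `java -version` writes to stderr, so look at both streams.
    return java_major_version(result.stderr + result.stdout)


if __name__ == "__main__":
    major = installed_java_major()
    if major is None:
        print("Java was not found on this machine.")
    elif major >= 8:
        print(f"Java {major} is installed, which is new enough.")
    else:
        print(f"Java {major} is installed, but version 1.8 or later is needed.")
```

The version parsing is kept separate from the call to java so that it can be tried on sample strings even on a machine without Java.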
Part 1: Using the Named Entity Recognizer (NER)
Download the Named Entity Recognizer (NER) Software
The NER software is free to use, and you can download it here.
Make sure to save the NER files on your Desktop or in another easily accessible place on your computer. Once the file is done downloading, unzip it by double clicking it:
I like to rename the unzipped folder to just stanford-ner, so that it’s easier to reach from the Terminal window.
Using NER Through Terminal
Next open up Terminal and navigate to the stanford-ner folder.
To access Terminal on a Mac or Command Prompt on Windows you can check out the tutorials below:
- If you have a Mac check out this video to learn more
- If you have Windows 8 check out this video to learn more
- This post shows how to open the command prompt for pre-Windows 8 systems
After you’re in the stanford-ner folder in Terminal, copy and paste the following into the Terminal window:
Doing this should cause the Stanford Named Entity Recognizer to open:
Inside this box you can delete the current text and paste in your own. Next we need to load a classifier, which is a machine learning model that assigns each item to one of k classes. Here the items are the words in your text and the classes are entity types such as person, organization, and location. To do this go to “Classifier” and “Load CRF from File”:
Next, select the “english.muc.7class.distsim.crf.ser.gz” classifier from the “classifiers” folder and click “Open”:
Several tags should now appear on the right hand side of the NER window, and the “Run NER” button at the bottom should now be enabled. Go ahead and click it.
After you click “Run NER” two things should happen. First, the NER window should highlight each recognized entity in the text so that it matches the corresponding tag on the right, like so:
Second, the Terminal window should list the tags that were found for location, organization, date, money, person, time, etc.:
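Once you have tagged text, you may want to pull the entities out as a list. The Python sketch below assumes the tokens are in a word/LABEL form with O marking words that are not entities; that format, the sample sentence, and the function name are assumptions made for illustration, and the listing in your Terminal window may look different.

```python
from itertools import groupby


def extract_entities(tagged_text):
    """Group consecutive tokens that share a label into (entity, label) pairs."""
    tokens = []
    for token in tagged_text.split():
        # Split on the LAST slash so words that contain "/" stay intact.
        word, sep, label = token.rpartition("/")
        if sep:
            tokens.append((word, label))

    entities = []
    for label, group in groupby(tokens, key=lambda pair: pair[1]):
        if label != "O":  # "O" is assumed to mean "not an entity"
            entities.append((" ".join(word for word, _ in group), label))
    return entities


# Hypothetical sample, not real output from the tool.
sample = (
    "Jane/PERSON Doe/PERSON joined/O Stanford/ORGANIZATION "
    "University/ORGANIZATION in/O California/LOCATION ./O"
)

for entity, label in extract_entities(sample):
    print(f"{label}: {entity}")
```

One limitation of this simple grouping: two different entities with the same label that sit directly next to each other will be merged into one.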
And you’re done learning how to use Stanford’s Named Entity Recognizer! Now on to the Parts of Speech Tagger.
Part 2: Using the Parts of Speech Tagger
Download the Parts of Speech Tagger
If you need to tag the parts of speech in your document, you can download the tagger here.
Go ahead and click the “basic English Stanford Tagger” since we’ll only be analyzing text in English.
Many of the steps here are similar to the ones described above. This tagger uses the ‘english-left3words-distsim.tagger’ model, which is reported to reach 96.97% accuracy; results on your own text may differ depending on how closely it resembles the text the model was trained on. You can find answers to common questions about the Parts of Speech Tagger here.
Using the Parts of Speech Tagger Through Terminal
Open up a Terminal window and navigate to the “stanford-postagger” folder that you just downloaded. There are instructions above on how to use Terminal and navigate to a folder using it. Once you’re in the folder, copy and paste the following command into the Terminal window:
Once this line of code finishes running, the following window will appear:
You can copy and paste the text you’d like to tag into the first text box and click “Tag Sentence!”
The output will look something like this:
You’ll notice that each part of speech tag is attached to its word with an “_”. The tags come from the University of Pennsylvania (Penn) Treebank tag set, and the University of Leeds has a good key for decoding them available here (e.g. JJ = adjective, NN = noun).
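If you copy the tagged output into a script, the word_TAG format is easy to take apart. The Python sketch below splits each token on its last underscore and counts how often each tag appears. The sample sentence, the function name, and the small lookup table are illustrative assumptions; the table covers only a handful of Penn Treebank tags.

```python
from collections import Counter


def parse_tagged(text):
    """Split tagger output like 'The_DT dog_NN' into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST underscore so words containing "_" stay intact.
        word, sep, tag = token.rpartition("_")
        if sep:  # skip tokens that carry no tag
            pairs.append((word, tag))
    return pairs


# A few Penn Treebank tags for illustration; the full tag set is much longer.
PENN_TAGS = {
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "NN": "noun, singular",
    "NNS": "noun, plural",
    "VBZ": "verb, 3rd person singular present",
}

# Hypothetical sample, not real output from the tool.
sample = "The_DT quick_JJ fox_NN jumps_VBZ over_IN lazy_JJ dogs_NNS ._."

pairs = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)

for tag, count in tag_counts.most_common():
    print(f"{tag} ({PENN_TAGS.get(tag, 'not in this short table')}): {count}")
```

Counts like these are one simple starting point for questions such as how adjective-heavy a text is.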
If you’d like to learn more about Stanford’s Natural Language Processing software and other free software tools, visit their home site, which links to additional resources as well.
Thanks for reading!