Extracting Entities in RapidMiner Studio™ with Rosette


It’s never been easier to access state-of-the-art text analytics, code-free. Check out our Rosette Text Toolkit extension for RapidMiner, a popular open source predictive analytics platform, and plug the power and accuracy of Rosette text analytics directly into your RapidMiner workflows.
Get up and running with Rosette for RapidMiner Studio with this quick start guide, which covers the installation and setup process. We also demonstrate how to get started extracting and linking entities with Rosette.
Installing RapidMiner and Rosette
If you aren’t already running RapidMiner Studio, download the application on RapidMiner’s website. Then download the Rosette Text Toolkit extension through the RapidMiner marketplace and sign up for a Rosette API key.
Open RapidMiner Studio, navigate to the Extensions menu and select Marketplace.
A new window will open. Search for “rosette” and select Rosette Text Toolkit from the list of results. Click the Install 1 Packages button at the bottom of the window and follow the click-through instructions to complete the installation.
Once the extension has finished installing, the Rosette operators will be visible in the Extensions folder of the Operators panel.
Getting a Rosette API Key
To activate the Rosette Text Toolkit for RapidMiner Studio, you’ll need an API key and a Rosette developer account. Head over to developer.rosette.com and complete the signup process.
You can create an account linked to either your email or your GitHub account. No credit card is required: our default plan gives you 10,000 calls a day for free! If you’re interested in upping your call quota, check out our paid plans.
Once you have completed the signup process and verified your account, click on the API Key tab on the top left of the menu bar to display your key.
Setting up your Rosette API Connection
Back in RapidMiner Studio, input your Rosette API key to start using any of Rosette’s operators. We’ll be looking at the entity extraction operator in the next section, so we’ll use it to set up the Rosette API connection now.
First, locate Extract Entities in the Rosette Text Toolkit folder in the Operators panel and drag it to the Process panel.
You can see the various settings for the Extract Entities operator in the Parameters panel to the right of the Process panel. The first parameter is Connection. Click the Rosette icon to the right of the box.
The Manage Connections window will open. Click the Add Connection button on the bottom left and select Rosette Connection from the Connection type dropdown list. Name your new connection and click the Create button.
Select your new Rosette API connection from the list on the left and enter your Rosette API key in the API KEY box. Use the Test button at the bottom of the window to verify that your connection is working. If you run into any trouble, confirm that you have copied your API key correctly. When you are satisfied that everything is running smoothly, click the Save all changes button to return to the Parameters panel.
Select your new connection from the Connection dropdown list.
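If the Test button reports a problem and you want to rule out RapidMiner itself, you can check the key outside of the Studio. The Python sketch below only builds an authenticated request. It is a minimal illustration, not part of the extension: the ping URL and the X-RosetteAPI-Key header name are assumptions that do not come from this guide, so confirm them against the Rosette developer documentation before relying on them.

```python
import urllib.request

# Assumed values: the host, path, and header name are not taken from this guide.
ROSETTE_PING_URL = "https://api.rosette.com/rest/v1/ping"
API_KEY_HEADER = "X-RosetteAPI-Key"


def build_ping_request(api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET request that carries the API key."""
    cleaned = api_key.strip()  # stray whitespace is a common copy-paste error
    if not cleaned:
        raise ValueError("API key is empty")
    return urllib.request.Request(
        ROSETTE_PING_URL,
        headers={API_KEY_HEADER: cleaned, "Accept": "application/json"},
    )


def send(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code (needs network access)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling send(build_ping_request("your-key-here")) should return a success status if the key is valid and the assumed endpoint is correct. An HTTP error at this point usually means the key was copied incorrectly.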
Extracting Entities
Now that you’ve installed the Rosette for RapidMiner extension and set up your API key and connection, it’s easy to get started using the Rosette operators. Let’s try entity extraction. We’ll use three operators to create a simple entity extraction workflow, or process: Create Document, Documents to Data, and Extract Entities. Drag these operators into the Process panel and connect them together, maintaining the order listed above. You can find the operators using the Operators Search Bar.
Select the Create Document operator. In the parameter panel, check the add label box. Under label type, select text and enter ‘my_text’ for label value. Click the Edit Text button at the top of the panel and copy the text below into the popup window.
“Bill Murray will appear in new Ghostbusters film: Dr. Peter Venkman was spotted filming a cameo in Boston this…http://dlvr.it/BnsFfS.”
Hit the Apply Changes button to save your work.
Now select the Documents to Data operator. In the Parameters panel, enter ‘my_text’ in the text attribute field.
Execute the process using the blue “play” button. The results show five extracted entities. As you can see, Rosette correctly extracted both the names and the location included in the text.
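For readers curious about what a process like this might correspond to outside the Studio, here is a rough Python sketch of an equivalent direct call. It is an illustration under assumptions, not a description of how the extension works internally: the endpoint URL, the header name, and the "content", "entities", "mention", and "type" field names are not taken from this guide and should be checked against the Rosette API documentation.

```python
import json
import urllib.request

# Assumptions, not taken from this guide: endpoint URL, header name,
# and the "content" / "entities" / "mention" / "type" field names.
ENTITIES_URL = "https://api.rosette.com/rest/v1/entities"


def build_entities_request(api_key: str, text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request containing the text to analyze."""
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        ENTITIES_URL,
        data=body,
        headers={
            "X-RosetteAPI-Key": api_key,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
        method="POST",
    )


def entity_rows(response_json: dict) -> list:
    """Flatten a parsed response into (mention, type) rows, one per entity."""
    return [
        (entity.get("mention", ""), entity.get("type", ""))
        for entity in response_json.get("entities", [])
    ]
```

The entity_rows helper mirrors the one-row-per-entity layout of the RapidMiner results view, which makes it easier to compare the two outputs side by side.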
Let’s make our input text a little longer. Add the sentence below to the parameter text and rerun the process.
“Another original Ghostbuster, Dan Akroyd, is also confirmed to have a cameo in the film.”
From the results we can see that Rosette extracts Dan Akroyd’s name as expected. However, eagle-eyed readers may have noticed that “Akroyd” is misspelled. (It should be “Aykroyd.”) This is not uncommon. Name misspellings appear frequently, everywhere from personal blogs to the New York Times online. If you are trying to track a particular entity across a large collection of documents, you want to make sure that you are identifying all possible spellings of that entity’s name. Rosette automatically extracts and links entities with spelling variations and other textual anomalies, unifying them into a single entry.
To demonstrate this functionality, let’s enable Link Entities in the Extract Entities parameter panel.
Then, we’ll add a third line to the parameter text that includes the correct spelling of Dan Aykroyd’s name, like the one below:
“Actually, the correct spelling is Aykroyd.”
When we run the process again, a new QID column appears in the results. Notice that “Dan Akroyd” and “Aykroyd” have the same QID value: Rosette has correctly identified them as the same entity.
QID values are drawn from Wikidata, so if an entity has a Wikidata entry, Rosette should be able to link and resolve it.
QIDs are very useful for machine-reading purposes, but for humans they can be difficult to keep track of. Let’s turn on the Include Entity Name parameter, which will allow us to see the entity names in addition to their QIDs.
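Once results carry a QID, collapsing spelling variants into a single entry is a simple grouping step. The Python sketch below shows one way to do it on exported result rows. The "mention" and "qid" column names and the placeholder QID values are hypothetical, chosen only for illustration. They are not the actual column names or Wikidata identifiers from the process above.

```python
from collections import defaultdict


def group_mentions_by_qid(rows):
    """Map each QID to the distinct mentions that resolved to it, in first-seen order."""
    grouped = defaultdict(list)
    for row in rows:
        qid = row.get("qid")
        if not qid:
            continue  # unlinked entities have no QID to group on
        mention = row["mention"]
        if mention not in grouped[qid]:
            grouped[qid].append(mention)
    return dict(grouped)


# Hypothetical rows; the QID values are placeholders, not real Wikidata IDs.
rows = [
    {"mention": "Dan Akroyd", "qid": "Q-PLACEHOLDER-1"},
    {"mention": "Aykroyd", "qid": "Q-PLACEHOLDER-1"},
    {"mention": "Boston", "qid": "Q-PLACEHOLDER-2"},
    {"mention": "Ghostbusters", "qid": None},
]

print(group_mentions_by_qid(rows))
# {'Q-PLACEHOLDER-1': ['Dan Akroyd', 'Aykroyd'], 'Q-PLACEHOLDER-2': ['Boston']}
```

Grouping on the QID rather than on the mention text is what lets the misspelled and correctly spelled names end up in the same bucket.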
Try it Yourself
Now that you’ve got the Rosette Text Toolkit up and running with RapidMiner Studio, you are well equipped to handle a host of text analytics tasks. Incorporate results like the ones above into your pre-existing data processes, and check out our other operators, including Categorization, Sentiment Analysis, Morphological Analysis, Tokenization, Sentence Tagging, Name Translation, and Name Matching.
While you’re at it, keep us posted! We love to hear what our users are working on, and would be thrilled to share your Rosette for RapidMiner story on our blog.
Comments
What if I have a dataset of posts in a .csv file? Can I just use the "Read CSV" icon and the "Extract Entities" icon to do a simple entity extraction?