Natural Language Processing is a field that covers computer understanding and manipulation of human language, and it’s ripe with possibilities for newsgathering. You usually hear about it in the context of analyzing large pools of legislation or other document sets, attempting to discover patterns or root out corruption. I decided to take it into the kitchen for my latest project: The Times California Cookbook recipe database.
The first phase of the project, the holiday edition, launched with more than 600 holiday-themed recipes from The Times Test Kitchen. It’s a large number, but there’s much more to come next year – we have close to another 5,000 recipes staged and nearly ready to go.
With only four months between the concept stage of the site and launch, the Data Desk had a tight time frame and limited resources to complete two parallel tasks: build the website and prepare the recipes for publication. The biggest challenge was preparing the recipes, which were stored in The Times library archive as, essentially, unstructured plain text. Parsing thousands of records by hand was unmanageable, so we needed a programmatic solution to get us most of the way there.
We had a pile of a couple thousand records – news stories, columns and more – and each record contained one or more recipes. We needed to do the following:
- Separate the recipes from the rest of the story, while keeping the story intact for display alongside the recipe later.
- Determine how many recipes there were – more than one in many cases, and counts up to a dozen weren’t particularly unusual.
- For each recipe, find the name, ingredients, steps, prep time, servings, nutrition and more.
- Load these into a database, preserving the relationships between the recipes that ran together in the newspaper.
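The last step above implies a simple one-to-many data model. Here's a minimal sketch using Python dataclasses; the field names are hypothetical, and the real system used a proper database schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Recipe:
    name: str
    ingredients: List[str] = field(default_factory=list)
    steps: List[str] = field(default_factory=list)
    prep_time: str = ""
    servings: str = ""
    nutrition: str = ""

@dataclass
class Story:
    headline: str
    body: str  # the story kept intact for display alongside its recipes
    recipes: List[Recipe] = field(default_factory=list)  # recipes that ran with this story
```

Keeping the recipes attached to their parent story is what preserves the relationships between recipes that ran together in the newspaper.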
Where to start?
The well-worn path here at the Data Desk would be to write a parser that looks for common patterns in formatting and punctuation. You can break up the text line by line, then look for one or more regular expression matches on each line. It might go something like this:
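Here's a minimal sketch of that idea. The patterns below are made up for illustration; the real ones would be tuned to the formatting quirks of your archive:

```python
import re

# Hypothetical patterns for common recipe formatting, checked in order.
PATTERNS = [
    ("ingredient", re.compile(r"^\s*\d[\d/ ]*\s*(cup|tablespoon|teaspoon|pound|ounce)s?\b", re.I)),
    ("servings", re.compile(r"^\s*(serves|makes|yields?)\b", re.I)),
    ("prep_time", re.compile(r"\b(prep|total)\b.*\b\d+\s*(minute|hour)s?\b", re.I)),
    ("nutrition", re.compile(r"\b\d+\s*calories\b", re.I)),
]

def tag_line(line):
    """Return the first recipe field whose pattern matches the line, else None."""
    for field, pattern in PATTERNS:
        if pattern.search(line):
            return field
    return None
```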
Then you can make an attempt to tag each line of the story with a recipe field – description, name, ingredient, step, nutrition, etc. – and write another script to assemble those parts into recipes that can be loaded into a database.
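That assembly script might look something like this. It assumes hypothetical (tag, line) pairs from a line-by-line tagger, with a "name" tag starting each new recipe:

```python
def assemble_recipes(tagged_lines):
    """Group (tag, line) pairs into recipe dicts.

    A 'name' tag starts a new recipe; anything tagged before the first
    name -- such as the story text -- is simply skipped here.
    """
    recipes = []
    current = None
    for tag, line in tagged_lines:
        if tag == "name":
            current = {"name": line, "ingredients": [], "steps": []}
            recipes.append(current)
        elif current is not None and tag in ("ingredient", "step"):
            current[tag + "s"].append(line)
    return recipes
```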
After looking at a few records, it was immediately evident that pure regular expressions wouldn't be enough to parse them. We had decided to try to grab all of the recipes The Times had published from 2000 to the present, and the formatting and structure varied enormously over the years. We needed natural language processing and machine learning to parse them.
Enter NLTK
[Image: a raw recipe record as it came out of our library archive]
Natural language processing is a big field, and you can do a lot with it – the vast majority of which I will not cover here. Python, my programming language of choice, has an excellent library for natural language processing and machine learning called the Natural Language Toolkit, or NLTK, which I primarily used for this process. The image above shows what the raw recipes looked like coming out of our library archive.
One of the more common uses of NLTK is tagging text. You could, for example, have it tag a news story with topics or analyze an email to see if it’s spam. The very basic approach is to tokenize the text into words, then pass off those words into a classifier that you’ve trained with a set of already-tagged examples. The classifier then returns the best fitting tag for the text.
For recipes, we already have well-defined fields we need to extract. There will be ingredients, steps, nutrition, servings, prep time and possibly a couple more. We just need to train a classifier to tell the difference by passing it some examples we’ve done manually. After a bit of research and testing, I chose to go with a Maximum Entropy classifier because it seemed to fit the project best and was very accurate.
A basic approach might look something like this:
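Here's a minimal sketch of that approach with NLTK's MaxentClassifier. The training examples are made up, and the real training set would hold far more hand-tagged lines:

```python
from nltk.classify import MaxentClassifier

def bag_of_words(text):
    """A presence-of-word feature dict -- the classic 'bag of words' approach."""
    return {word: True for word in text.lower().split()}

# A tiny, made-up training set of hand-tagged lines.
train = [
    (bag_of_words("2 cups flour, sifted"), "ingredient"),
    (bag_of_words("1 teaspoon vanilla extract"), "ingredient"),
    (bag_of_words("Heat the oven to 350 degrees."), "step"),
    (bag_of_words("Whisk the eggs and sugar until pale."), "step"),
    (bag_of_words("Each serving: 210 calories"), "nutrition"),
]

# algorithm="megam" is far faster if the MegaM binary is installed;
# the pure-Python IIS algorithm works everywhere but is slow on real data.
classifier = MaxentClassifier.train(train, algorithm="iis", max_iter=10, trace=0)

print(classifier.classify(bag_of_words("2 cups sugar")))
```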
The built-in Maximum Entropy classifier can take an exceedingly long time to train, but NLTK can interface with several external machine-learning applications to make that process much quicker. I was able to install MegaM on my Mac, with some modifications, and used it with NLTK to great effect.
Deeper analysis
But that’s just a beginning, and what is typically described as a “bag of words” approach. To put it simply, the classifier learns how to tag your text based on the frequency of some of the words. It doesn’t account for the order of the words, or common phrases or anything else. Using this method I was able to tag fields with slightly more than 90% accuracy, which is pretty good. But we can do better.
If you think about how a recipe is written, there are more differences between the fields than the individual words like “butter” or “fry.” There might be common phrases like “heat the oven” or “at room temperature.”
There also might be differences in the grammar. For example, how can you correctly tag “Mini ricotta latkes with sour cherry sauce” as a recipe title and not an ingredient? Ingredients might have a reasonably predictable mix of adjectives, nouns and proper nouns while steps might have more verbs and determiners. A title would rarely have a pronoun but could include prepositions fairly often.
NLTK comes with a few methods to make this type of analysis much easier. It has a great part-of-speech tagger, for instance, as well as functions for pulling bi-grams and tri-grams (two- and three-word phrases) out of blocks of text. You can easily write a function that tokenizes text into sentences, then words, then tri-grams and parts of speech. Feed all of that into your classifier and you can tag text much more accurately.
It could look something like this:
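Here's a sketch of such a feature extractor. The pos_tagger argument is a hypothetical hook: pass nltk.pos_tag there once the "averaged_perceptron_tagger" data package is downloaded, and the POS features are left optional so the sketch stays self-contained:

```python
from nltk.util import ngrams

def richer_features(line, pos_tagger=None):
    """Word, bigram and trigram presence features, plus optional POS tags.

    pos_tagger should map a list of words to (word, tag) pairs,
    e.g. nltk.pos_tag once its data package is installed.
    """
    words = line.lower().split()
    features = {w: True for w in words}
    # Add two- and three-word phrases as features.
    for n in (2, 3):
        for gram in ngrams(words, n):
            features[" ".join(gram)] = True
    # Optionally add part-of-speech tags as features.
    if pos_tagger is not None:
        for _, tag in pos_tagger(words):
            features["pos:" + tag] = True
    return features
```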
Wrapping it up
Using a combination of these methods I was able to pull recipes out of news stories very successfully. To get the classifier working really well you need to train it on a large, random sample of your data.
I parsed about 10 or 20 records by hand to get started, then created a small Django app to randomly load a record and attempt to parse it. I corrected the tags that were wrong, saved the correct version to a database, and periodically retrained the classifier using the new samples. I ended up with a couple hundred parsed records, and the classifier (which has some built-in methods for testing) was about 98% accurate.
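The built-in testing the classifier offers can be sketched with nltk.classify.util.accuracy on a held-out set. The samples below are made up, and Naive Bayes stands in here for the Maximum Entropy classifier to keep the example fast:

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

def bag_of_words(text):
    return {word: True for word in text.lower().split()}

# Made-up hand-tagged samples, split into training and held-out sets.
tagged = [
    (bag_of_words("2 cups flour"), "ingredient"),
    (bag_of_words("1 cup sugar"), "ingredient"),
    (bag_of_words("3 cups milk"), "ingredient"),
    (bag_of_words("heat the oven to 350"), "step"),
    (bag_of_words("stir the batter well"), "step"),
    (bag_of_words("bake the cake for 30 minutes"), "step"),
]
train, held_out = tagged[:4], tagged[4:]

classifier = NaiveBayesClassifier.train(train)
score = accuracy(classifier, held_out)  # fraction of held-out lines tagged correctly
print(score)
```

In practice you would retrain on all of your corrected samples and hold out a random slice each time to keep the accuracy estimate honest.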
I wrote a parsing script that incorporated some regular expressions and a bit of if/else logic to try to tag as much as I could from formatting, then used NLTK to tag the rest. After the tagging, the story still had to be assembled into one or more discrete recipes and loaded into a database so that humans could review them.
That process was relatively straightforward, but I did have to build a custom admin for a small group of people to compare the original record and parsed output side by side. In the end every record had to be reviewed by hand, and many of them needed one or more small tweaks. Only about one in 20 had structural problems. A big thanks to Maloy Moore, Tenny Tatusian and the Food section staff for combing through all of the records by hand. Computers can really only do so much.
If you want to learn more I highly recommend the book Natural Language Processing with Python, which I read before embarking on this project.
Anthony Pesce posted this on Dec. 10, 2013, at 1:45 p.m. Anthony started at The Times in 2009. He builds news applications, data visualizations and interactive graphics, and conducts analysis for reporting projects. He lives in Los Feliz and grew up in Sacramento.