Data Desk
Maps, databases, analysis and visualization

Database: The Times California Cookbook website.
By Anthony Pesce

Natural Language Processing is a field that covers computer understanding and manipulation of human language, and it’s ripe with possibilities for newsgathering. You usually hear about it in the context of analyzing large pools of legislation or other document sets, attempting to discover patterns or root out corruption. I decided to take it into the kitchen for my latest project: The Times California Cookbook recipe database.

The first phase of the project, the holiday edition, launched with more than 600 holiday-themed recipes from The Times Test Kitchen. It’s a large number, but there’s much more to come next year – we have close to another 5,000 recipes staged and nearly ready to go.

With only four months between the concept stage of the site and launch, the Data Desk had a tight time frame and limited resources to complete two parallel tasks: build the website and prepare the recipes for publication. The biggest challenge was preparing the recipes, which were stored in The Times library archive as, essentially, unstructured plain text. Parsing thousands of records by hand was unmanageable, so we needed a programmatic solution to get us most of the way there.

We had a pile of a couple thousand records – news stories, columns and more – and each record contained one or more recipes. We needed to do the following:

  1. Separate the recipes from the rest of the story, while keeping the story intact for display alongside the recipe later.
  2. Determine how many recipes there were – more than one in many cases, and counts up to a dozen weren’t particularly unusual.
  3. For each recipe, find the name, ingredients, steps, prep time, servings, nutrition and more.
  4. Load these into a database, preserving the relationships between the recipes that ran together in the newspaper.

Where to start?

The well-worn path here at the Data Desk would be to write a parser that looks for common patterns in formatting and punctuation. You can break up the text line by line, then look for one or more regular expression matches on each line. It might go something like this:
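Here’s a minimal sketch, with illustrative patterns rather than the ones the project actually used:

```python
import re

# Illustrative guesses at common recipe formatting -- not the
# actual expressions The Times used.
LINE_PATTERNS = {
    "servings": re.compile(r"^(serves|makes|yields?)\b", re.IGNORECASE),
    "prep_time": re.compile(r"^(total|active) time\b", re.IGNORECASE),
    "ingredient": re.compile(
        r"^\d[\d/ ]*(cups?|tablespoons?|teaspoons?|pounds?|ounces?)\b",
        re.IGNORECASE),
    "step": re.compile(r"^\d+\.\s"),
}

def tag_line(line):
    """Return the first recipe field whose pattern matches, else None."""
    line = line.strip()
    for field, pattern in LINE_PATTERNS.items():
        if pattern.search(line):
            return field
    return None

def tag_story(text):
    """Pair each nonblank line of a story with a guessed field."""
    return [(tag_line(line), line)
            for line in text.splitlines() if line.strip()]
```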

Then you can attempt to tag each line of the story with a recipe field – description, name, ingredient, step, nutrition, etc. – and write another script to assemble those parts into recipes that can be loaded into a database.

After looking at a few records, it was immediately evident we wouldn’t be able to use pure regular expressions to parse them. We had decided to try to grab all of the recipes The Times had published from the year 2000 to the present, and there were enormous differences in the formatting and structure over the years. We needed natural language processing and machine learning to parse it.

Enter NLTK

Natural language processing is a big field, and you can do a lot with it – the vast majority of which I will not cover here. Python, my programming language of choice, has an excellent library for natural language processing and machine learning called the Natural Language Toolkit, or NLTK, which I primarily used for this process.

One of the more common uses of NLTK is tagging text. You could, for example, have it tag a news story with topics or analyze an email to see if it’s spam. The very basic approach is to tokenize the text into words, then pass those words to a classifier that you’ve trained with a set of already-tagged examples. The classifier then returns the best-fitting tag for the text.

For recipes, we already have well-defined fields we need to extract. There will be ingredients, steps, nutrition, servings, prep time and possibly a couple more. We just need to train a classifier to tell the difference by passing it some examples we’ve tagged manually. After a bit of research and testing, I chose to go with a Maximum Entropy classifier because it seemed to fit the project best and was very accurate.

A basic approach might look something like this:
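Here’s a minimal sketch, assuming NLTK and its “punkt” tokenizer models are installed; the tiny training set is a stand-in for the records we parsed by hand:

```python
import nltk  # one-time setup: nltk.download("punkt")

def word_features(text):
    """Bag-of-words feature set: mark each token as present."""
    return {word.lower(): True for word in nltk.word_tokenize(text)}

# Stand-in examples; the real training set came from hand-parsed records.
training_data = [
    ("1 cup sugar", "ingredient"),
    ("2 tablespoons unsalted butter, softened", "ingredient"),
    ("Heat the oven to 350 degrees.", "step"),
    ("Stir until the sugar dissolves completely.", "step"),
    ("Each serving: 210 calories; 3 grams protein", "nutrition"),
    ("Total time: 45 minutes", "prep_time"),
]

train_set = [(word_features(text), label) for text, label in training_data]
classifier = nltk.classify.MaxentClassifier.train(train_set, trace=0)

print(classifier.classify(word_features("1 cup flour")))  # likely "ingredient"
```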

The built-in Maximum Entropy classifier can take an exceedingly long time to train, but NLTK can interface with several external machine-learning applications to make that process much quicker. I was able to install MegaM on my Mac, with some modifications, and used it with NLTK to great effect.
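Swapping in the external trainer is a one-argument change. A sketch, assuming a compiled megam binary on your machine (the path below is hypothetical):

```python
from nltk.classify import MaxentClassifier, megam

# Tell NLTK where the MegaM binary lives (hypothetical path).
megam.config_megam("/usr/local/bin/megam")

# Same train_set as the previous sketch, but trained by MegaM,
# which is far faster than the pure-Python algorithms.
classifier = MaxentClassifier.train(train_set, algorithm="megam", trace=0)
```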

Deeper analysis

But that’s just a beginning, and what is typically described as a “bag of words” approach. To put it simply, the classifier learns how to tag your text based on the frequency of some of the words. It doesn’t account for the order of the words, common phrases or anything else. Using this method I was able to tag fields with slightly more than 90% accuracy, which is pretty good. But we can do better.

If you think about how a recipe is written, there are more differences between the fields than the individual words like “butter” or “fry.” There might be common phrases like “heat the oven” or “at room temperature.”

There also might be differences in the grammar. For example, how can you correctly tag “Mini ricotta latkes with sour cherry sauce” as a recipe title and not an ingredient? Ingredients might have a reasonably predictable mix of adjectives, nouns and proper nouns, while steps might have more verbs and determiners. A title would rarely have a pronoun but could include prepositions fairly often.

NLTK comes with a few methods to make this type of analysis much easier. It has a great part-of-speech tagger, for instance, as well as functions for pulling bi-grams and tri-grams (two- and three-word phrases) out of blocks of text. You can easily write a function that tokenizes text into sentences, then words, then tri-grams and parts of speech. Feed all of that into your classifier and you can tag text much more accurately.

It could look something like this:
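Here’s a sketch of those layered features, reusing training_data from the earlier example; nltk.pos_tag additionally needs the tagger models (nltk.download("averaged_perceptron_tagger")):

```python
import nltk

def rich_features(text):
    """Build features from words, bi-grams, tri-grams and
    part-of-speech tags instead of word frequency alone."""
    features = {}
    for sentence in nltk.sent_tokenize(text):
        words = nltk.word_tokenize(sentence)
        for word in words:
            features["word:%s" % word.lower()] = True
        for pair in nltk.bigrams(words):
            features["bigram:%s %s" % pair] = True
        for triple in nltk.trigrams(words):
            features["trigram:%s %s %s" % triple] = True
        # The part-of-speech mix helps separate noun-heavy titles
        # and ingredients from verb-heavy steps.
        for _, pos in nltk.pos_tag(words):
            features["pos:%s" % pos] = True
    return features

train_set = [(rich_features(text), label) for text, label in training_data]
classifier = nltk.classify.MaxentClassifier.train(train_set, trace=0)
```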

Wrapping it up

Using a combination of these methods I was able to pull recipes out of news stories very successfully. To get the classifier working really well you need to train it on a large, random sample of your data.

I parsed about 10 or 20 records by hand to get started, then created a small Django app to randomly load a record and attempt to parse it. I corrected the tags that were wrong, saved the correct version to a database, and periodically retrained the classifier using the new samples. I ended up with a couple hundred parsed records, and the classifier (which has some built-in methods for testing) was about 98% accurate.
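The built-in testing amounts to checking the classifier against hand-tagged examples it never saw in training; a sketch, with stand-in held-out data:

```python
import nltk

# Hand-tagged examples excluded from training (stand-ins here).
held_out_data = [
    ("1/2 cup chopped walnuts", "ingredient"),
    ("Bake until golden brown, about 25 minutes.", "step"),
]

test_set = [(rich_features(text), label) for text, label in held_out_data]
print(nltk.classify.accuracy(classifier, test_set))
```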

I wrote a parsing script that incorporated some regular expressions and a bit of if/else logic to try to tag as much as I could from formatting, then used NLTK to tag the rest. After the tagging, the story still had to be assembled into one or more discrete recipes and loaded into a database so that humans could review them.
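Stitched together, that pass might fall back from the cheap formatting rules to the classifier, along the lines of this sketch (reusing tag_line and rich_features from the earlier sketches):

```python
def tag_line_combined(line, classifier):
    """Try the formatting-based regex patterns first; if nothing
    matches, let the trained classifier decide the field."""
    field = tag_line(line)  # regex pass from the first sketch
    if field is None:
        field = classifier.classify(rich_features(line))
    return field
```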

That process was relatively straightforward, but I did have to build a custom admin for a small group of people to compare the original record and parsed output side by side. In the end every record had to be reviewed by hand, and many of them needed one or more small tweaks. Only about one in 20 had structural problems. A big thanks to Maloy Moore, Tenny Tatusian and the Food section staff for combing through all of the records by hand. Computers can really only do so much.

If you want to learn more, I highly recommend the book Natural Language Processing with Python, which I read before embarking on this project.

Anthony Pesce posted this on Dec. 10, 2013, at 1:45 p.m. Anthony started at the Times in 2009. He builds news applications, data visualizations and interactive graphics, and conducts analysis for reporting projects. He lives in Los Feliz and grew up in Sacramento.
