Roland Piquepaille's Technology Trends
How new technologies are modifying our way of life

 



Thursday, July 27, 2006
 

Text mining is a computing technique for extracting useful information from unstructured text, and it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from the University of California, Irvine (UCI), have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate professionals, and even by you.

Let's start with the introduction of this UCI news release -- and forget the marketing hype.

Performing what a team of dedicated and bleary-eyed newspaper librarians would need months to do, scientists at UC Irvine have used an up-and-coming technology to complete in hours a complex topic analysis of 330,000 stories published primarily by The New York Times.

Here is a quote from one of the researchers.

"We have shown in a very practical way how a new text mining technique makes understanding huge volumes of text quicker and easier," said David Newman, a computer scientist in the Donald Bren School of Information and Computer Sciences at UCI. "To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians."

Now, let's look at a real example of how the team discovered links between topics and people. Below is a graph showing "topic-model-based relationships between entities and topics. A link is present when the likelihood of an entity in a particular topic is above a threshold." (Credit: UCI)

[Graph: Discovering topics in the NYT archives]
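The rule quoted in the caption, a link exists when the likelihood of an entity in a topic is above a threshold, is easy to sketch in a few lines of Python. This is only an illustration and not the UCI team's code: the function name `build_links`, the topic labels, the probabilities and the threshold value are all invented for the example.

```python
def build_links(entity_given_topic, threshold):
    """Return (topic, entity) pairs whose likelihood exceeds the threshold.

    entity_given_topic maps a topic label to a dict of
    {entity: likelihood of that entity in the topic}.
    """
    links = []
    for topic, likelihoods in entity_given_topic.items():
        for entity, p in likelihoods.items():
            if p > threshold:
                links.append((topic, entity))
    return sorted(links)


# Invented numbers, for illustration only (not taken from the paper).
example = {
    "tour_de_france": {"Lance Armstrong": 0.07, "Jan Ullrich": 0.01},
    "brooklyn_apartments": {"Lance Armstrong": 0.0001},
}

print(build_links(example, threshold=0.005))
# [('tour_de_france', 'Jan Ullrich'), ('tour_de_france', 'Lance Armstrong')]
```

Raising the threshold prunes weak links and gives a sparser graph; lowering it connects more entities to more topics.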

Here is another example picked from the UCI news release.

For example, the model generated a list of words that included "rider," "bike," "race," "Lance Armstrong" and "Jan Ullrich." From this, researchers were easily able to identify that topic as the Tour de France. By examining the probability of words appearing in stories about the Tour de France, researchers learned that Armstrong was written about seven times as much as Ullrich.
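To see how a "seven times as much" comparison can fall out of a topic's word probabilities, here is a small Python sketch. The counts below are made up and were chosen only so that the ratio comes out to 7; they are not figures from the UCI study, and `word_probabilities` is a hypothetical helper name.

```python
from fractions import Fraction


def word_probabilities(word_counts):
    """Turn per-topic word counts into exact probabilities P(word | topic)."""
    total = sum(word_counts.values())
    return {word: Fraction(count, total) for word, count in word_counts.items()}


# Made-up counts of words assigned to a "Tour de France" topic.
tour_counts = {"rider": 40, "bike": 30, "race": 50, "armstrong": 70, "ullrich": 10}

probs = word_probabilities(tour_counts)
ratio = probs["armstrong"] / probs["ullrich"]
print(ratio)  # 7
```

The ratio of two word probabilities inside the same topic does not depend on the other words, because the shared total cancels out.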

But what exactly is 'topic modeling'?

Topic modeling looks for patterns of words that tend to occur together in documents, then automatically categorizes those words into topics. Older text-mining techniques require the user to come up with an appropriate set of topic categories and manually find hundreds to thousands of example documents for each category. This human-intensive process is called supervised learning. In contrast, topic modeling, a type of unsupervised learning, doesn't need suggestions for an appropriate set of topic categories or human-found example documents. This makes retrieving information easier and quicker.
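For readers who want to see the idea in code, here is a minimal Python sketch of one common way to fit a topic model: collapsed Gibbs sampling for an LDA-style model. I am assuming this flavor for illustration; it is not necessarily what the UCI team ran, and the function names, the toy documents and the parameter values (`alpha`, `beta`, the number of iterations) are all my own choices. Note that no categories and no labeled examples are supplied, only the documents and the number of topics.

```python
import random
from collections import Counter


def fit_topics(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Assign every word occurrence to a topic with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # all tokens in topic k
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # A topic is likely if this document uses it a lot
                # and if it already contains this word a lot.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return topic_word, doc_topic


def top_words(topic_word, n=3):
    """List the n most frequent words of each topic, skipping empty counts."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Toy corpus: two cycling stories and two fossil stories, with no labels.
docs = [
    "rider bike race rider bike".split(),
    "bike race rider race".split(),
    "dinosaur bones fossil bones".split(),
    "fossil dinosaur bones museum".split(),
]

topic_word, doc_topic = fit_topics(docs, num_topics=2)
print(top_words(topic_word))
```

Because the sampler is random, the exact word lists depend on the seed and on the number of iterations. On a corpus this tiny you would hope to see the cycling words and the fossil words land in different topics, but the only thing the user has to decide in advance is how many topics to look for.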

Newman and his colleagues presented this research at the IEEE Intelligence and Security Informatics Conference (ISI 2006), which was held in May in San Diego. Here is a link to their technical paper, "Analyzing Entities and Topics in News Articles Using Statistical Topic Models" (PDF format, 12 pages, 248 KB). The graph above was taken from this paper.

For more information about the topic modeling technique used by these scientists, you can look at the work of Mark Steyvers and his Memory and Decisions Laboratory (MADLAB).

In particular, you can try the software available from this Topic Modeling Toolbox. And since you probably don't have the archives of the New York Times at your disposal for experiments, start with something smaller and see what kind of topics you discover -- using the contents of this blog, for example.

Sources: University of California - Irvine, July 26, 2006; and various web sites





© Copyright 2007 Roland Piquepaille.
Last update: 01/04/2007; 19:43:28.

