Roland Piquepaille's Technology Trends
How new technologies are modifying our way of life

 



Monday, December 18, 2006
 

If your best friend's name is Bill Gates, you will probably have trouble finding him online with a search engine: too many results will point you to the richest man on the planet. In the scientific world, things can be even worse. Imagine a researcher named 'John Doe' who has published in several journals, each with a different name-formatting policy. His name might appear as 'John Doe,' 'Doe John,' 'J. Doe,' 'Doe J.' or even 'Doe, J.' How would you find the papers he really wrote without knowing which university he works for? Now, computer scientists at Penn State University have developed a system that solves the 'who is J. Smith' puzzle. They found a way to 'disambiguate' authors with similar names that works quite well: their system was able to identify the authors in more than 90% of the papers written by almost 500 different authors.
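To make the formatting problem concrete, here is a minimal Python sketch of a "blocking key" that treats all five spellings above as candidates for the same person. The function name `blocking_key` and the idea of using sorted initials are illustrative assumptions, not something taken from the Penn State system, and the sketch only handles unaccented ASCII letters.

```python
import re

def blocking_key(name):
    """Return the sorted initials of every name token (order-insensitive).

    Hypothetical helper for illustration only: it ignores punctuation and
    word order, so 'John Doe' and 'Doe, J.' produce the same key.
    """
    tokens = re.findall(r"[A-Za-z]+", name)
    return tuple(sorted(token[0].lower() for token in tokens))

variants = ["John Doe", "Doe John", "J. Doe", "Doe J.", "Doe, J."]
keys = {blocking_key(v) for v in variants}  # all five variants share one key

# The key is deliberately loose: a different person collides with it too.
collision = blocking_key("Jane Doe") == blocking_key("Doe, J.")
```

Note that 'Jane Doe' lands on the same key as 'Doe, J.', which is exactly why matching on the name alone is not enough and further attributes are needed to tell authors apart.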

The development of this system was led by C. Lee Giles, professor at the College of Information Sciences and Technology, with the help of two doctoral students, Jian Huang and Seyda Ertekin.

Here is a brief explanation of how the system works.

"The system works by using machine-learning methods to cluster together names that the system believes to be similar. If you think there’s another parameter that’s relevant, you can change the algorithm and include it," Giles said.

In the figure below, you can see the process used for name disambiguation. Given a research paper, each author appearance in the paper is associated with a metadata record consisting of a set of attributes. The goal is to find a function that maps these attributes to a single person. (Credit: Jian Huang, Seyda Ertekin, C. Lee Giles, Penn State)

Process used for name disambiguation
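The idea of a metadata record per author appearance can be sketched in Python as follows. The attribute names (`affiliation`, `coauthors`), the equal weights, and the hand-written Jaccard score are all assumptions made for illustration; the article does not list the attributes used, and the actual system relies on machine learning rather than a fixed formula.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AuthorRecord:
    """One author appearance in one paper (hypothetical attribute set)."""
    name: str
    affiliation: str = ""
    coauthors: frozenset = field(default_factory=frozenset)

def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b|, or 0.0 if both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def similarity(r1, r2):
    """Score in [0, 1]; higher means more likely the same person.

    Equal weighting of affiliation match and co-author overlap is an
    arbitrary choice for this sketch, not a learned model.
    """
    same_affiliation = (
        1.0
        if r1.affiliation and r1.affiliation.lower() == r2.affiliation.lower()
        else 0.0
    )
    return 0.5 * same_affiliation + 0.5 * jaccard(r1.coauthors, r2.coauthors)

a = AuthorRecord("J. Doe", "Penn State", frozenset({"A. Roe", "B. Poe"}))
b = AuthorRecord("Doe, J.", "penn state", frozenset({"A. Roe"}))
c = AuthorRecord("John Doe", "Elsewhere U.", frozenset({"C. Kim"}))

score_ab = similarity(a, b)  # same affiliation, half the co-authors shared
score_ac = similarity(a, c)  # nothing in common beyond the name
```

In this toy data, records `a` and `b` score much higher than `a` and `c`, which is the kind of signal a clustering step can then use to group appearances by person.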

This second figure shows the system architecture, starting with the metadata extraction module, which extracts the author metadata records from each paper, and ending with the DBSCAN module, which builds clusters of papers by different authors. (Credit: Jian Huang, Seyda Ertekin, C. Lee Giles, Penn State)

Disambiguation system architecture
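For readers unfamiliar with DBSCAN, here is a small pure-Python sketch of the density-based clustering idea applied to toy data. The article names DBSCAN but gives no parameters, so the values of `eps` and `min_pts`, the co-author distance, and the sample papers are all assumptions chosen for illustration.

```python
def dbscan(points, distance, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Every point within eps of point i, including i itself.
        return [j for j in range(len(points))
                if distance(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: join but do not expand
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

def coauthor_distance(a, b):
    """1 - Jaccard overlap of two co-author sets (assumed distance)."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)

# Toy co-author sets for six papers all signed 'J. Doe' (made-up data).
papers = [
    {"roe", "poe"}, {"roe", "poe", "lee"}, {"roe", "lee"},
    {"kim", "park"}, {"kim", "park", "cho"},
    {"xu"},
]
labels = dbscan(papers, coauthor_distance, eps=0.5, min_pts=2)
print(labels)  # → [0, 0, 0, 1, 1, -1]
```

The first three papers are chained into one cluster through shared co-authors, the next two form a second cluster, and the last paper is left as noise. One appeal of a density-based method here is that the number of distinct authors does not have to be known in advance.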

For more information, you can read the paper by Giles, Huang and Ertekin, which was presented at the recent 17th European Conference on Machine Learning and 10th European Conference on Principles and Practice of Knowledge Discovery in Databases in Berlin. The paper is titled "Efficient Name Disambiguation for Large-Scale Databases" (PDF format, 9 pages, 568 KB).

Sources: Penn State University news release, December 14, 2006; and various websites



10:19:01 PM


© Copyright 2007 Roland Piquepaille.
Last update: 01/04/2007; 19:46:00.



