Technology Trends

Search

Sifting Rapidly Through Petabytes of Data


Searching within large databases has never been easy. But when it comes to physics, and especially to experiments with particle colliders, the task becomes extremely difficult. You have to look at hundreds of millions of particle collisions to isolate only a few dozen of interest. And when you realize that all these individual records are stored in data files and systems scattered all over the world, it becomes clear that the search process is a tough challenge. But now, a technology known as the Word-Aligned Hybrid (WAH) compression method, developed at Lawrence Berkeley National Laboratory (LBNL), is dramatically speeding up the searching process. For example, it took only 15 minutes to retrieve 80 events recorded in 2001 and hidden like needles in a haystack of information inside petabytes of data. But read more…


Here is how LBNL describes how the WAH method is used.


WAH is currently used in a software package called FastBit to compress bitmap indexes. A bitmap index is a method of reducing the response time of queries involving common types of conditions in data objects, such as “state = CA” and “age >= 21.” It achieves this by storing certain pre-computed answers as bitmaps. For example, a bitmap index for “state” might have one bitmap for each state in the U.S. Because computers can manipulate bitmaps efficiently, bitmap indices are efficient in searching for interesting records in large datasets.

WAH compression makes the bitmap index optimal in terms of computational complexity. A small number of the most efficient indexing schemes have this optimality property. What makes the new technology unique is that WAH-compressed indexes significantly outperform other schemes in tests.

“In tests conducted using actual data from high-energy physics experiments, we confirmed that our FastBit software is an order of magnitude faster than the best-known bitmap indexing schemes on average,” according to John Wu, the lead developer of FastBit.
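To make the bitmap index idea from the excerpt above more concrete, here is a minimal sketch in Python. It is not FastBit code: the function names, the sample rows and the use of plain Python integers as bitmaps are all assumptions made for illustration. It precomputes one bitmap per state and one for "age >= 21," then answers a combined query with a single bitwise AND.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return dict(index)

def rows_of(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical sample table (invented for this sketch).
states = ["CA", "NY", "CA", "TX", "NY", "CA"]
ages = [19, 34, 25, 21, 18, 40]

# One precomputed bitmap per state.
state_index = build_bitmap_index(states)

# One precomputed bitmap for the range condition "age >= 21".
adult_bitmap = 0
for row, age in enumerate(ages):
    if age >= 21:
        adult_bitmap |= 1 << row

# "state = CA AND age >= 21" becomes a single bitwise AND.
hits = rows_of(state_index["CA"] & adult_bitmap)
print(hits)
```

The query itself never touches the original rows; it only combines answers that were computed in advance, which is why this kind of index pays off on very large datasets.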

Of course, the key here was to build the compressed indexing system.


A number of specialized compression schemes have been proposed to process compressed indexes efficiently, with the best-known one called the Byte-aligned Bitmap Code (BBC).

The goal of the Berkeley Lab project was to create an indexing system that could be compressed and at the same time offers much faster searches than existing methods. To achieve this goal, the WAH compression scheme was developed. While WAH-compressed indexes are slightly larger than BBC-compressed indexes, the time needed to process a query is less, often much less.
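For readers who wonder what "word-aligned" compression can look like, here is a simplified sketch. It follows the general idea of WAH as I understand it (31 payload bits per 32-bit word, with runs of identical groups collapsed into a single "fill" word), but it is not the FastBit implementation. The names `wah_encode` and `wah_decode` are mine, and padding the last group with zeros is a shortcut I chose for brevity.

```python
GROUP = 31                    # payload bits carried by each 32-bit word
ALL_ONES = (1 << GROUP) - 1   # a group made only of 1 bits
MAX_RUN = (1 << 30) - 1       # largest run length a fill word can hold

def wah_encode(bits):
    """Encode a list of 0/1 values into a list of 32-bit words."""
    words = []
    padded = bits + [0] * (-len(bits) % GROUP)   # assumption: pad the tail with zeros
    for start in range(0, len(padded), GROUP):
        group = 0
        for b in padded[start:start + GROUP]:
            group = (group << 1) | b
        if group in (0, ALL_ONES):
            # Fill word: top bit 1, next bit is the fill value, low 30 bits count groups.
            header = (1 << 31) | ((1 if group else 0) << 30)
            last = words[-1] if words else 0
            if last >> 30 == header >> 30 and (last & MAX_RUN) < MAX_RUN:
                words[-1] = last + 1             # extend the current run
            else:
                words.append(header | 1)         # start a new run of one group
        else:
            words.append(group)                  # literal word: top bit is 0
    return words

def wah_decode(words, n_bits):
    """Expand the words back into the first n_bits values."""
    bits = []
    for w in words:
        if w >> 31:                              # fill word
            bits.extend([(w >> 30) & 1] * (GROUP * (w & MAX_RUN)))
        else:                                    # literal word
            bits.extend((w >> s) & 1 for s in range(GROUP - 1, -1, -1))
    return bits[:n_bits]

# A sparse bitmap: 203 bits, only three of them set.
original = [1] + [0] * 200 + [1, 1]
encoded = wah_encode(original)
print(len(encoded), "words instead of", -(-len(original) // 32))
```

Because every encoded unit is a whole machine word, a query can combine two compressed bitmaps word by word without unpacking individual bytes, which is consistent with the trade-off described above: slightly larger indexes than BBC, but faster processing.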

Now, let’s look in more detail at the Grid Collector, the software used to analyze the petabytes of data generated each year by the STAR (Solenoidal Tracker at RHIC) high-energy physics experiment.


First, here is a link to a paper called “The Grid Collector: Using an Event Catalog to Speedup User Analysis in Distributed Environment” (PDF format, 4 pages, 251 KB).


This research work will also be presented next June at the International Supercomputer Conference in Heidelberg, Germany, where it has been selected as one of the three best papers.


Here is a link to the abstract of this paper named “Grid Collector: Facilitating Efficient Selective Access from Data Grids.” Below is an excerpt.


Since most analysis jobs filter out significant number of events, a considerable amount of time is wasted by reading the unwanted events. The Grid Collector removes this inefficiency by allowing users to specify more precisely what events are of interest and to read only the selected events. This speeds up most analysis jobs. In existing analysis frameworks, the responsibility of bringing files from tertiary storage to disk falls on the users. This forces most of analysis jobs to be performed at centralized computer facilities where commonly used files are kept on disks.

Finally, the researchers filed an application for a U.S. patent, which was granted on December 14, 2004 under the name “Word aligned bitmap compression method, data structure, and apparatus.” And here is a link to this patent, number 6,831,575.


Sources: Lawrence Berkeley National Laboratory news release, May 16, 2005; and various websites


Related stories can be found in the following categories.



  • Databases

  • Physics

  • Search

  • Software


It’s So ‘Ginormous’ that I’m ‘Confuzzled’

Merriam-Webster, the dictionary publisher, recently asked readers of its web site to submit their favorite words that are not yet in the dictionary. After receiving about 3,000 submissions, the company published a top ten list of non-existing words. This list is dominated by ‘ginormous’ (bigger than gigantic and bigger than enormous) and by ‘confuzzled’ (confused and puzzled at the same time). However, a search on Google reveals a different story. Read more…


A search for ‘ginormous’ brings 70,900 results while a look at ‘confuzzled’ returns 48,300 items, even if in reality you still can’t look at more than a thousand results.


Number 3 on the Merriam-Webster list, ‘woot’ (an exclamation of joy or excitement), is the clear winner on Google, with 717,000 results. But this is not surprising given the popularity of the Woot web site, where each item is on sale for a single day only.


‘Chillax’ (chill out/relax, hang out with friends) and ‘snirt’ (dirty snow) are numbers 4 and 9 on the Merriam-Webster list, and are respectively mentioned 21,700 and 14,900 times by Google.


After these five words, the numbers fall dramatically.


The number 5 on the Merriam-Webster list, ‘cognitive displaysia’ (the feeling you have before you even leave the house that you are going to forget something), has only been found in 42 documents by Google.


And one of my favorites on this list at number 7, ‘phonecrastinate’ (what you do when you check the caller ID box before answering the phone), is only mentioned 45 times by Google. So I guess there are not many ‘phonecrastinators.’


Other words fare a little bit better: for instance, ‘lingweenie’ (a person incapable of producing neologisms), which is number 10 on the list of the dictionary publisher, is featured in 466 documents found by Google.


Unlike ‘lingweenies,’ ‘vocabularians’ are people who make up new words. And today you can find 2,040 references to this word on Google.


So, as a very unscientific conclusion, there are about four times more people able to create new words than people who can’t. This is refreshing.


Sources: Roland Piquepaille, with various websites


Related stories can be found in the following categories.



  • Books

  • Education

  • Humor

  • Search


Another Look at Computer-Generated Scientific Papers

Like many of you, I had a good laugh a month ago when I read that some students at MIT submitted a computer-generated ‘scientific’ paper to a computer conference which accepted it, at least in a first step. (See ‘Prank research paper makes the grade’ for example.) But now, I’m not laughing anymore. Imagine that 100,000 people around the world use this Automatic CS Paper Generator to generate a fake paper and keep it online. In our world of ‘permanent’ information, what will happen in five years when someone uses a search engine looking for keywords contained in the title of these fake papers? One of these papers may appear high in the list of results and this person may use this computer-generated paper as a basis for one of his projects. Scary, isn’t it? Read more…


Let’s first go back to the original story in case you missed it.


On April 15, 2005, the MIT News office wrote that some MIT computer science students were so tired of seeing their papers rejected by conference organizers that they began to doubt the standards used to accept or refuse a paper. (See the link above for more details.)


So they decided to have some fun: they wrote software that generates meaningless research papers and submitted the results to different organizations.


One of their computer-generated papers, “Rooter: A Methodology for the Typical Unification of Access Points and Redundancy” (PDF format, 4 pages, 92 KB), was initially accepted by the World Multi-Conference on Systemics, Cybernetics and Informatics 2005 (WMSCI 2005) as a non-reviewed paper, and later rejected.


Now, let’s have some fun and build a meaningless computer science paper. It’s very easy. On the site mentioned above, you just fill in the names of one to five authors and submit your request. That’s all you have to do!


As an example, here are two fake papers that I ‘co-authored’ with some well-known people in the computer industry.



  • “Boolean Logic Considered Harmful” by Linus Torvalds, Bill Gates and Roland Piquepaille (PDF format, 6 pages, 95 KB)

  • “A Study of XML Using AcridLamb” by Paul Otellini, Roland Piquepaille and Hector Ruiz (PDF format, 3 pages, 46 KB)

It’s pretty easy to imagine a group of people, with fun or evil intentions, linking to such a computer-generated document in order to see it ranked high by search engines. If enough people link to the first document mentioned above, a Google search for ‘boolean’ and ‘harmful’ will soon return this fake document as its #1 result.


Of course, I don’t see why people would do that. And the probability that a real paper was co-authored by Bill Gates and Linus Torvalds is very low, so I don’t think anyone will think it’s a genuine document.


But lots of ‘phishing’ attacks these days show that people are more gullible than we might think.


So, do hundreds of thousands of fake computer science papers sitting online represent a danger or not? Time will tell, but please let me know what you think.


Sources: Roland Piquepaille, with various websites


Related stories can be found in the following categories.



  • Computers

  • Science

  • Search

  • Software


After PageRank, Here Comes LexRank

Today, if you want to know what’s going on in the world, you can watch TV, read your newspaper or use the Internet to browse news sites. But imagine a day when you just have to enter a few words on your computer, such as “Olympic Games,” push a button, and be able to read an automatic, and accurate, summary of what appears in major sources about this specific subject. This is the goal of a project which started at the University of Michigan and is explained by Technology Research News in “Summarizer ranks sentences.” This new multi-document summarization technique, named LexRank, searches for similarities among sentences and rates them via a concept of ‘prestige score’ analogous to the one used by Google’s PageRank. “In a sense, sentences vote for each other just by virtue of being similar to each other,” said one of the researchers. This algorithm may also be applied to automatic translation and question answering in a year or two. Read more…


Let’s start with a description of the project.


Researchers from the University of Michigan have developed a multi-document summarization technique that compares sentences and has the effect of sentences voting for the most important among them. The method, dubbed LexRank, combines the content-sorting concepts of prestige and lexical similarity to find the most important sentences in a group of documents on the same subject.

Algorithms that use prestige to sort information have been around since the ’90s. It is possible to find the most prestigious, or popular member of a network by analyzing the relationships among network members. In a social network, for example, the most prestigious individual can be identified by analyzing the social relations among all pairs of members of the group.

Now, let’s look in more detail at how the LexRank algorithm uses similarities among sentences.


The researchers’ lexical centrality algorithm compares the lexical similarity of sentences. “Lexical similarity can be thought of as a measure of the word overlap between two sentences,” said Gunes Erkan [, one of the researchers.] “For example, ‘Bush went to China’ and ‘George Bush visited China’ are fairly similar in a lexical way [but] ‘Bush visited China’ and ‘Blair is the prime minister of the United Kingdom’ have no overlap at all,” he said.

The researchers’ system considers a sentence important if it is similar to many other sentences and if those other sentences are themselves important. “In a sense, sentences vote for each other just by virtue of being similar to each other,” said Dragomir Radev [, an assistant professor at the University of Michigan.] “The sentences with the highest scores… are considered to contain the gist of the document and are presented as the multi-document summary,” he said.
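The "sentences vote for each other" idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the researchers' code: I use a plain word-count cosine similarity, an arbitrary similarity threshold of 0.1 and a PageRank-style damping factor of 0.85, and the function names are hypothetical. The example sentences come from the quotes above.

```python
import math
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase the sentence and split it into words."""
    return re.findall(r"[a-z']+", sentence.lower())

def cosine(a, b):
    """Word-overlap similarity between two token lists (0 means no shared word)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def lexrank(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences: similar sentences vote for each other, PageRank style."""
    tokens = [tokenize(s) for s in sentences]
    n = len(sentences)
    # Link two different sentences when their similarity exceeds the threshold.
    links = [[1.0 if i != j and cosine(tokens[i], tokens[j]) > threshold else 0.0
              for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in links]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Each sentence shares its current score equally among its neighbours.
        scores = [(1 - damping) / n
                  + damping * sum(links[j][i] * scores[j] / degree[j]
                                  for j in range(n) if degree[j])
                  for i in range(n)]
    return scores

sentences = [
    "Bush went to China",
    "George Bush visited China",
    "Bush visited China",
    "Blair is the prime minister of the United Kingdom",
]
scores = lexrank(sentences)
for sentence, score in zip(sentences, scores):
    print(f"{score:.4f}  {sentence}")
```

The three sentences about the visit to China support each other and end up with equal, high scores, while the sentence about Blair receives no votes and keeps only the small baseline score.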

This algorithm is already used for a Web-based news summarization site, NewsInEssence. Please note that this is an experiment and that the site is not always on. If you cannot access it from the previous link, try this one.


LexRank could have some other uses.


The researchers are also looking for other uses of the lexical centrality algorithm. Possibilities include automatic translation and question answering, said Radev. The method could potentially find sentences that are likeliest to contain the answer to a given natural language question, or, in the biomedical domain, sentences that are most likely to contain important facts like particular protein interactions, said Radev.

The research work was presented in July 2004 during the Empirical Methods in Natural Language Processing (EMNLP 2004) conference held in Barcelona, Spain. Please check the EMNLP 2004 Proceedings if you’re interested in the subject.


And for more information, here are links to two technical documents about LexRank, “LexPageRank: Prestige in Multi-Document Text Summarization” (PDF format, 7 pages, 84 KB) and “LexRank: Graph-based Lexical Centrality as Salience in Text Summarization” (PDF format, 23 pages, 272 KB).


Will LexRank become one day as popular as PageRank is today? We’ll know it in a year or two.


Sources: Kimberly Patch, Technology Research News, April 20/27, 2005; and various websites


Related stories can be found in the following categories.



  • Databases

  • Google

  • Internet

  • Search

  • Software


A New Way to Find Art with ‘ArtGarden’

‘ArtGarden’ is a new search engine developed by British Telecom (BT) and tested by Tate Online. In “Smart search lets art fans browse,” BBC News reports it allows you to browse the Tate’s collection depending on what you like or not. Instead of typing an artist’s name, you will be shown an initial selection of pictures of paintings or sculptures. When you click on one image, the artificial intelligence component of ‘ArtGarden’ will choose the next set of pictures to show you. This choice will be partially based on keywords associated with each work of art, but unknown to you, partially on your previous preferences, and finally on plain luck. This technology should soon become available online. With ‘ArtGarden,’ it will be like jumping randomly from one aisle of the museum to another. Neat…


Here is a general description of the technology.


The technology uses a system dubbed smart serendipity, which is a combination of artificial intelligence and random selection. It ‘chooses’ a selection of pictures, by scoring paintings based on a selection of keywords associated with them.

So, for instance a Whistler painting of a bridge may have the obvious keywords such as bridge and Whistler associated to it but will also widen the search net with terms such as aesthetic movement, 19th century and water. A variety of paintings will then be shown to the user, based partly on the keywords and partly on luck.

Like many other technologies, this one has a very personal origin.


For Richard Tateson, [a BT computing expert] who worked on the ArtGarden project, the need for a new way to search grew out of personal frustration. “I went to an online clothes store to find something to buy my wife for Christmas but I didn’t have a clue what I wanted,” he said.

The text-based search was restricted to looking either by type of garment or designer, neither of which he found helpful. He ended up doing his present shopping on the high street instead.

[Note: Yes, Tateson is his real name.]


BT gives additional details on the project in “Get what you want with ‘ArtGarden’,” starting with a description of the concept.


‘ArtGarden’ is designed for those of us who might not know the name of, say, every Dutch artist painting at the same time as Vermeer, but like his style and would like to see others in a similar vein. In other words, it’s designed for those times when you know what you like, would recognise it if you saw it, but can’t exactly describe it in words. ArtGarden takes advantage of broadband to show a range of art-works on screen, the choice of which may be quickly refined by the user.

Here are some more tidbits about the technology.


Behind the scenes ArtGarden uses artificial intelligence to keep track of each viewer’s preferences across a range of criteria. These ‘scores’ are used to bias the random selector, which then picks the next item to display.

This way of browsing increases the pleasure of online art viewing because it strikes a balance between focusing browsing towards specific personal ‘likes’ and introducing some serendipity — lucky finds that a more blinkered approach might miss. It’s also simple to use — all you have to do is indicate your preferences using your mouse.
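Here is a rough idea of how scores could "bias the random selector." BT has not published the details, so everything in this sketch is an assumption: the catalogue, the keywords, the `luck` constant that keeps every work in play, and the function names are all invented for illustration.

```python
import random
from collections import Counter

# Hypothetical catalogue: title -> curator keywords (all invented for this sketch).
ARTWORKS = {
    "bridge at night": {"bridge", "water", "19th century"},
    "river scene": {"water", "19th century"},
    "abstract no. 1": {"abstract", "20th century"},
}

def record_click(preferences, title, boost=1.0):
    """Raise the viewer's score for every keyword attached to the clicked work."""
    for keyword in ARTWORKS[title]:
        preferences[keyword] = preferences.get(keyword, 0.0) + boost

def pick_next(preferences, rng, luck=0.2):
    """Pick one work at random, biased by keyword scores; 'luck' keeps every work possible."""
    titles = list(ARTWORKS)
    weights = [max(sum(preferences.get(k, 0.0) for k in ARTWORKS[t]), 0.0) + luck
               for t in titles]
    return rng.choices(titles, weights=weights, k=1)[0]

preferences = {}
record_click(preferences, "bridge at night")   # the viewer liked the bridge painting
rng = random.Random(0)                          # fixed seed so the run is repeatable
counts = Counter(pick_next(preferences, rng) for _ in range(2000))
print(counts)
```

After a single click, works sharing keywords with the bridge painting come up far more often, yet the unrelated abstract work still appears from time to time. That small residual chance is the "serendipity" part.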

Finally, BT explains why the project is named ‘ArtGarden.’


The ArtGarden demonstration system allows online visitors to view a ‘garden’ or selection of art works, which are linked to others through a series of roots defined by the online curators or managers of collections. New items of interest are positively biased to bloom on the visitors’ screens as they select art according to their preference. Old or disliked items wither away from those displayed. After a few minutes’ browsing, the garden will be very different from the initial random selection.

Before ‘ArtGarden’ becomes available, you can search the Tate Collection with more traditional search engines, for example by searching by subject.


Sources: BBC News Online, January 28, 2005; and various websites


Related stories can be found in the following categories.




  • AI

  • Arts

  • Search

  • Software

  • Vision and Visualization


A Sentimental Education — for Software

Imagine you work for a company which introduces a new product. Obviously, you would want to know if the public likes it or not. But how would you find out? You could search the Web and read every possible document that mentions your product. This might be very time-consuming. Help is on the way, with software that will scan the Web for you and separate the positive and negative reviews. This software might be based on research done at Cornell University and described by Technology Research News in “Software sorts out subjectivity.” The researchers are improving ‘sentiment classification’ by removing neutral sentences. Their machine-learning method then applies only to subjective portions of the document. But the following negative statement, which contains only positive words, shows how difficult it is to classify a sentence as positive or negative: “If you think this laptop is a great deal, I’ve got a nice bridge you might be interested in.” It may take a decade before such a system is widely available. Read more…


Here is how Technology Research News introduces the problem of automatic sentiment classification.


One of the fundamental challenges in getting computers to sort and analyze text is finding ways to automatically classify information.

Applications like search engines that group similar documents do so using topic-based categories. Sentiment analysis techniques add another dimension by determining the author’s attitude about a topic rather than just identifying a topic.

Existing techniques tend to concentrate on finding words, phrases and patterns that indicate sentiment. This has proven difficult, however. “This laptop is a great deal”, for instance, shows strong sentiment, but contains the same words as the neutral sentence “The release of this new laptop drew a great deal of media attention.”

So how do you teach a computer to ‘understand’ the meaning of words?


Researchers from Cornell University have devised a way to improve sentiment classification that sidesteps having to deal with meaning by instead concentrating on context. Their method weeds out neutral sentences. “Getting rid of neutral sentences like ‘The release of this new laptop drew a great deal of media attention’ [makes] the overall sentiment more obvious,” said Lillian Lee, an associate professor of computer science at Cornell University.





[Diagram: how the software uses subjectivity detection to obtain a polarity classification. Credit: Bo Pang and Lillian Lee, Cornell University.]

Here are more details about the method.


The researchers represented text as a network, or graph. “Imagine that each sentence is represented by a network point, or node,” said Lee. To model contextual information between each pair of sentence nodes, the researchers added a link whose strength represented how much the two sentences deserved the same label (objective or subjective) based on criteria including how close the sentences are in the text, and whether they are separated by a paragraph boundary.

The model also took into consideration the evidence within a sentence that the sentence is subjective or objective. Possible evidence that a sentence is subjective, for example, includes the presence of a word like ‘wonderful’, or ‘terrible’, said Lee.

Each sentence was linked strongly or weakly to special subjective and objective nodes depending on the amount of evidence there was within the sentence that it was subjective or objective.

The sentences are then clustered into subjective and objective camps based on the strength of the links. This is a graph partitioning problem known as finding the minimum cut, and it can be solved exactly by a quick, efficient algorithm, said Lee.
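The minimum cut formulation described above can be tried on a toy example. The sketch below is my own illustration, not the Cornell code: the evidence scores and the association strength are invented integers, the node names "SUBJ" and "OBJ" are hypothetical, and I use a standard shortest-augmenting-path maximum flow computation (Edmonds-Karp) to find the cut.

```python
from collections import defaultdict, deque

def min_cut_labels(subj_evidence, obj_evidence, associations):
    """Label each sentence by finding a minimum cut between two special nodes."""
    source, sink = "SUBJ", "OBJ"
    cap = defaultdict(lambda: defaultdict(int))
    for s, w in subj_evidence.items():
        cap[source][s] += w              # strong link = strong subjective evidence
    for s, w in obj_evidence.items():
        cap[s][sink] += w                # strong link = strong objective evidence
    for (a, b), w in associations.items():
        cap[a][b] += w                   # neighbouring sentences prefer the same label
        cap[b][a] += w

    def search():
        """Breadth-first search over edges that still have spare capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in list(cap[u].items()):
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = search()
        if sink not in parent:
            break                        # no augmenting path left: the cut is found
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= flow
            cap[v][u] += flow
    # Sentences still reachable from the subjective node are on its side of the cut.
    return {s: "subjective" if s in parent else "objective" for s in subj_evidence}

# Invented scores on a 0-10 scale: sentence 1 is ambiguous when taken alone.
subj = {0: 9, 1: 5, 2: 1}
obj = {0: 1, 1: 5, 2: 9}
near = {(0, 1): 4}                       # sentences 0 and 1 sit next to each other
labels = min_cut_labels(subj, obj, near)
print(labels)
```

Sentence 1 has equal evidence both ways, so on its own it could go either way. Its link to the clearly subjective sentence 0 tips the balance, which is exactly the kind of contextual effect the researchers describe.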

And is this approach successful?


The method improved sentiment classification performance from 82.8 to 86.4 percent, which is statistically very significant, according to Lee. The method could eventually be used to maintain review-aggregator Web sites, to filter search results by viewpoint, and to track attitudes toward a given topic, she said.

When will we be able to use such software? And what will it be useful for?


It will take at least a decade before the system can readily handle unrestricted texts containing arbitrary rhetorical devices, she said.

The method could be used by search engines to sort or filter results by viewpoint to, for instance, help users distinguish between objective and biased Web sites, said Lee.

It could also be used to track changes in attitudes toward a given topic by, for instance, analyzing press articles, she said.

And companies could use the system to gather business intelligence such as finding out what people think of their products or the products of their competitors. “A computer company might crawl blogs to find out whether or not people like its latest laptop model,” said Lee.

The research work has been published in the Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, held July 21 to 26, 2004 in Barcelona, Spain under the title “A Sentimental Education: Sentiment Analysis Using Subjectivity Summarization Based on Minimum Cuts.”


Here are two links to the abstract and to the full paper (PDF format, 8 pages, 264 KB). The above diagram was extracted from this paper.


Sources: Kimberly Patch, Technology Research News, November 17/24, 2004; Cornell University website


Related stories can be found in the following categories.




  • Artificial Intelligence

  • Business Intelligence

  • Search

  • Software

