Technology Trends

Databases

Decoding the Genome Needs Superpower

The Wellcome Trust Sanger Institute is one of the largest genomics data centers in the world. In “The Hum and the Genome,” The Scientist writes about the IT infrastructure needed to handle the avalanche of data that researchers have to analyze. With its 2,000 processors and 300 terabytes of storage, the data center currently draws about 0.75 megawatts (MW) of power at a cost of €140,000 per year (about $170K). But the data center will need more than a petabyte of storage within three years, and its yearly electricity bill will reach €500,000 (more than $600K) for about 1.4 MW, enough to power more than a thousand homes. Read more…


Below is a small diagram showing the current IT infrastructure of the Wellcome Trust Sanger Institute, used by the Human Genome Project (Credit: Wellcome Trust Sanger Institute).



Here is a link to a larger version of this chart.


Now, let’s look at this IT infrastructure in detail.



  • Computers


    • Today: The data center hosts about 2,000 Alpha processors, originally designed by Digital Equipment Corporation (DEC) before its acquisition by Compaq, which was itself later acquired by Hewlett-Packard (HP).

    • Tomorrow: The Sanger Institute is looking at cheaper solutions, especially now that HP has officially stopped any development on the Alpha front.

  • Storage


    • Today: Three different computer rooms have a total capacity of about 300 terabytes.

    • Tomorrow: The IT management forecasts about a petabyte within three years — at least.

  • Databases


    • Today: There are about 40 different databases, and only two of them are in the 50-terabyte range.

    • Tomorrow: One of the databases, the Trace sequence archive, currently contains about 700 million entries and doubles in size every 10 months.

  • Power bills


    • Today: The current equipment needs about 0.75 megawatts for a cost of €140,000 per year (about $170K).

    • Tomorrow: The new setup will need about 1.4 megawatts, which will raise the yearly bill to about €500,000 (about $615K today).
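To get a feel for what "doubles every 10 months" means for the Trace archive mentioned in the list above, here is a quick back-of-the-envelope sketch in Python. The 700 million entries and the 10-month doubling period come from the figures above; the assumption that the growth rate stays constant for three years, and the function name, are mine.

```python
def projected_entries(current, doubling_months, months_ahead):
    """Exponential growth: the count doubles once every `doubling_months`."""
    return current * 2 ** (months_ahead / doubling_months)

TRACE_ENTRIES_TODAY = 700e6   # figure quoted in the post
DOUBLING_MONTHS = 10          # figure quoted in the post

# Assumes the doubling period holds steady, which is a guess, not a forecast.
for months in (10, 20, 36):
    entries = projected_entries(TRACE_ENTRIES_TODAY, DOUBLING_MONTHS, months)
    print(f"after {months:2d} months: about {entries / 1e9:.1f} billion entries")
```

If the trend held, the archive would pass 8 billion entries within the same three years in which storage is expected to reach a petabyte.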

The supercomputer vendors can say all they want about diminishing costs. But they almost never talk about the power bills…
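The power figures above can also be turned into an implied price per kilowatt-hour. The sketch below assumes that the equipment draws its rated power around the clock, all year; that is my simplification, not something stated in the article, and the function name is only illustrative.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

def implied_price_per_kwh(power_mw, yearly_bill_eur):
    """Average price per kWh if the load ran at `power_mw` all year long."""
    kwh_per_year = power_mw * 1000 * HOURS_PER_YEAR
    return yearly_bill_eur / kwh_per_year

today = implied_price_per_kwh(0.75, 140_000)      # figures from the post
tomorrow = implied_price_per_kwh(1.4, 500_000)    # figures from the post

print(f"today:    {today:.3f} EUR/kWh")
print(f"tomorrow: {tomorrow:.3f} EUR/kWh")
print(f"power grows {1.4 / 0.75:.2f}x, the bill grows {500_000 / 140_000:.2f}x")
```

Under that assumption, the bill grows almost twice as fast as the power draw, so the forecast seems to include a higher electricity price, or other costs, on top of the extra megawatts.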


Sources: Stuart Blackman, The Scientist, Volume 19, Issue 11, Page 15, June 6, 2005; and various websites


Related stories can be found in the following categories.



  • Databases

  • Energy

  • Genetics

  • IT

  • Storage

  • Supercomputers


Sifting Rapidly Through Petabytes of Data

Searching within large databases has never been easy. But when it comes to physics, and especially to experiments with particle colliders, the task becomes extremely difficult. You have to look at hundreds of millions of particle collisions to isolate only a few dozen of interest. And when you realize that all these individual records are stored in data files and systems scattered all over the world, it becomes clear that the search process is a tough challenge. But now, a technology known as the Word-Aligned Hybrid (WAH) compression method, developed at Lawrence Berkeley National Laboratory (LBNL), is dramatically speeding up the search process. For example, it took only 15 minutes to retrieve 80 events recorded in 2001 and hidden like needles in a haystack inside petabytes of data. But read more…


Here is how Berkeley Lab describes the way the WAH method is used.


WAH is currently used in a software package called FastBit to compress bitmap indexes. A bitmap index is a method of reducing the response time of queries involving common types of conditions in data objects, such as “state = CA” and “age >= 21.” It achieves this by storing certain pre-computed answers as bitmaps. For example, a bitmap index for “state” might have one bitmap for each state in the U.S. Because computers can manipulate bitmaps efficiently, bitmap indices are efficient in searching for interesting records in large datasets.

WAH compression makes the bitmap index optimal in terms of computational complexity. A small number of the most efficient indexing schemes have this optimality property. What makes the new technology unique is that WAH-compressed indexes significantly outperform other schemes in tests.

“In tests conducted using actual data from high-energy physics experiments, we confirmed that our FastBit software is an order of magnitude faster than the best-known bitmap indexing schemes on average,” according to John Wu, the lead developer of FastBit.
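The bitmap index described in the excerpt above is easy to illustrate. Here is a minimal sketch in Python, using the two conditions quoted by Berkeley Lab, “state = CA” and “age >= 21.” This is not FastBit code: the sample records and the helper name are made up, and ordinary Python integers stand in for the bitmaps, with bit number i set when record i matches.

```python
from collections import defaultdict

# Made-up sample records, for illustration only.
records = [
    {"state": "CA", "age": 34},
    {"state": "NY", "age": 19},
    {"state": "CA", "age": 20},
    {"state": "TX", "age": 52},
    {"state": "CA", "age": 21},
]

def build_bitmap_index(rows, column, key=lambda value: value):
    """Return one bitmap per distinct key; bit i is set if row i has that key."""
    index = defaultdict(int)
    for position, row in enumerate(rows):
        index[key(row[column])] |= 1 << position
    return index

# Pre-computed answers, stored once as bitmaps.
state_index = build_bitmap_index(records, "state")
adult_index = build_bitmap_index(records, "age", key=lambda age: age >= 21)

# "state = CA AND age >= 21" becomes a single bitwise AND.
matches = state_index["CA"] & adult_index[True]
hits = [i for i in range(len(records)) if (matches >> i) & 1]
print(hits)  # [0, 4]
```

The query never looks at the records themselves: it combines two pre-computed bitmaps, which is why this kind of index suits large, read-mostly datasets.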

Of course, the key here was to build the compressed indexing system.


A number of specialized compression schemes have been proposed to process compressed indexes efficiently, with the best-known one called the Byte-aligned Bitmap Code (BBC).

The goal of the Berkeley Lab project was to create an indexing system that could be compressed and at the same time offers much faster searches than existing methods. To achieve this goal, the WAH compression scheme was developed. While WAH-compressed indexes are slightly larger than BBC-compressed indexes, the time needed to process a query is less, often much less.
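To make the idea of a word-aligned scheme more concrete, here is a simplified sketch in Python. It is not the FastBit implementation. It assumes the layout usually given for WAH with 32-bit words: the bitmap is cut into 31-bit groups, a group that mixes 0s and 1s is stored as a literal word (top bit 0), and a run of identical all-0 or all-1 groups is stored as a single fill word (top bit 1, next bit the fill value, low 30 bits the number of groups). The function names are mine.

```python
GROUP = 31                      # payload bits carried by one 32-bit word
ALL_ONES = (1 << GROUP) - 1
FILL_FLAG = 1 << 31             # top bit set marks a fill word
COUNT_MASK = (1 << 30) - 1      # low 30 bits of a fill word hold the run length

def wah_encode(bits):
    """Encode a list of 0/1 values; the tail is zero-padded to a full group."""
    bits = list(bits) + [0] * (-len(bits) % GROUP)
    words = []
    for start in range(0, len(bits), GROUP):
        value = 0
        for bit in bits[start:start + GROUP]:
            value = (value << 1) | bit
        if value in (0, ALL_ONES):
            fill_word = FILL_FLAG | ((1 if value else 0) << 30)
            last = words[-1] if words else 0
            same_run = (last & ~COUNT_MASK) == fill_word
            if same_run and (last & COUNT_MASK) < COUNT_MASK:
                words[-1] += 1              # extend the current run by one group
            else:
                words.append(fill_word | 1)  # start a new run of length 1
        else:
            words.append(value)             # literal word: top bit stays 0
    return words

def wah_decode(words):
    """Expand the words back into a list of 0/1 values (including padding)."""
    bits = []
    for word in words:
        if word & FILL_FLAG:
            fill_bit = (word >> 30) & 1
            bits.extend([fill_bit] * (GROUP * (word & COUNT_MASK)))
        else:
            bits.extend(int(ch) for ch in format(word, "031b"))
    return bits

# 403 bits: ten empty groups, one mixed group, two full groups.
sample = [0] * 310 + [1, 0, 1] + [0] * 28 + [1] * 62
encoded = wah_encode(sample)
print(len(sample), "bits ->", len(encoded), "words")
```

Because every word is either a whole literal or a whole run, a query can work word by word without unpacking individual bytes. That would be consistent with the trade-off described above: somewhat larger indexes than BBC, but less work per query.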

Now, let’s look in more detail at the Grid Collector, the software used to analyze the petabytes of data generated each year by the STAR (Solenoidal Tracker at RHIC) high-energy physics experiment.


First, here is a link to a paper called “The Grid Collector: Using an Event Catalog to Speedup User Analysis in Distributed Environment” (PDF format, 4 pages, 251 KB).


Then, this research work will be presented next June at the International Supercomputer Conference in Heidelberg, Germany, where it has been selected as one of the three best papers.


Here is a link to the abstract of this paper named “Grid Collector: Facilitating Efficient Selective Access from Data Grids.” Below is an excerpt.


Since most analysis jobs filter out significant number of events, a considerable amount of time is wasted by reading the unwanted events. The Grid Collector removes this inefficiency by allowing users to specify more precisely what events are of interest and to read only the selected events. This speeds up most analysis jobs. In existing analysis frameworks, the responsibility of bringing files from tertiary storage to disk falls on the users. This forces most of analysis jobs to be performed at centralized computer facilities where commonly used files are kept on disks.
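The selective access described in this excerpt can be pictured with a toy event catalog. The sketch below is only an illustration of the idea and has nothing to do with the actual Grid Collector code: the catalog entries, file names, attribute names and the `plan_reads` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical catalog: one entry per event, recording which file holds it.
catalog = [
    {"event_id": 1, "file": "run_a.dat", "energy": 12.0, "tracks": 40},
    {"event_id": 2, "file": "run_a.dat", "energy": 95.5, "tracks": 310},
    {"event_id": 3, "file": "run_b.dat", "energy": 8.2, "tracks": 25},
    {"event_id": 4, "file": "run_b.dat", "energy": 15.1, "tracks": 60},
    {"event_id": 5, "file": "run_c.dat", "energy": 88.0, "tracks": 290},
]

def plan_reads(entries, predicate):
    """Group the events that satisfy `predicate` by the file that stores them."""
    plan = defaultdict(list)
    for entry in entries:
        if predicate(entry):
            plan[entry["file"]].append(entry["event_id"])
    return dict(plan)

# The user states precisely which events are of interest...
plan = plan_reads(catalog, lambda e: e["energy"] > 50 and e["tracks"] > 200)

# ...so only the files (and events) in the plan need to be fetched and read.
for filename, event_ids in plan.items():
    print(filename, event_ids)
```

In this toy case, run_b.dat never has to be brought from tertiary storage to disk, because the catalog already shows that it holds no event of interest.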

Finally, the researchers filed an application for a U.S. patent, which was granted on December 14, 2004, under the name “Word aligned bitmap compression method, data structure, and apparatus.” And here is a link to this patent, number 6,831,575.


Sources: Lawrence Berkeley National Laboratory news release, May 16, 2005; and various websites


Related stories can be found in the following categories.



  • Databases

  • Physics

  • Search

  • Software


After PageRank, Here Comes LexRank

Today, if you want to know what’s going on in the world, you can watch TV, read your newspaper or use the Internet to browse news sites. But imagine a day when you just have to enter a few words on your computer, such as “Olympic Games,” push a button, and be able to read an automatic, and accurate, summary of what appears in major sources about this specific subject. This is the goal of a project which started at the University of Michigan and is explained by Technology Research News in “Summarizer ranks sentences.” This new multi-document summarization technique, named LexRank, looks for similarities among sentences and rates them via a ‘prestige score’ analogous to the one used by Google’s PageRank. “In a sense, sentences vote for each other just by virtue of being similar to each other,” said one of the researchers. This algorithm may also be applied to automatic translation and question answering in a year or two. Read more…


Let’s start with a description of the project.


Researchers from the University of Michigan have developed a multi-document summarization technique that compares sentences and has the effect of sentences voting for the most important among them. The method, dubbed LexRank, combines the content-sorting concepts of prestige and lexical similarity to find the most important sentences in a group of documents on the same subject.

Algorithms that use prestige to sort information have been around since the ’90s. It is possible to find the most prestigious, or popular member of a network by analyzing the relationships among network members. In a social network, for example, the most prestigious individual can be identified by analyzing the social relations among all pairs of members of the group.

Now, let’s look in more detail at how the LexRank algorithm uses similarities among sentences.


The researchers’ lexical centrality algorithm compares the lexical similarity of sentences. “Lexical similarity can be thought of as a measure of the word overlap between two sentences,” said Gunes Erkan [, one of the researchers.] “For example, ‘Bush went to China’ and ‘George Bush visited China’ are fairly similar in a lexical way [but] ‘Bush visited China’ and ‘Blair is the prime minister of the United Kingdom’ have no overlap at all,” he said.

The researchers’ system considers a sentence important if it is similar to many other sentences and if those other sentences are themselves important. “In a sense, sentences vote for each other just by virtue of being similar to each other,” said Dragomir Radev [, an assistant professor at the University of Michigan.] “The sentences with the highest scores… are considered to contain the gist of the document and are presented as the multi-document summary,” he said.
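The voting idea described in these excerpts can be sketched in a few lines of Python, using the example sentences quoted above. This is a simplified illustration, not the researchers' code. It assumes plain cosine similarity over word counts as the measure of word overlap, ignores a sentence's similarity to itself, lets a sentence with no overlap spread its vote evenly, and borrows the damping factor of 0.85 from the usual description of PageRank. All of those choices, and the function names, are mine.

```python
import math
from collections import Counter

def cosine(a, b):
    """Word-overlap similarity between two sentences (0 = nothing in common)."""
    wa, wb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(wa[w] * wb[w] for w in wa)
    norm_a = math.sqrt(sum(v * v for v in wa.values()))
    norm_b = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lexrank(sentences, damping=0.85, iterations=100):
    """Score sentences by letting similar sentences vote for each other."""
    n = len(sentences)
    votes = []
    for i, a in enumerate(sentences):
        row = [0.0 if i == j else cosine(a, b) for j, b in enumerate(sentences)]
        total = sum(row)
        # A sentence similar to nothing spreads its vote evenly over all sentences.
        votes.append([s / total for s in row] if total else [1.0 / n] * n)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[j] * votes[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores

sentences = [
    "Bush went to China",
    "George Bush visited China",
    "Bush visited China",
    "Blair is the prime minister of the United Kingdom",
]
scores = lexrank(sentences)
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.3f}  {sentence}")
```

With these four sentences, “Bush visited China” gets the highest score because it overlaps most with the other two Bush sentences, while the Blair sentence, which shares no word with the rest, ends up last.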

This algorithm is already used for a Web-based news summarization site, NewsInEssence. Please note that this is an experiment and that the site is not always up. If you cannot access it from the previous link, try this one.


LexRank could have some other uses.


The researchers are also looking for other uses of the lexical centrality algorithm. Possibilities include automatic translation and question answering, said Radev. The method could potentially find sentences that are likeliest to contain the answer to a given natural language question, or, in the biomedical domain, sentences that are most likely to contain important facts like particular protein interactions, said Radev.

The research work was presented in July 2004 during the Empirical Methods in Natural Language Processing (EMNLP 2004) conference held in Barcelona, Spain. Please check the EMNLP 2004 Proceedings if you’re interested in the subject.


And for more information, here are links to two technical documents about LexRank, “LexPageRank: Prestige in Multi-Document Text Summarization” (PDF format, 7 pages, 84 KB) and “LexRank: Graph-based Lexical Centrality as Salience in Text Summarization” (PDF format, 23 pages, 272 KB).


Will LexRank one day become as popular as PageRank is today? We’ll know in a year or two.


Sources: Kimberly Patch, Technology Research News, April 20/27, 2005; and various websites


Related stories can be found in the following categories.



  • Databases

  • Google

  • Internet

  • Search

  • Software


Streaming a Database in Real Time

Michael Stonebraker is well known in the database business, and for good reason. He was the computer science professor behind Ingres and Postgres. Eighteen months ago, he started a new company, StreamBase, with another computer science professor, Stan Zdonik, with the goal of speeding up access to relational databases. In “Data On The Fly,” Forbes.com reports that the company’s software, also named StreamBase, reads TCP/IP streams and uses asynchronous messaging. Because it processes streaming data without first storing it on disk, as other relational database software does, it gains a tremendous speed advantage. The company claims it can process 140,000 messages per second on a $1,500 PC, while its competitors can handle only 900 messages per second. Too good to be true? Read more…


Here are some excerpts from the Forbes article.


“Relational databases are one to two orders of magnitude too slow,” says Stonebraker, who is chief technology officer at Streambase, a 25-person outfit based in Lexington, Mass. “Big customers have already tried to use relational databases for streaming data and dismissed them. Those products are non-starters in this market.”

In a recent pilot program, Streambase was able to analyze 140,000 messages per second, while a leading relational database — Stonebraker won’t say which one — could handle only 900 messages per second. Streambase has 12 customers now testing its software, all of them financial services companies that need to analyze rapid-fire ticker feeds and other streaming data.

Unlike traditional database programs, Streambase analyzes data without storing it to disk, performing queries on data as it flows. Traditional systems bog down because they first store data on hard drives or in main memory and then query it, Stonebraker says.

The software, which should be commercially available next month, runs on Linux and Solaris, but a Microsoft version should be available soon.
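The difference between "store, then query" and "query as it flows" can be shown with a small Python sketch. It has nothing to do with StreamBase's actual software or query language, which I have not seen: the ticker symbols, prices and function names are invented, and a simple moving average stands in for a real query.

```python
from collections import deque

def moving_average(ticks, window=3):
    """Yield (symbol, average of the last `window` prices) as each tick arrives.

    Only the last `window` prices per symbol are kept in memory;
    nothing is written to disk and the full history is never stored.
    """
    recent = {}
    for symbol, price in ticks:
        prices = recent.setdefault(symbol, deque(maxlen=window))
        prices.append(price)
        yield symbol, sum(prices) / len(prices)

def ticker_feed():
    """Stand-in for a live feed; a real one would read from a network socket."""
    yield from [
        ("ACME", 10.0),
        ("ACME", 11.0),
        ("INIT", 5.0),
        ("ACME", 12.0),
        ("ACME", 13.0),
    ]

results = list(moving_average(ticker_feed()))
for symbol, average in results:
    print(symbol, average)
```

Each message is handled once, when it arrives, and then forgotten except for a small sliding window. A store-first system would instead insert every tick into a table and run the query against that table afterwards.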


The database business is not a cheap one. So how much will this new company charge for its largely unproven software?


Streambase charges customers annual subscriptions for its software, setting prices based on how many CPUs a customer uses to power the software. Typical deals so far have ranged from $100,000 to $300,000 a year, says Barry Morris, Streambase’s chief executive.

In “StreamBase eyes real-time streaming apps,” InfoWorld wrote that the prices should be lower.


The software is available via a subscription model, with pricing in the range of approximately $50,000 per year, Stonebraker said. Subscriptions are sold on a per-CPU basis.

Who will the customers be for this kind of high-speed data access? Let’s come back to Forbes.com.


For now Streambase is focusing attention on financial services companies, which hope to do things like track how well traders are performing on a real-time basis, rather than aggregating trades at the end of the day and analyzing them overnight.

A bigger opportunity involves processing real-time data feeds generated by sensor networks and RFID tags. A military contractor wants to use Streambase to keep track of soldiers and vehicles in the battlefield. A casino in Las Vegas is considering using Streambase to track the performance of individual gamblers.

In an interview with InfoWorld, Stonebraker gave more details about military applications.


We did a prototype that dealt with army battalion monitoring. When an army battalion is 30,000 humans and 12,000 vehicles, the army is deadly serious about getting a vital signs monitor on every one of the humans so they can do combat medical triage or [take other actions]. They already have a GPS system in every vehicle, but that didn’t keep Jennifer Lynch’s convoy from getting lost.

They want to turn this into a system to watch the position of every vehicle and compare it against where you’re supposed to be. They also want to put a sensor on the gun turret. Together with position, that allows you to detect crossfire which is a big problem in Iraq. [Also,] they want to put a monitor on the gas gauge and figure out do you have enough fuel to accomplish your mission. It’s this style of application which is large amounts of real-time data with real-time actions to take.

All these numbers, and some pages on the company website, look rosy, but if you want to read a whitepaper or some benchmarks, you need to register, and to be accepted. I’m not sure it’s a good way to find new customers. But, after all, the company only plans to start doing business next month.


If you know solid numbers about this company’s claims, please let me know.


Sources: Daniel Lyons, Forbes.com, January 18, 2005; Paul Krill, InfoWorld, January 7 and 11, 2005; and StreamBase website


Related stories can be found in the following categories.




  • Databases

  • IT

  • Software

  • Military Applications

