Antispam system behind massive book digitization effort

Computer scientist puts automated Turing test to use in book digitization effort

You know those pesky but necessary CAPTCHA boxes whose squiggly letters and digits you need to retype to make use of certain parts of sites such as Yahoo, Wikipedia and PayPal?

A computer scientist from Carnegie Mellon is looking to replace many of those boxes with antispam boxes of his own for the purpose of helping to digitize and make searchable the text from books and other printed materials. To boot, the system could help companies better secure their Web sites.

The idea is somewhat along the lines of projects such as SETI@home, the well-known distributed computing project for detecting signs of extraterrestrial life from deep space. Organizers of SETI@home convinced computer users all over the world to let their computers' CPU cycles process information for the ET hunt when the systems weren't otherwise being used.

But in the case of Luis von Ahn's project, he and his team are convincing organizations to replace the CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) security boxes on their Web sites with what the assistant professor of computer science calls reCAPTCHA boxes. Instead of asking visitors to retype random numbers and letters, the boxes would have them retype text that optical character recognition (OCR) systems have trouble deciphering when digitizing books and other printed materials. The transcribed text would then go toward the digitization of the printed material on behalf of the Internet Archive project.

"I think it's a brilliant idea -- using the Internet to correct OCR mistakes," said Brewster Kahle, director of the Internet Archive, in a statement. "This is an example of why having open collections in the public domain is important. People are working together to build a good, open system."

Von Ahn says people solve an estimated 60 million-plus CAPTCHAs a day, amounting to 150,000 or more man-hours of work that could be put toward the digitization effort. His team is working with Intel to offer a Web-based service enabling Webmasters to adopt reCAPTCHAs to secure their sites.
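As a rough check on those numbers, the two daily totals quoted above imply an average solving time per CAPTCHA. The short Python sketch below works that out; the function and constant names are illustrative only, and because both figures are given as lower bounds the result is a back-of-the-envelope estimate rather than a measured value.

```python
# Daily figures as quoted in the article (both stated as "or more").
CAPTCHAS_PER_DAY = 60_000_000
MAN_HOURS_PER_DAY = 150_000

SECONDS_PER_HOUR = 3600


def seconds_per_captcha(captchas_per_day: int, man_hours_per_day: int) -> float:
    """Average human time per CAPTCHA implied by the two daily totals."""
    return man_hours_per_day * SECONDS_PER_HOUR / captchas_per_day


if __name__ == "__main__":
    avg = seconds_per_captcha(CAPTCHAS_PER_DAY, MAN_HOURS_PER_DAY)
    print(f"Implied average time per CAPTCHA: {avg} seconds")
```

In other words, the estimate assumes each CAPTCHA takes a person a little under ten seconds to solve.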

An audio version, which would help transcribe radio programs and could be used by blind Web users, is also in the works.
