Antispam system behind massive book digitization effort
- 29 May, 2007 09:46
- Comments
You know those pesky but necessary CAPTCHA boxes whose squiggly letters and digits you need to retype to make use of certain parts of sites such as Yahoo, Wikipedia and PayPal?
A computer scientist from Carnegie Mellon is looking to replace many of those boxes with antispam boxes of his own for the purpose of helping to digitize and make searchable the text from books and other printed materials. To boot, the system could help companies better secure their Web sites.
The idea is somewhat along the lines of projects like the famous SETI@Home grid supercomputer project for detecting signs of extra terrestrial life from deep space. Organizers of SETI@Home convinced computer users all over the world to allow their computers' CPU cycles to be used to process information for the ET hunt when the systems weren't otherwise being used.
But in the case of Luis von Ahn's project, he and his team are convincing organizations to replace the CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) security boxes on their Web sites with what the assistant professor of computer science calls reCAPTCHA boxes. Instead of requiring visitors to retype random numbers and letters, they would retype text that otherwise is difficult for the optical character recognition systems to decipher when being used to digitize books and other printed materials. The translated text would then go toward the digitization of the printed material on behalf of the Internet Archive project.
"I think it's a brilliant idea -- using the Internet to correct OCR mistakes," said Brewster Kahle, director of the Internet Archive, in a statement. "This is an example of why having open collections in the public domain is important. People are working together to build a good, open system."
Von Ahn says it is estimated that people solve 60 million-plus CAPTCHAs a day, amounting to 150,000 or more man hours of work that can be put to use for the digitization effort. His team is working with Intel to offer a Web-based service enabling Webmasters to adopt reCAPTCHAs to secure their sites.
An audio version is in the works for transcribing radio programs and that can be used by blind Web users.
- Bookmark this page
- Share this article
- Got more on this story? Email Computerworld
- Follow Computerworld on twitter
- Case Study: BNP Paribas Deploys Oracle Exadata to Accelerate Information Processing - The Hardware Perspective
- The State of Data Security
- Oracle IT Modernization Series Modernization: The Path to SOA
- Optimised License Management for the Datacenter
- Cost Effective Security and Compliance with Oracle Database 11g Release 2
-
The NBN, service providers and you... what could go wrong?
-
NBN build gaining momentum daily: Quigley
-
FTC chairman: Do-not-track law may not be needed
-
Kindle sales soar but Amazon mum on actual numbers
-
Wall Street Beat: IPOs, M&A, chip news stir tech optimism
-
Windows 7 for Seniors for Dummies®
-
Windows 7 for Dummies®
-
Teach Yourself Visually Windows 7
-
Office 2007 All-In-One Desk Reference for Dummies
-
Windows 7 for Dummies® Dvd+book Bundle
-
Office 2007 for Dummies
-
MYOB Software for Dummies 6E Australian Edition
-
Excel 2007 All-In-One Desk Reference for Dummies
-
Computers for Seniors for Dummies, 2nd Edition









Comments
Post new comment