Google admits word database came from third party

Google has acknowledged using a third-party database to develop software that bears similarities to a rival's

Faced with mounting questions over similarities with a rival's software, Google Sunday acknowledged that a dictionary of Chinese words used with one of its recently released software tools came from a third party. The statement came as Google faces a looming deadline to stop downloads of the software and issue an apology.

Google's Pinyin Input Method Editor (IME) "was built leveraging some non-Google database resources," Google China spokeswoman Cui Jin wrote in an e-mail response to questions. The IME allows users to enter Chinese characters by typing their Pinyin romanization equivalents.

"We are willing to face this issue of ours," Cui wrote. She did not describe the database or where it came from.

The admission comes as Google faces a deadline from Sohu.com to stop allegedly infringing on its copyrights. On Friday, the Chinese Internet company gave Google until Monday to stop downloads of its IME software and issue an apology. Sohu also wants compensation from Google. At the time of writing, Google's software remains available online.

Cui did not respond to questions concerning Sohu's letter.

Google's Pinyin IME bears an uncanny resemblance to Sohu's Sogou Pinyin IME, which draws on a database of popular search queries from Sohu's Sogou search engine to suggest characters that match the Pinyin entered by a user.
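To make the mechanism concrete, here is a minimal sketch of how a Pinyin IME might turn typed romanization into ranked character suggestions. It is not Google's or Sohu's implementation. The function names, the toy lexicon and the popularity counts are all invented for illustration; a real IME would draw its counts from a source such as search-query logs.

```python
from collections import defaultdict

# Hypothetical toy lexicon: (pinyin, characters, popularity count).
# The counts are invented for illustration, not taken from any real product.
TOY_ENTRIES = [
    ("shi", "是", 900),
    ("shi", "十", 300),
    ("shi", "事", 500),
    ("fenggong", "冯巩", 120),
]


def build_index(entries):
    """Group candidate words by Pinyin, most popular first."""
    index = defaultdict(list)
    for pinyin, word, count in entries:
        index[pinyin].append((word, count))
    for candidates in index.values():
        candidates.sort(key=lambda item: item[1], reverse=True)
    return dict(index)


def suggest(index, pinyin, limit=5):
    """Return up to `limit` candidate words for the typed Pinyin."""
    return [word for word, _ in index.get(pinyin, [])[:limit]]


INDEX = build_index(TOY_ENTRIES)
```

In this sketch, `suggest(INDEX, "shi")` returns the three homophones ordered by their invented counts, and a Pinyin string with no entry returns an empty list. A mis-keyed dictionary entry would therefore surface a word only under the wrong spelling.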

The dictionaries used with the software from both Google and Sohu shared several mistakes, where Chinese characters were matched with the wrong Pinyin equivalents. In addition, both dictionaries listed the names of engineers who had developed Sohu's Sogou Pinyin IME.

These names were added to the Sohu dictionary solely for the convenience of the engineers and would not otherwise need to appear in the dictionary, said Wang Xiaochuan, Sohu's vice president of technology and head of the company's research and development center, in an interview over the weekend.

A review of the first version of Google's Pinyin IME conducted by Sohu's engineers revealed a dictionary of around 330,000 words and their Pinyin equivalents, including more than 300,000 entries that are identical to those in Sohu's dictionary, Wang said.
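The article does not say how Sohu's engineers ran the comparison. As a rough sketch only, an overlap check of this kind could treat each dictionary as a set of (word, Pinyin) pairs and count the pairs that match exactly. The function name and the toy entries below are assumptions made for illustration.

```python
def shared_fraction(candidate, reference):
    """Fraction of candidate entries that appear identically in reference."""
    if not candidate:
        return 0.0
    return len(candidate & reference) / len(candidate)


# Toy data invented for illustration; each entry is a (word, pinyin) pair.
# The first pair mimics a shared error: the wrong Pinyin in both dictionaries.
reference = {("冯巩", "pinggong"), ("中国", "zhongguo"), ("北京", "beijing")}
candidate = {("冯巩", "pinggong"), ("中国", "zhongguo"), ("上海", "shanghai")}

toy_share = shared_fraction(candidate, reference)

# Rough share implied by the article's round figures:
# about 300,000 identical entries out of about 330,000.
article_share = 300_000 / 330_000
```

With the round figures quoted above, the implied share is roughly 91 percent. Because the counts are given only as "around" and "more than", that number is an approximation, not an exact measure.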

"We have never made this dictionary public or licensed anybody to use it," he said.

Google was slow to respond to questions over its dictionary late last week, even as it made changes to remove similarities with Sohu's Pinyin IME.

On Friday, Google released an updated version of its Pinyin IME that removed the names of the Sohu engineers from its dictionary. That update removed 600 words from the dictionary, while adding just one, Sohu's Wang said. It did not, however, remove Pinyin errors, such as one mistake that required users to type the incorrect Pinyin -- pinggong -- to get the characters for the name of Feng Gong, an actor and comedian.

That error has been changed in the latest version of Google's Pinyin IME released on Sunday. "The new dictionary is now based on tens of thousands of entries Google's enormous search database has accumulated over the years," Cui wrote.

That claim was confirmed Monday by Sohu, which said the similarity between Google's dictionary and its own dictionary had fallen from 96 percent to 79 percent with the latest version of the software.
