Google has launched a system that will open up millions of additional pages of information to its leading Internet search engine by indexing the text contained within scanned documents placed on the Web in Adobe Systems' popular PDF file format, the Mountain View, California-based company announced Thursday.

The move to add the contents of scanned Portable Document Format files promised to greatly increase the amount of information Google's search engine sifts through every day, and to make more academic papers, government documents and other written material available to searchers who use Google.

Vast Improvement Over Previous PDF File Indexing Efforts

"In the past, scanned documents were rarely included in search results as we couldn't be sure of their content," said Evin Levey, a Google product manager who announced the indexing initiative in a message posted Thursday on the Google Web site.

By using advanced optical character recognition (OCR) software that it has fine-tuned over the years in other company efforts such as Google Book Search, Google said, it has greatly expanded on a previous PDF file indexing technique that let searchers see only a brief abstract, or sometimes nothing whatsoever, about what a scanned PDF file contained.

"While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read," said Levey in the Thursday message. "We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format," Levey said, and then to turn the resulting scan into "words that can be searched and indexed."
Incorporating A Series Of Technologies From Other Google Efforts

The system Google announced Thursday may also draw on technology described in a January patent filing that expands the scope of what can be searched online by recognizing text within the images that make up videos, such as those on YouTube. The method allows computers to find any text within images, enhance the images and then perform OCR on any text found, essentially turning a vacation video of a busy city street into a searchable database of business names, street signs, window signs or even the words on the T-shirts of people passing by.

Google did not reveal details of the process the new PDF indexing system uses; however, some observers considered it likely to incorporate at least some of an existing Google OCR project called OCRopus.

Google said the new, more complete scanned PDF file indexing system had brought the company closer to one of its original goals, a move Levey called "a small but important step forward in our mission of making all the world's information accessible and useful."

Google Now Indexing Text Within Scanned Adobe PDF Files

On Tuesday, Google announced a $125 million settlement surrounding its book scanning program, Google Book Search, another scanning initiative the Internet giant has seen as important to fulfilling an original goal of company co-founders Sergey Brin and Larry Page.

"Google's mission is to organize the world's information and make it universally accessible and useful," Brin said in announcing the settlement.

Google's search engine returns a list of some 300 million PDF files when queried for the Adobe file type. Those files are likely to contain a wide range of information on all manner of subjects that is, for the first time, available to users of the Internet's leading search engine.