News
 
Unicode nearing 50% of the web
2010-02-01 15:52
Administrator

About 18 months ago, we published a graph showing that Unicode had just overtaken every other text encoding on the web. Its growth since then has been even more dramatic.

Web pages can use a variety of character encodings, such as ASCII, Latin-1, Windows-1252, or Unicode. Most encodings can represent only a few languages, but Unicode can represent thousands, from Arabic to Chinese to Zulu. We have long used Unicode as the internal format for all the text we search: any other encoding is first converted to Unicode for processing.
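As a rough illustration of converting other encodings to Unicode before processing, here is a minimal Python sketch. The helper name `to_unicode` and the sample byte strings are assumptions made for this example; it is not a description of Google's actual pipeline, which the article does not detail.

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Hypothetical helper: decode bytes from a declared encoding into a Unicode str."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    return raw.decode(declared_encoding, errors="replace")


# The same kind of text as it might arrive in three different encodings.
latin1_bytes = b"caf\xe9"            # "café" encoded as Latin-1
cp1252_bytes = b"\x93quoted\x94"     # curly quotes encoded as Windows-1252
utf8_bytes = "café".encode("utf-8")  # "café" encoded as UTF-8

for raw, encoding in [
    (latin1_bytes, "latin-1"),
    (cp1252_bytes, "cp1252"),
    (utf8_bytes, "utf-8"),
]:
    # After decoding, every sample is an ordinary Unicode string.
    print(encoding, "->", to_unicode(raw, encoding))
```

Once everything is decoded to one internal representation, later steps such as tokenizing or matching no longer need to know which encoding the page originally used.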

This graph is from Google internal data, based on our indexing of web pages, and thus may vary somewhat from what other search engines find. However, the trends are pretty clear, and the continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover.

Searching for "nancials"?

Unicode is growing both in usage and in character coverage. We recently upgraded to the latest version of Unicode, version 5.2 (via ICU and CLDR). This adds over 6,600 new characters: some of mostly academic interest, such as Egyptian Hieroglyphs, but many others for living languages.

We're constantly improving our handling of existing characters. For example, the characters "fi" can be represented either as two characters ("f" and "i") or as a special display form, "ﬁ". A Google search for [financials] or [office] used to not see these as equivalent; to the software they would just look like *nancials and of*ce. There are thousands of characters like this, and they occur in surprisingly many pages on the web, especially generated PDF documents.

But no longer: after extensive testing, we recently turned on support for these and thousands of other characters, so your searches will now also find these documents. These are further steps in our mission to organize the world's information and make it universally accessible and useful.
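One common way to treat a ligature and its separate letters as equivalent is Unicode compatibility normalization (NFKC). The sketch below uses Python's standard `unicodedata` module; the helper name `fold_for_search` is an assumption for illustration, and the article does not say that this is the technique Google uses.

```python
import unicodedata


def fold_for_search(text: str) -> str:
    """Hypothetical helper: NFKC-normalize, then case-fold for matching."""
    # NFKC replaces compatibility characters such as U+FB01
    # (LATIN SMALL LIGATURE FI) with their plain equivalents ("f" + "i").
    return unicodedata.normalize("NFKC", text).casefold()


ligature_word = "\ufb01nancials"  # begins with the single ligature character
plain_word = "financials"

# The raw strings differ, so a naive comparison misses the match.
print(ligature_word == plain_word)

# After folding, both spellings compare equal.
print(fold_for_search(ligature_word) == fold_for_search(plain_word))

# The same folding repairs the "of*ce" case from the text.
print(fold_for_search("of\ufb01ce"))
```

Applying the same folding to both the indexed text and the query is what lets the two spellings meet; folding only one side would not help.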

And we're angling for a party when Unicode hits 50%!

Source: googleblog

Latest news
 
avast! Free Antivirus 5.0.462
2010-03-10

A new version of the well-known free antivirus avast! Free Antivirus has been released. The changes include minor interface updates and bug fixes.

 
 
Notepad++ 5.6.8
2010-03-10

An updated version of Notepad++, a plain-text editor, has been released. It supports a large number of features, including syntax highlighting for many programming languages. It can even print the highlighted text.

 
 
MeeGo code coming in March, will run on Atom boards and N900
2010-03-09

In an announcement published last week, Nokia's Valtteri Halla revealed that Intel and Nokia are planning to launch the public MeeGo source code repository by the end of the month.

 
 
Chrome OS to get business-grade edition
2010-03-09

At the RSA Conference on Thursday, a Google software security engineer said that Google will release a business version of its Chrome OS netbook operating system in 2011, after the consumer version is released later this year.

 
 
Opera Browser Downloads Triple After Microsoft Airs Browser Ballot
2010-03-09

Microsoft's Internet Explorer 8 is the slowest of the major browsers on the market, but it (along with its previous editions) still clings to almost 60 percent market share. Some say the share is so large because the browser is relatively secure (despite the many attacks that its dominance attracts) and because it's easily managed with IT software.

 