Alan Wood’s Unicode Resources

Creating Multilingual Web Pages:

Unicode Support in HTML, HTML Editors and Web Browsers


Introduction

Unicode is designed to allow single documents to contain characters or text from many scripts and languages, and to allow those documents to be used on computers with operating systems in any language and still remain intelligible. It is therefore ideally suited to the World Wide Web.

The HTML 4.0 Specification made a major step towards internationalizing the World Wide Web by adopting the Universal Character Set (as specified in ISO/IEC 10646 Information Technology - Universal Multiple-Octet Coded Character Set (UCS)) as the document character set for HTML. The UCS as specified in ISO/IEC 10646-1:2000 is precisely equivalent to the Unicode Standard 3.0.

RFC 2070 (Internationalization of the Hypertext Markup Language) has also been incorporated into HTML 4.0, which now includes provision for languages that are written right-to-left (such as Arabic and Hebrew), for appropriate punctuation, and for combining of letters and diacritics. Recent versions of Internet Explorer go even further, with support for Mongolian, which is written top-to-bottom.

Adding Unicode characters to Web pages

If you only want to use a few Unicode characters that are not on your keyboard, for example mathematical symbols or a few characters in a different script, there are three ways of entering these characters into your text.

  1. Character Entity References
    There are 252 characters that can be included in an HTML file by typing a symbolic name between an ampersand and a semicolon, for example &mdash; for an em dash (—). These character entity references are supposed to be displayed independently of the document’s character encoding, and so should work in HTML files with any character encoding.
    Index of character entity references
    Netscape Communicator 4.x cannot display most of these entities unless they are present in the document’s character encoding.

  2. Numeric Character References
    You can enter any Unicode character in an HTML file by taking its decimal Unicode number and adding an ampersand and a hash at the front and a semicolon at the end, for example &#8212; should display as an em dash (—). This is the method used in the Unicode test pages.
    Numeric character references are supposed to be displayed independently of the document’s character encoding, and so should work in HTML files with any character encoding. Netscape Communicator 4.x cannot display most numeric character references unless they are present in the document’s character encoding.

  3. Hexadecimal Character References
    If you prefer to use hexadecimal numbers instead of decimal ones, you can do so by adding an ampersand, a hash and an x at the front and a semicolon at the end. For example, &#x2014; should display as an em dash (—). Any Unicode character can be entered using this method.
    Netscape Communicator 4.x only recognises a few hexadecimal character references.
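
As a rough illustration (not part of the original page), the Python sketch below builds the decimal and hexadecimal forms for the em dash and checks that all three kinds of reference name the same character. The helper names decimal_reference and hex_reference are invented for this example; html.unescape comes from the Python standard library.

```python
import html

EM_DASH = "\u2014"  # the em dash used as the example in the list above


def decimal_reference(char: str) -> str:
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(char)};"


def hex_reference(char: str) -> str:
    """Return the hexadecimal character reference for one character."""
    return f"&#x{ord(char):X};"


# The three forms described above, all naming the same character.
forms = ["&mdash;", decimal_reference(EM_DASH), hex_reference(EM_DASH)]

# html.unescape resolves each reference back to the character it stands for.
decoded = [html.unescape(form) for form in forms]

print(forms)
print(all(char == EM_DASH for char in decoded))
```

The same two helpers should work for any single character, since every Unicode character has a code number even when it has no symbolic entity name.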

Using multiple scripts in Web pages

If you want to add text in other scripts to your HTML pages, it would obviously be time-consuming and error-prone to type many numeric character references, so you need to use either an HTML editor with multilingual support, or a word processor that has multilingual support and the ability to save files as HTML with UTF-8 character encoding.

Unicode Fonts

Unicode fonts allow complete character sets for several languages to be held within a single font file, but they do not need to contain all of the Unicode characters. Fonts for specific languages tend to give results that are more acceptable to native speakers than fonts that try to cover many languages and scripts. Editors should ideally be able to utilise more than one font for a single HTML document. Web browsers should be able to utilise more than one font for displaying a page that contains special characters or multiple scripts, by relying on their defaults or on the user’s preferences – it is rarely necessary for the author of a page to specify fonts.

Windows users have an increasing range of Unicode fonts, some for specific languages and others (such as Arial Unicode MS, Bitstream CyberBit and Code2000) covering many languages and scripts. Mac OS X 10 can use fonts intended for Windows, and comes with an increasing range of Mac Unicode fonts that allow a variety of scripts to be edited and displayed.

Although it is not normally necessary, you can use styles to specify preferred fonts (and alternatives) for sections of text in a particular language, by defining a class in a style sheet like this in the <head> of your file:

<style type="text/css">
.thai {font-family:"Cordia New",Ayuthaya,Tahoma,"Arial Unicode MS";}
</style>

You can then apply the style to any HTML tag in the <body>, and also specify a language for a section of text:

Latin text followed by <span class="thai" lang="th">Thai text</span> and more Latin text.

As Web browser support for multiple languages improves, specifying languages should help to provide better language-specific display of diacritics, combined characters, punctuation and hyphenation.

None of the HTML editors or word processors for Mac OS 9 can use Unicode TrueType fonts, even though the operating system supports them. Instead, they make use of Language Kits that use Apple’s proprietary character sets in order to type, display and print foreign and special characters, and then convert to Unicode when a file is saved with UTF-8 character encoding.

Character encodings

The character encoding of an HTML document specifies the technical details of how the characters in the document character set should be represented as bits when stored in a computer file or transmitted over the Internet. Fortunately you do not need to understand the technical details in order to write Web pages.

The only detail about character encodings that a writer needs to know is that some character encodings (for example UTF-8) allow any of the characters in the document character set to be included, while others (for example ISO-8859-1 or SHIFT_JIS) only allow for subsets. However, characters that are not allowed for in a character encoding can still be included in an HTML document by using character references. UTF-8 is the normal character encoding for any HTML file that contains text in two or more non-Latin scripts, but it can be used for any document.
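
The fallback to character references can be sketched in Python, assuming a page that has to be saved as ISO-8859-1. The sample string is made up for illustration; the "xmlcharrefreplace" error handler is a standard Python feature that writes a decimal numeric character reference for every character the target encoding cannot represent.

```python
# Sample text: "caf" + e-acute, then an em dash, then a Greek capital omega.
text = "caf\u00e9 \u2014 \u03a9"

# e-acute exists in ISO-8859-1, so it is stored as a single byte.
# The em dash and omega do not, so they become decimal character references.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)

# UTF-8 can represent every character directly, so no references are needed.
print(text.encode("utf-8"))
```

A browser reading the ISO-8859-1 bytes should still show all three characters, because the references are plain ASCII and are resolved against the document character set rather than the file's encoding.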

The character encoding can be specified in the charset parameter of a meta tag in the <head> of an HTML document, for example:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

It is better to specify the character encoding in the HTTP header transmitted from a Web server, but this is not under the control of most writers.
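
For illustration only: the HTTP header concerned is Content-Type, and its value has the same general shape as the content attribute of the meta tag. The Python sketch below parses an example header value to show the two parts a browser reads from it. The header value is an assumed example, and email.message.Message is used here purely as a convenient standard-library parser for this kind of header.

```python
from email.message import Message

# An example header value, as a Web server might send it (assumed for illustration).
header_value = "text/html; charset=utf-8"

# Message is used only as a parser for the header value.
msg = Message()
msg["Content-Type"] = header_value

print(msg.get_content_type())     # the media type part of the header
print(msg.get_content_charset())  # the charset parameter of the header
```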

Character encoding is also referred to by other names, including character encoding scheme, character coding, charset, coded character set, encoding and transmission character set.

Encoding problems

It is good practice to use the same encoding at all stages in document production, from text editing to display in a Web browser. If you see characters correctly at one part of the process but incorrectly at another part, then you are almost certainly not using the same encoding throughout.

Text displayed as intended:

[Image: French text sample]

Text encoded as UTF-8 but displayed as ISO-8859-1:

[Image: French text encoded as UTF-8 but wrongly displayed with ISO-8859-1 encoding]

Text encoded as ISO-8859-1 but displayed as UTF-8:

[Image: French text encoded as ISO-8859-1 but wrongly displayed with UTF-8 encoding]
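
Both kinds of mismatch can be reproduced with a short Python sketch. The sample word is chosen for illustration and is not taken from the original screenshots; the variable names are likewise invented for this example.

```python
# The French word for "summer": e-acute, t, e-acute.
sample = "\u00e9t\u00e9"

# Case 1: saved as UTF-8, then read as ISO-8859-1.
# Each e-acute is two bytes in UTF-8, and each byte is shown as its own character.
utf8_read_as_latin1 = sample.encode("utf-8").decode("iso-8859-1")

# Case 2: saved as ISO-8859-1, then read as UTF-8.
# The lone byte 0xE9 is not valid UTF-8 here, so a replacement character
# (U+FFFD) is substituted for it because errors="replace" is requested.
latin1_read_as_utf8 = sample.encode("iso-8859-1").decode("utf-8", errors="replace")

print(utf8_read_as_latin1)
print(latin1_read_as_utf8)
```

In the first case every accented letter turns into a pair of unrelated Latin characters; in the second, each accented letter is lost and a placeholder symbol typically appears instead.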


Macintosh HTML editors

There are no HTML editors that make use of Mac OS 9’s built-in support for Unicode TrueType fonts, so Mac users are restricted to typing in languages for which Language Kits are available.

Microsoft’s Word 98 and Word 2001 word processors running under Mac OS 9 can use one or more Language Kits to produce multilingual HTML documents with UTF-8 character encoding. These documents include specified fonts, but they still seem to display correctly in Internet Explorer and Netscape browsers on Windows systems that have alternative fonts for the appropriate scripts.

BBEdit 6 is a text editor with many facilities to help produce HTML documents. The editing screen always has the HTML tags visible, and so you have to use a Web browser for previewing pages. It can be used to produce Web pages containing any of the left-to-right scripts for which there are Language Kits, but not Arabic or Hebrew. Documents can contain several scripts, but only one script at a time is displayed correctly. It can open and save HTML files with UTF-8 character encoding.

Netscape Composer 7 can produce and edit HTML files in any of the languages and scripts for which Language Kits are installed. It can work in WYSIWYG or text modes, and it can open and save files with UTF-8 character encoding.

Muwse (formerly called Unisite) is an HTML editor that can display simultaneously any of the languages and scripts for which Language Kits are installed. It uses Web browsers for previewing pages. It can save HTML files with UTF-8 character encoding.

Adobe GoLive 5 is an HTML editor that can display simultaneously any of the left-to-right languages and scripts for which Language Kits are installed. It can work in WYSIWYG and code visible modes, and can open and save HTML files with UTF-8 character encoding.

Adobe InDesign 1.5 can export files as HTML with UTF-8 character encoding, but it does not support the fonts or keyboard drivers in the Language Kits. It can use Unicode Macintosh TrueType fonts (such as the Tahoma font supplied with Office 98 and Word 98), but characters not in MacRoman have to be entered from the Select Character ... dialog box on the Type menu.

Other WYSIWYG HTML editors (such as Dreamweaver 4 and Freeway 3) that run under Mac OS 9 are not able to use UTF-8 character encoding. They do support some encodings for non-Latin scripts, but these restrict you to typing in Latin plus one other script, such as Cyrillic, Chinese, Japanese or Korean. They do not support Arabic or Hebrew. The multilingual word processor Nisus Writer can save files as HTML, but not with UTF-8 character encoding.

Windows HTML editors

Windows 95, Windows 98, Windows ME, Windows NT, Windows 2000 and Windows XP all support Unicode TrueType fonts and so can display almost any character on-screen. They also allow you to type in more scripts than Mac OS 9; Windows XP has the most keyboard drivers.

There are four well-known WYSIWYG HTML editors for Windows that can use UTF-8 character encoding: Macromedia’s Dreamweaver MX, Adobe’s GoLive, Microsoft’s FrontPage and Namo WebEditor. All four programs can use any keyboard driver that is available under Windows. GoLive 5 can use the Visual Keyboards (useful if you are not very familiar with a particular keyboard layout), but not the Global IMEs that allow you to type in Chinese, Japanese and Korean. FrontPage 2003 and Namo WebEditor support the Global IMEs and the Visual Keyboards. The Composer component of Mozilla also provides WYSIWYG editing of multi-script pages, supports the Global IMEs and Visual Keyboards, and can save files with UTF-8 encoding.

Microsoft’s Word 97 and Word 2000 allow you to type in several scripts and to save multilingual documents in HTML format with UTF-8 encoding. Word 2000 (but not Word 97) supports the Global IMEs and the Visual Keyboards. Support for Thai and some Indian languages is available in Word 2002. If you want to use HTML files from Word 2000 on a Web site, be sure to obtain the HTML Filter 2.0 from the Microsoft Office Download Center to remove the Office-specific markup tags; Word 2002 has this filter built in.

Microsoft’s Internet Explorer 5, 5.5 and 6 Web browsers can convert HTML documents into UTF-8 character encoding, provided that the original file uses one of the encodings on the View > Encoding menu. Simply display the file, make sure that the encoding and the display are correct, and then choose Save As on the File menu.

[Image: Saving a page in UTF-8 encoding]
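
The same kind of conversion can also be done with a short script instead of a browser. The following Python sketch is one possible approach, not the method described above: the function name convert_to_utf8 and the file names in the usage comment are hypothetical, and the source encoding is assumed to be known in advance.

```python
from pathlib import Path


def convert_to_utf8(source: Path, target: Path, source_encoding: str) -> None:
    """Re-encode a text file from source_encoding to UTF-8.

    The source encoding must be known in advance; it is not detected here.
    Any charset declared inside the file (for example in a meta tag) is not
    rewritten and would need to be updated separately.
    """
    text = source.read_text(encoding=source_encoding)
    target.write_text(text, encoding="utf-8")


# Hypothetical file names, for illustration only:
# convert_to_utf8(Path("page-latin1.html"), Path("page-utf8.html"), "iso-8859-1")
```

As with the browser method, the result is only correct if the source encoding is named correctly, so it is worth checking the converted page in a browser afterwards.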

HomeSite 5 can open and save files with UTF-8 character encoding, but only supports typing with the ANSI character set.

Web browsers

After creating your multilingual Web pages, you should test them on as many browsers and operating systems as you can. Nearly all of the current Web browsers include Unicode support and can therefore display text simultaneously in several scripts and languages, provided that suitable fonts are installed and that the browser has been configured to use them. More recent browsers are able to use characters from more than one font in order to display a multilingual page correctly.

The three major Web browsers for Windows include support for Unicode TrueType fonts and so can display almost any character that you are likely to find on a Web page. The Mozilla browser probably has the best Unicode support, and can mix characters from several fonts in order to display almost any character. Internet Explorer 6 and Opera 7 are not quite as good at selecting characters from multiple fonts.

Web browsers for Mac OS 9 are not as good as those for Windows at displaying multilingual pages; they do not include support for Unicode fonts and they can only display characters that are supported by an installed Apple Language Kit. Internet Explorer 5 and Netscape 4.8 do not support Arabic, Hebrew or Thai. Mozilla 1.2 and Netscape 7 support Arabic, Hebrew and Thai, and iCab 2 Preview supports Arabic, Devanagari, Gujarati, Gurmukhi, Hebrew and Thai. Opera 6 supports most left-to-right scripts.

For Mac OS X 10, Mozilla, Netscape 7, Safari and Opera 7.5 support most scripts, including Arabic and Hebrew. Internet Explorer 5.2 has the same restrictions as the OS 9 version.


Copyright © 1999–2007 Alan Wood

Created 3rd February 1999   Last updated 18th January 2007

HTML 4.01