Reference number ISO/IEC TR 19769:2004(E)
© ISO/IEC 2004

TECHNICAL REPORT
ISO/IEC TR 19769
First edition 2004-07-15

Information technology
Programming languages, their environments and system software interfaces
Extensions for the programming language C to support new character data types

Technologies de l'information
Langages de programmation, leurs environnements et interfaces de logiciel système
Extensions pour que le langage de programmation C supporte des types de données de caractères nouveaux

Copyright International Organization for Standardization. Reproduced by IHS under license with ISO. Not for resale. No reproduction or networking permitted without license from IHS.

PDF disclaimer

This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area.

Adobe is a trademark of Adobe Systems Incorporated.

Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

© ISO/IEC 2004
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56
CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 The new typedefs
4 Encoding
5 String literals and character constants
5.1 String literals and character constants notations
5.2 The string concatenation
6 Library functions
6.1 The mbrtoc16 function
6.2 The c16rtomb function
6.3 The mbrtoc32 function
6.4 The c32rtomb function
7 ANNEX A Unicode encoding forms: UTF-16, UTF-32
Foreword

ISO (the International Organization for Standardization) and IEC (the International Electrotechnical Commission) form the specialized system for worldwide standardization. National bodies that are members of ISO or IEC participate in the development of International Standards through technical committees established by the respective organization to deal with particular fields of technical activity. ISO and IEC technical committees collaborate in fields of mutual interest. Other international organizations, governmental and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC 1.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of the joint technical committee is to prepare International Standards. Draft International Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication as an International Standard requires approval by at least 75 % of the national bodies casting a vote.

In exceptional circumstances, the joint technical committee may propose the publication of a Technical Report of one of the following types:

    type 1, when the required support cannot be obtained for the publication of an International Standard, despite repeated efforts;
    type 2, when the subject is still under technical development or where for any other reason there is the future but not immediate possibility of an agreement on an International Standard;
    type 3, when the joint technical committee has collected data of a different kind from that which is normally published as an International Standard ("state of the art", for example).

Technical Reports of types 1 and 2 are subject to review within three years of publication, to decide whether they can be transformed into International Standards. Technical Reports of type 3 do not necessarily have to be reviewed until the data they provide are considered to be no longer valid or useful.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.

ISO/IEC TR 19769, which is a Technical Report of type 2, was prepared by Joint Technical Committee ISO/IEC JTC 1, Information technology, Subcommittee SC 22, Programming languages, their environments and system software interfaces.
Introduction

The C language has evolved over the last decades, various code pages and multibyte libraries have been introduced, and extended character set support has been introduced; however, the support for extended character data types in the C language is still limited.

Today, the introduction and the success of the Unicode/ISO10646 standard and of its implementation in modern computer languages create ever increasing demands on the C language to give Unicode/ISO10646 better support.

This paper addresses the introduction of new extended character data types in the C language in order to support future character encoding forms, including Unicode/ISO10646.

The Unicode standard supports 3 encoding forms:

    UTF-8
    UTF-16
    UTF-32

Each encoding form has advantages and disadvantages, so the choice of the encoding form should be left to the application.

Currently, some C applications implement UTF-8 using char type, UTF-16 using unsigned short or wchar_t, and UTF-32 using unsigned long or wchar_t. The current situation, however, faces the following major problems:

    The size of wchar_t is implementation defined. While wchar_t offers a form of platform portability for C applications, Unicode offers the possibility to write platform independent applications using a platform independent data format.
    There is no string literal for 16- or 32-bit based integer types, but the Unicode encoding forms require string literals.

It is sensible to give all the Unicode encoding forms appropriate data type support. UTF-8 is normally considered as the preferred multibyte encoding, for sequences of one or more elements of type char. This paper suggests the implementation of 16 and 32 bit character data types: char16_t and char32_t.

The new data types guarantee program portability through clearly defined character widths. The encoding of the new data types should be as generic as possible in order to support not only Unicode but also other character encodings.

It is generally desirable that C applications process entire strings at once rather than process individual characters in isolation. This paper does not specify the detail of library functions for the new data types, except one set of character conversion functions.
1 Scope

This Technical Report specifies two extended character data types as an extension to the programming language C, specified by the International Standard ISO/IEC 9899:1999.

2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO/IEC 9899:1999, Programming Languages - C

ISO/IEC 10646-1:2000, Information technology - Universal multiple-octet coded character set (UCS) - Part 1: Architecture and Basic Multilingual Plane

3 The new typedefs

This Technical Report introduces the following two new typedefs, char16_t and char32_t:

    typedef T1 char16_t;
    typedef T2 char32_t;

where T1 has the same type as uint_least16_t and T2 has the same type as uint_least32_t.

The new typedefs guarantee certain widths for the data types, whereas the width of wchar_t is implementation defined. The data values are unsigned, while char and wchar_t could take signed values.

This Technical Report also introduces the new header:

    <uchar.h>

The new typedefs, char16_t and char32_t, are defined in <uchar.h>.
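The guarantees stated in clause 3 (a minimum width and unsigned values) can be checked with a short sketch such as the one below. It assumes an implementation that provides the <uchar.h> header described in this Technical Report. The function names are hypothetical helpers chosen for illustration and are not part of the Technical Report.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <uchar.h>  /* header introduced by this Technical Report */

/* Hypothetical helper: 1 if char16_t and char32_t are at least
   16 and 32 bits wide, respectively. */
int widths_are_sufficient(void)
{
    return sizeof(char16_t) * CHAR_BIT >= 16
        && sizeof(char32_t) * CHAR_BIT >= 32;
}

/* Hypothetical helper: 1 if both types are unsigned, in which case
   converting -1 yields the largest value of the type, which is positive. */
int types_are_unsigned(void)
{
    return (char16_t)-1 > 0 && (char32_t)-1 > 0;
}

/* Hypothetical helper: 1 if the typedefs have the same size as
   uint_least16_t and uint_least32_t, the types they are defined from. */
int sizes_match_least_types(void)
{
    return sizeof(char16_t) == sizeof(uint_least16_t)
        && sizeof(char32_t) == sizeof(uint_least32_t);
}

/* Aborts through assert if any of the checks above fails. */
void check_new_typedefs(void)
{
    assert(widths_are_sufficient());
    assert(types_are_unsigned());
    assert(sizes_match_least_types());
}
```

Calling check_new_typedefs() once at program start would be one way to use the sketch. Note that the checks say nothing about wchar_t, whose width and signedness remain implementation defined.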
4 Encoding

C99 subclause 6.10.8 specifies that the value of the macro __STDC_ISO_10646__ shall be "an integer constant of the form yyyymmL (for example, 199712L), intended to indicate that values of type wchar_t are the coded representations of the characters defined by ISO/IEC 10646, along with all amendments and technical corrigenda as of the specified year and month." C99 subclause 6.4.5p5 specifies that wide string literals are initialized with a sequence of wide characters as defined by the mbstowcs function with an implementation-defined current locale.

Analogous to this macro, this Technical Report introduces two new macros.

If the header <uchar.h> defines the macro __STDC_UTF_16__, values of type char16_t shall have UTF-16 encoding. This allows the use of UTF-16 in char16_t even when wchar_t uses a non-Unicode encoding. In certain cases the compile-time conversion to UTF-16 may be restricted to members of the basic character set and universal character names (\Unnnnnnnn and \unnnn) because for these the conversion to UTF-16 is defined unambiguously.

If the header <uchar.h> defines the macro __STDC_UTF_32__, values of type char32_t shall have UTF-32 encoding.

If the header <uchar.h> does not define the macro __STDC_UTF_16__, the encoding of char16_t is implementation defined. Similarly, if the header <uchar.h> does not define the macro __STDC_UTF_32__, the encoding of char32_t is implementation defined.

An implementation may define other macros to indicate a different encoding.

5 String literals and character constants

5.1 String literals and character constants notations

The notations for string literals and character constants for char16_t are defined analogously to the wide character string literals and wide character constants: u"s-char-sequence" denotes a char16_t type string literal and initializes an array of char16_t. The corresponding character constant is denoted by u'c-char-sequence' and has the type char16_t. Likewise, the string literal and character constant for char32_t are U"s-char-sequence" and U'c-char-sequence'.

5.2 The string concatenation

String literals with the new format can be concatenated. If both strings have the same format, the resulting concatenated string has that format. If one string has no prefix, it is treated as a string of the same format as the other operand. (u"str" and U"str") Any other concatenations are implementation-defined (they might or might not be supported).

Here are some examples of valid concatenations:

    u"a" u"b"  →  u"ab"
    U"a" U"b"  →  U"ab"
    L"a" L"b"  →  L"ab"
    u"a" "b"   →  u"ab"
    U"a" "b"   →  U"ab"
    L"a" "b"   →  L"ab"
    "a" u"b"   →  u"ab"
    "a" U"b"   →  U"ab"
    "a" L"b"   →  L"ab"
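The notations of subclause 5.1 and the concatenation rules of subclause 5.2 can be exercised with a sketch like the following. It assumes a compiler that accepts the u and U prefixes and provides <uchar.h>; these notations were later adopted in C11, so a C11 or newer compiler mode should suffice. The helper functions and array names are hypothetical and serve only as illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>  /* assumed to provide char16_t and char32_t */

/* Hypothetical helper: 1 if two zero-terminated char16_t strings are equal. */
int c16_equal(const char16_t *a, const char16_t *b)
{
    while (*a != 0 && *a == *b) { a++; b++; }
    return *a == *b;
}

/* Hypothetical helper: 1 if two zero-terminated char32_t strings are equal. */
int c32_equal(const char32_t *a, const char32_t *b)
{
    while (*a != 0 && *a == *b) { a++; b++; }
    return *a == *b;
}

/* Both operands carry the same prefix: the result keeps that format. */
static const char16_t same16[] = u"a" u"b";
static const char32_t same32[] = U"a" U"b";

/* One operand has no prefix: it takes the format of the other operand. */
static const char16_t mixed16[] = u"a" "b";
static const char32_t mixed32[] = "a" U"b";

/* 1 if every concatenation above equals the single literal "ab"
   written with the corresponding prefix. */
int concatenations_match(void)
{
    return c16_equal(same16, u"ab") && c16_equal(mixed16, u"ab")
        && c32_equal(same32, U"ab") && c32_equal(mixed32, U"ab");
}

/* 1 if the character constants have the sizes of char16_t and char32_t,
   as expected from the types stated in subclause 5.1. */
int constant_sizes_match(void)
{
    return sizeof(u'a') == sizeof(char16_t)
        && sizeof(U'a') == sizeof(char32_t);
}
```

The sketch deliberately avoids mixing u, U and L prefixes in one concatenation, because the text above leaves such combinations implementation-defined.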
6 Library functions

Generally speaking, it is desirable to free C applications from character-based operations and to encourage string-based operations. Details of the library for the new character data types are left to future enhancements of the C standard. This Technical Report specifies only the four minimum character conversions among 3 character data types: char, char16_t and char32_t.

6.1 The mbrtoc16 function

Synopsis

    #include <uchar.h>
    size_t mbrtoc16(char16_t * restrict pc16, const char * restrict s, size_t n, mbstate_t * restrict ps);

Description

If s is a null pointer, the mbrtoc16 function is equivalent to the call:

    mbrtoc16(NULL, "", 1, ps)

In this case, the values of the parameters pc16 and n are ignored.

If s is not a null pointer, the mbrtoc16 function inspects at most n bytes beginning with the byte pointed to by s to determine the number of bytes needed to complete the next multibyte character (including any shift sequences). If the function determines that the next multibyte character is complete and valid, it determines the value of the corresponding wide character and then, if pc16 is not a null pointer, stores that value in the object pointed to by pc16. If the corresponding wide character is the null wid