BS ISO 24614-1:2010 Language resource management. Word segmentation of written texts. Basic concepts and general principles
BSI Standards Publication

BS ISO 24614-1:2010

Language resource management - Word segmentation of written texts
Part 1: Basic concepts and general principles

raising standards worldwide

NO COPYING WITHOUT BSI PERMISSION EXCEPT AS PERMITTED BY COPYRIGHT LAW

BRITISH STANDARD

National foreword

This British Standard is the UK implementation of ISO 24614-1:2010.

The UK participation in its preparation was entrusted to Technical Committee TS/1, Terminology.

A list of organizations represented on this committee can be obtained on request to its secretary.

This publication does not purport to include all the necessary provisions of a contract. Users are responsible for its correct application.

© BSI 2010

ISBN 978 0 580 66210 2

ICS 01.140.10

Compliance with a British Standard cannot confer immunity from legal obligations.

This British Standard was published under the authority of the Standards Policy and Strategy Committee on 30 November 2010.

Amendments issued since publication

Date    Text affected

INTERNATIONAL STANDARD ISO 24614-1
First edition 2010-11-01
Reference number ISO 24614-1:2010(E)

Language resource management - Word segmentation of written texts - Part 1: Basic concepts and general principles

Gestion des ressources langagières - Segmentation des mots dans les textes écrits - Partie 1: Notions fondamentales et principes généraux

PDF disclaimer

This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area.

Adobe is a trademark of Adobe Systems Incorporated.

Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

COPYRIGHT PROTECTED DOCUMENT

© ISO 2010

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56, CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org

Published in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Terms and definitions
3 Basic framework for word segmentation
4 General principles of word segmentation
Annex A (informative) Representing word segmentation in XML
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 24614-1 was prepared by Technical Committee ISO/TC 37, Terminology and other language and content resources, Subcommittee SC 4, Language resource management.

ISO 24614 consists of the following parts, under the general title Language resource management - Word segmentation of written texts:

Part 1: Basic concepts and general principles
Part 2: Word segmentation for Chinese, Japanese and Korean

Word segmentation for other languages is to form the subject of a future Part 3.

Introduction

Word segmentation is the dividing of text into linguistic units that carry meaning. For example, “the white house” can be divided into three meaningful units, “the,” “white,” and “house”, when it refers to a house that is white; whereas “the White House” corresponds to only one meaningful unit when it refers to the residence of the US President. For the purposes of ISO 24614, such meaningful linguistic units are called word segmentation units (WSU).

As demonstrated in the previous example, a WSU can be comprised of more than one word. A WSU can consist of a stem and affixes (e.g. “re+work+ing”). It can be a compound word (e.g. “blackboard”), a proper noun (e.g. “Cape Town”), an idiom (e.g. “It's raining cats and dogs”), or a multiword expression (e.g. “take care of”).

For languages that have spaces between words, such as English, segmenting a text into WSU is facilitated by using the spaces as a basis for establishing the boundaries of a WSU, although additional considerations need to be taken into account for handling abbreviations, punctuation and multiword units of meaning, among others. For languages that do not have spaces between words, such as Chinese and Japanese, or for languages that have spaces partially between words, such as Thai and Korean, segmenting a text into WSU requires a different approach. Furthermore, word segmentation is complex for languages that are characterized by extensive compounding, such as Chinese, and for languages that are characterized by extensive agglutination, such as Japanese, Korean and Hungarian. On the other hand, the fact that Japanese supports multiple scripts is beneficial for word segmentation.

However, white space alone is not sufficient to segment a text. “Apple pie,” for example, is understood as a kind of pie made of apples, so “apple” and “pie” are treated as two distinct WSUs. Alternatively, it can be viewed as a single entity due to its collocational and idiomatic properties, and treated as a single WSU. Segmentation rules can differ between languages, even when applied to equivalent expressions (as discussed in ISO 24614-2). Elaborating standards for the rules and methods for word
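
The Introduction's observation that white space alone is not sufficient can be illustrated with a minimal Python sketch. It is not a method prescribed by ISO 24614: the function name `segment`, the lexicon `MULTIWORD_WSUS` and the greedy longest-match strategy are all assumptions made for illustration. The sketch splits on white space and then merges token runs that appear in a small, hypothetical lexicon of multiword WSUs. Matching is case-sensitive so that “the white house” and “the White House” are treated differently, and punctuation handling is deliberately left out.

```python
from typing import List, Set, Tuple

# Hypothetical lexicon of multiword WSUs; the entries are illustrative only
# and are taken from the examples in the Introduction.
MULTIWORD_WSUS: Set[Tuple[str, ...]] = {
    ("White", "House"),
    ("Cape", "Town"),
    ("take", "care", "of"),
}


def segment(text: str, lexicon: Set[Tuple[str, ...]] = MULTIWORD_WSUS) -> List[str]:
    """Split on white space, then greedily merge the longest lexicon match."""
    tokens = text.split()
    max_len = max((len(entry) for entry in lexicon), default=1)
    units: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, down to runs of two tokens.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in lexicon:
                units.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multiword entry starts here: the token is its own WSU.
            units.append(tokens[i])
            i += 1
    return units


print(segment("the white house"))  # ['the', 'white', 'house']
print(segment("the White House"))  # ['the', 'White House']
```

Whether “apple pie” comes out as one WSU or two depends entirely on whether it is listed in the lexicon, which mirrors the point that the same expression can be segmented either way depending on the rules adopted.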