ISO 28500:2009 Information and documentation - WARC file format
Reference number ISO 28500:2009(E)
© ISO 2009

INTERNATIONAL STANDARD ISO 28500
First edition 2009-05-15

Information and documentation - WARC file format
Information et documentation - Format de fichier WARC

PDF disclaimer

This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area.

Adobe is a trademark of Adobe Systems Incorporated.

Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

COPYRIGHT PROTECTED DOCUMENT

© ISO 2009
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56
CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
3.2 Abbreviated terms
4 File and record model
5 Named fields
5.1 General
5.2 WARC-Record-ID (mandatory)
5.3 Content-Length (mandatory)
5.4 WARC-Date (mandatory)
5.5 WARC-Type (mandatory)
5.6 Content-Type
5.7 WARC-Concurrent-To
5.8 WARC-Block-Digest
5.9 WARC-Payload-Digest
5.10 WARC-IP-Address
5.11 WARC-Refers-To
5.12 WARC-Target-URI
5.13 WARC-Truncated
5.14 WARC-Warcinfo-ID
5.15 WARC-Filename
5.16 WARC-Profile
5.17 WARC-Identified-Payload-Type
5.18 WARC-Segment-Number
5.19 WARC-Segment-Origin-ID
5.20 WARC-Segment-Total-Length
6 WARC record types
6.1 General
6.2 warcinfo
6.3 response
6.4 resource
6.5 request
6.6 metadata
6.7 revisit
6.8 conversion
6.9 continuation
7 Record segmentation
8 Registration of MIME media types application/warc and application/warc-fields
8.1 General
8.2 application/warc
8.3 application/warc-fields
9 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 28500 was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee SC 4, Technical interoperability.

Introduction

Websites and web pages emerge and disappear from the World Wide Web every day. For the past ten years, memory storage organizations have tried to find the most appropriate ways to collect and keep track of this vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program that browses the web in an automated manner according to a set of policies; starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page (e.g. links to other pages, images, videos, scripting or style instructions, etc.), and adds them to the list of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents a challenge.

At the same time, those same organizations have a rising need to archive large numbers of digital files not necessarily captured from the web (e.g. entire series of electronic journals, or data generated by environmental sensing equipment). A general requirement that appears to be emerging is for a container format that permits one file simply and safely to carry a very large number of constituent data objects for the purpose of storage, management, and exchange. Those data objects (or resources) need to be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs only minimal knowledge of the nature of the objects.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC file format (ARC) that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC), whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive (IA).
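
The container model described in the Introduction (records concatenated into one long file, each made of simple text headers followed by an arbitrary data block) can be illustrated with a minimal Python sketch. It is not taken from the standard: the helper names build_record and read_records are hypothetical, the "WARC/1.0" version line, CRLF line endings and the two-CRLF record terminator are stated from general knowledge of the format, and the normative grammar is in clause 4, which this excerpt does not reproduce. Only the four fields listed as mandatory in the table of contents are written; fields that particular record types additionally require are omitted.

```python
import io
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(record_type: str, block: bytes) -> bytes:
    """Serialize one record: version line, named fields, blank line, block."""
    fields = [
        ("WARC-Type", record_type),
        # Assumption: a UUID URN in angle brackets is used as the record ID.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("Content-Length", str(len(block))),
    ]
    header = b"WARC/1.0" + CRLF
    for name, value in fields:
        header += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLFs end the record.
    return header + CRLF + block + CRLF + CRLF


def read_records(stream):
    """Yield (version, headers, block) for each record in a byte stream."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of file
        headers = {}
        for line in iter(stream.readline, CRLF):
            if not line:
                raise ValueError("file ended inside a record header")
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The block is read by length, so the reader never inspects its bytes.
        block = stream.read(int(headers["Content-Length"]))
        stream.read(4)  # skip the trailing CRLF CRLF
        yield version.strip().decode("ascii"), headers, block


if __name__ == "__main__":
    data = build_record("resource", b"hello") + build_record("metadata", b"\x00\x01")
    for version, headers, block in read_records(io.BytesIO(data)):
        print(version, headers["WARC-Type"], len(block))
```

Because the reader relies only on Content-Length to find the end of each block, the block may hold any binary data, which matches the remark that the container needs only minimal knowledge of the objects it carries.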