
ISO 28500:2009 Information and documentation - WARC file format


INTERNATIONAL STANDARD ISO 28500
First edition 2009-05-15
Reference number ISO 28500:2009(E)
© ISO 2009

Information and documentation - WARC file format
Information et documentation - Format de fichier WARC

PDF disclaimer

This PDF file may contain embedded typefaces. In accordance with Adobe's licensing policy, this file may be printed or viewed but shall not be edited unless the typefaces which are embedded are licensed to and installed on the computer performing the editing. In downloading this file, parties accept therein the responsibility of not infringing Adobe's licensing policy. The ISO Central Secretariat accepts no liability in this area.

Adobe is a trademark of Adobe Systems Incorporated.

Details of the software products used to create this PDF file can be found in the General Info relative to the file; the PDF-creation parameters were optimized for printing. Every care has been taken to ensure that the file is suitable for use by ISO member bodies. In the unlikely event that a problem relating to it is found, please inform the Central Secretariat at the address given below.

COPYRIGHT PROTECTED DOCUMENT

© ISO 2009
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56, CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
3.1 Terms and definitions
3.2 Abbreviated terms
4 File and record model
5 Named fields
5.1 General
5.2 WARC-Record-ID (mandatory)
5.3 Content-Length (mandatory)
5.4 WARC-Date (mandatory)
5.5 WARC-Type (mandatory)
5.6 Content-Type
5.7 WARC-Concurrent-To
5.8 WARC-Block-Digest
5.9 WARC-Payload-Digest
5.10 WARC-IP-Address
5.11 WARC-Refers-To
5.12 WARC-Target-URI
5.13 WARC-Truncated
5.14 WARC-Warcinfo-ID
5.15 WARC-Filename
5.16 WARC-Profile
5.17 WARC-Identified-Payload-Type
5.18 WARC-Segment-Number
5.19 WARC-Segment-Origin-ID
5.20 WARC-Segment-Total-Length
6 WARC record types
6.1 General
6.2 warcinfo
6.3 response
6.4 resource
6.5 request
6.6 metadata
6.7 revisit
6.8 conversion
6.9 continuation
7 Record segmentation
8 Registration of MIME media types application/warc and application/warc-fields
8.1 General
8.2 application/warc
8.3 application/warc-fields
9 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that
committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.

The main task of technical committees is to prepare International Standards. Draft International Standards adopted by the technical committees are circulated to the member bodies for voting. Publication as an International Standard requires approval by at least 75 % of the member bodies casting a vote.

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights.

ISO 28500 was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee SC 4, Technical interoperability.

Introduction

Websites and web pages emerge and disappear from the World Wide Web every day. For the past ten years, memory storage organizations have tried to find the most appropriate ways to collect and keep track of this vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program that browses the web in an automated manner according to a set of policies; starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page (e.g. links to other pages, images, videos, scripting or style instructions, etc.), and adds them to the list of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents a challenge.

At the same time, those same organizations have a rising need to archive large numbers of digital files not necessarily captured from the web (e.g. entire series of electronic journals, or data generated by environmental sensing equipment). A general requirement that appears to be emerging is for a container format that permits one file simply and safely to carry a very large number of constituent data objects for the purpose of storage, management, and exchange. Those data objects (or resources) need to be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs only minimal knowledge of the nature of the objects.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block, into one long file. The WARC format is an extension of the ARC file format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC), whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive (IA).

The California Digital Library and the Los Alamos National Laboratory also provided input on extending and generalizing the format.

The WARC format is expected to be a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It will be used to build applications for harvesting (such as the open source Heritrix web crawler), managing, accessing, and exchanging content. The way WARC files will be created and resources stored and rendered will depend on software and applications implementations.

Besides the primary content recorded in ARCs, the extended WARC format accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources. The extension may also be useful for more general applications than web archiving. To aid the development of tools that are backwards compatible, WARC content is clearly distinguishable from pre-revision ARC content.

The WARC file format is made sufficiently different from the legacy ARC format files so that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.

After the Internet Engineering Steering Group (IESG: http://www.ietf.org/iesg.html) approval, IANA (Internet Assigned Numbers Authority: http://www.iana.org/) is expected to register the WARC type "application/warc" using the application provided in this International Standard and following procedures defined in RFC2048.

1 Scope

This International Standard specifies the WARC file format:

- to store both the payload content and control information from mainstream Internet application layer protocols, such as HTTP, DNS, and FTP;
- to store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding);
- to support data compression and maintain data record integrity;
- to store all control information from the harvesting protocol (e.g. request headers), not just response information;
- to store the results of data transformations linked to other stored data;
- to store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources);
- to be extended without disruption to existing functionality;
- to support handling of overly long records by truncation or segmentation, where desired.

2 Normative references

The following referenced documents are indispensable for the application of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies.

ISO 8601, Data elements and interchange formats - Information interchange - Representation of dates and times
[RFC1035] Mockapetris, P. Domain names - Implementation and specification. STD 13, November 1987. Available at: http://www.faqs.org/rfcs/rfc1035.html
[RFC1884] Hinden, R. and Deering, S. IP Version 6 Addressing Architecture. December 1995. Available at: http://www.faqs.org/rfcs/rfc1884.html
[RFC2045] Freed, N. and Borenstein, N. Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies. November 1996. Available at: http://www.faqs.org/rfcs/rfc2045
[RFC2540] Eastlake, D. Detached Domain Name System (DNS) Information. March 1999. Available at: http://www.faqs.org/rfcs/rfc2540.html
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and Berners-Lee, T. Hypertext Transfer Protocol - HTTP/1.1. June 1999 (TXT, PS, PDF, HTML, XML). Available at: http://www.faqs.org/rfcs/rfc2616.html
[RFC2822] Resnick, P. (ed.) Internet Message Format. April 2001. Available at: http://www.faqs.org/rfcs/rfc2822
[RFC3629] Yergeau, F. UTF-8, a transformation format of ISO 10646. STD 63, November 2003. Available at: http://www.faqs.org/rfcs/rfc3629.html
[RFC3986] Berners-Lee, T., Fielding, R., Masinter, L. Uniform Resource Identifier (URI): Generic Syntax. STD 66, January 2005 (TXT, HTML, XML). Available at: http://www.faqs.org/rfcs/rfc3986.html
[RFC4027] Josefsson, S. Domain Name System Media Types. April 2005. Available at: http://www.faqs.org/rfcs/rfc4027.html
[W3CDTF] Date and Time Formats: note submitted to the W3C. 15 September 1997 (W3C profile of ISO 8601). Available at: http://www.w3.org/TR/NOTE-datetime

3 Terms, definitions and abbreviated terms

3.1 Terms and definitions

For the purposes of this document, the following terms and definitions apply.

3.1.1 WARC record
basic constituent of a WARC file, which consists of a sequence of WARC records

3.1.2 WARC record content block
part (zero or more octets) of a WARC record that follows the header and that forms the main body of a WARC record

3.1.3 WARC record payload
data object referred to, or contained by a WARC record as a meaningful subset of the content block

3.1.4 WARC record header
beginning of a WARC record, consisting of one first line declaring the record to be in the WARC format with a given version number, followed by lines of named fields up to a blank line

3.1.5 WARC named fields
set of elements consisting of a name, a colon, and a value, with long values continued on indented lines

3.1.6 WARC logical record
in the context of segmentation, a logical record may be composed of multiple segments, each represented by a WARC record

3.2 Abbreviated terms

ABNF        augmented Backus-Naur form
ARC         archive
CRLF        carriage return line feed
DNS         domain name system
FTP         file transfer protocol
HTTP        hypertext transfer protocol
IANA        Internet Assigned Numbers Authority
IESG        Internet Engineering Steering Group
RFC         request for comments
UR(I/L/N)   uniform resource (identifier/locator/name)
WARC        web archive

4 File and record model

A WARC format file is the simple concatenation of one or more WARC records. The first record usually describes the records to follow. In general, record content is either the direct result of a retrieval attempt (web pages, inline images, URL redirection information, DNS hostname lookup results, stand-alone files, etc.) or is synthesized material (e.g. metadata, transformed content) that provides additional information about archived content.

A WARC record shall consist of a record header followed by a record content block and two newlines. The WARC record header shall consist of one first line declaring the record to be in the WARC format with a given version number, then a variable number of line-oriented named fields terminated by a blank line. The WARC record header format shall follow the general rules of HTTP/1.1 [RFC2616] and [RFC2822] headers with one major exception: it shall also allow UTF-8 characters, as specified in [RFC3629].

The top-level view of a WARC file can be expressed in an ABNF grammar, reusing the augmented constructs defined in section 2.1 of HTTP/1.1 [RFC2616]. (In particular, note that to avoid the risk of confusion, where any WARC rule has the same name as an RFC2616 rule, the definition here has been made the same, except in the case of the CHAR rule, which in WARC includes multibyte UTF-8 characters.)

    warc-file   = 1*warc-record
    warc-record = header CRLF block CRLF CRLF
    header      = version warc-fields
    version     = "WARC/1.0" CRLF
    warc-fields = *named-field CRLF
    block       = *OCTET

The record version shall appear first in every record an
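
The record layout described by the grammar in clause 4 can be illustrated with a short Python sketch. It is not part of the standard and makes several assumptions: each named field sits on its own CRLF-terminated line and a single blank line ends the header; the Content-Length field (listed as mandatory in the contents) gives the block size in octets; and values continued on indented lines are not handled. The function names build_record and iter_records are hypothetical.

```python
from typing import Dict, Iterator, Tuple

CRLF = b"\r\n"
VERSION = "WARC/1.0"


def build_record(fields: Dict[str, str], block: bytes) -> bytes:
    """Serialize one record: version line, named fields, blank line, block, CRLF CRLF."""
    # Assumption: Content-Length carries the block size in octets.
    all_fields = {**fields, "Content-Length": str(len(block))}
    head = VERSION.encode("ascii") + CRLF
    for name, value in all_fields.items():
        head += f"{name}: {value}".encode("utf-8") + CRLF
    return head + CRLF + block + CRLF + CRLF


def iter_records(data: bytes) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (named fields, content block) for each record in a concatenation."""
    pos = 0
    while pos < len(data):
        # The header ends at the first blank line after the version line.
        header_end = data.index(CRLF + CRLF, pos)
        lines = data[pos:header_end].decode("utf-8").split("\r\n")
        if lines[0] != VERSION:
            raise ValueError(f"unexpected version line: {lines[0]!r}")
        fields: Dict[str, str] = {}
        for line in lines[1:]:
            # Split on the first colon only; values may contain colons.
            name, _, value = line.partition(":")
            fields[name.strip()] = value.strip()
        start = header_end + len(CRLF) * 2
        length = int(fields["Content-Length"])
        block = data[start:start + length]
        # Every record is closed by two CRLF pairs after the block.
        trailer = data[start + length:start + length + 4]
        if trailer != CRLF + CRLF:
            raise ValueError("record is not terminated by CRLF CRLF")
        yield fields, block
        pos = start + length + 4


if __name__ == "__main__":
    sample = build_record({"WARC-Type": "resource"}, b"hello\r\n\r\nworld")
    sample += build_record({"WARC-Type": "metadata"}, b"")
    for record_fields, record_block in iter_records(sample):
        print(record_fields["WARC-Type"], len(record_block))
```

The block is read by length rather than by searching for the trailing CRLF CRLF, because a content block is arbitrary octets and may itself contain that sequence.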

