
BS ISO 28500:2017 Information and documentation - WARC file format

    • Resource ID: 586892       File size: 2.03 MB        Pages: 36
    • Format: PDF

BS ISO 28500:2017
BSI Standards Publication

Information and documentation - WARC file format

National foreword

This British Standard is the UK implementation of ISO 28500:2017. It supersedes BS ISO 28500:2009, which is withdrawn.

The UK participation in its preparation was entrusted to Technical Committee IDT/2/7, Computer applications in Information and Documentation.

A list of organizations represented on this committee can be obtained on request to its secretary.

This publication does not purport to include all the necessary provisions of a contract. Users are responsible for its correct application.

© The British Standards Institution 2017
Published by BSI Standards Limited 2017

ISBN 978 0 580 95168 8
ICS 35.240.30

Compliance with a British Standard cannot confer immunity from legal obligations.

This British Standard was published under the authority of the Standards Policy and Strategy Committee on 30 September 2017.

Amendments/corrigenda issued since publication

Date    Text affected

INTERNATIONAL STANDARD ISO 28500
Second edition 2017-08

Information and documentation - WARC file format
Information et documentation - Format de fichier WARC

Reference number ISO 28500:2017(E)
© ISO 2017

COPYRIGHT PROTECTED DOCUMENT

© ISO 2017, Published in Switzerland

All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet, without prior written permission. Permission can be requested from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Ch. de Blandonnet 8, CP 401
CH-1214 Vernier, Geneva, Switzerland
Tel. +41 22 749 01 11
Fax +41 22 749 09 47
copyright@iso.org
www.iso.org

Contents

Foreword
Introduction
1 Scope
2 Normative references
3 Terms, definitions and abbreviated terms
4 File and record model
5 Named fields
  5.1 General
  5.2 WARC-Record-ID (mandatory)
  5.3 Content-Length (mandatory)
  5.4 WARC-Date (mandatory)
  5.5 WARC-Type (mandatory)
  5.6 Content-Type
  5.7 WARC-Concurrent-To
  5.8 WARC-Block-Digest
  5.9 WARC-Payload-Digest
  5.10 WARC-IP-Address
  5.11 WARC-Refers-To
  5.12 WARC-Refers-To-Target-URI
  5.13 WARC-Refers-To-Date
  5.14 WARC-Target-URI
  5.15 WARC-Truncated
  5.16 WARC-Warcinfo-ID
  5.17 WARC-Filename
  5.18 WARC-Profile
  5.19 WARC-Identified-Payload-Type
  5.20 WARC-Segment-Number
  5.21 WARC-Segment-Origin-ID
  5.22 WARC-Segment-Total-Length
6 WARC record types
  6.1 General
  6.2 warcinfo
  6.3 response
    6.3.1 General
    6.3.2 http and https schemes
    6.3.3 Other URI schemes
  6.4 resource
    6.4.1 General
    6.4.2 http and https schemes
    6.4.3 ftp scheme
    6.4.4 dns scheme
    6.4.5 Other URI schemes
  6.5 request
    6.5.1 General
    6.5.2 http and https schemes
    6.5.3 Other URI schemes
  6.6 metadata
  6.7 revisit
    6.7.1 General
    6.7.2 Profile: Identical Payload Digest
    6.7.3 Profile: Server Not Modified
    6.7.4 Other profiles
  6.8 conversion
  6.9 continuation
7 Record segmentation
8 WARC file name, size and compression
Annex A (informative) Use cases for writing WARC records
Annex B (informative) Examples of WARC records
Annex C (informative) WARC file size and name recommendations
Annex D (informative) Compression recommendations
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).

Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.

For an explanation on the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT) see the following URL: www.iso.org/iso/foreword.html.

This document was prepared by Technical Committee ISO/TC 46, Information and documentation, Subcommittee 4, Technical interoperability.

This second edition cancels and replaces the first edition (ISO 28500:2009), which has been technically revised.

Introduction

Websites and web pages emerge and disappear from the World Wide Web every day. For the past 10 years, memory storage organizations have tried to find the most appropriate ways to collect and keep track of this vast quantity of important material using web-scale tools such as web crawlers. A web crawler is a program that browses the web in an automated manner according to a set of policies; starting with a list of URLs, it saves each page identified by a URL, finds all the hyperlinks in the page (e.g. links to other pages, images, videos, scripting or style instructions, etc.), and adds them to the list of URLs to visit recursively. Storing and managing the billions of saved web page objects itself presents a challenge.

At the same time, those same organizations have a rising need to archive large numbers of digital files not necessarily captured from the web (e.g. entire series of electronic journals, or data generated by environmental sensing equipment). A general requirement that appears to be emerging is for a container format that permits one file simply and safely to carry a very large number of constituent data objects for the purpose of storage, management, and exchange. Those data objects (or resources) need to be of unrestricted type (including many binary types for audio, CAD, compressed files, etc.), but fortunately the container needs only minimal knowledge of the nature of the objects.

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC file format (ARC) that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file has been used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries.

The motivation to extend the ARC format arose from the discussion and experiences of the International Internet Preservation Consortium (IIPC), whose members include the national libraries of Australia, Canada, Denmark, Finland, France, Iceland, Italy, Norway, Sweden, The British Library (UK), The Library of Congress (USA), and the Internet Archive (IA). The California Digital Library and the Los Alamos National Laboratory also provided input on extending and generalizing the format.

The WARC format offers a standard way to structure, manage and store billions of resources collected from the web and elsewhere. It is used to build applications for harvesting, managing, accessing, mining and exchanging content. While it represents the unique standard format for web archives, it has been adopted beyond the web archiving community to store born-digital or digitized materials. The way WARC files will be created and resources stored and rendered will depend on software and applications implementations.

Besides the primary content recorded in ARCs, the extended WARC format accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, later-date transformations, and segmentation of large resources. The extension may also be useful for more general applications than web archiving. To aid the development of tools that are backwards compatible, WARC content is clearly distinguishable from pre-revision ARC content.

The WARC file format is made sufficiently different from the legacy ARC format files so that software tools can unambiguously detect and correctly process both WARC and ARC records; given the large amount of existing archival data in the previous ARC format, it is important that access and use of this legacy not be interrupted when transitioning to the WARC format.
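The Introduction describes a WARC file as a concatenation of records, each made of simple text headers followed by an arbitrary data block. The following Python sketch illustrates that idea under stated assumptions. The exact layout used (a `WARC/1.1` version line, CRLF-terminated named fields, a blank line, the block, then two CRLFs) is not spelled out in this excerpt and is assumed from the file and record model the standard defines in Clause 4. The helper names `build_record` and `read_records` and the `urn:example:blob` target are hypothetical. Only the four fields marked mandatory in the contents list are always written.

```python
import io
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_record(warc_type, block, extra_fields=None):
    """Serialize one record: version line, named fields, blank line, block, two CRLFs."""
    fields = {
        "WARC-Type": warc_type,
        # Assumption: record IDs are URIs in angle brackets; a random UUID URN is used here.
        "WARC-Record-ID": f"<urn:uuid:{uuid.uuid4()}>",
        # Assumption: timestamps are written in UTC, e.g. 2017-08-01T12:00:00Z.
        "WARC-Date": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "Content-Length": str(len(block)),
    }
    fields.update(extra_fields or {})
    header_lines = [b"WARC/1.1"]
    header_lines += [f"{name}: {value}".encode("utf-8") for name, value in fields.items()]
    return CRLF.join(header_lines) + CRLF + CRLF + block + CRLF + CRLF


def read_records(stream):
    """Yield (fields, block) pairs from a binary stream of concatenated records."""
    while True:
        version = stream.readline()
        if not version:
            return  # clean end of file
        if not version.startswith(b"WARC/"):
            raise ValueError(f"not a WARC record: {version!r}")
        fields = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b""):
                break  # blank line ends the header
            name, _, value = line.decode("utf-8").partition(":")
            fields[name.strip()] = value.strip()
        # The block is skipped by length, so its content never needs to be understood.
        block = stream.read(int(fields["Content-Length"]))
        stream.read(4)  # consume the two CRLFs that close the record
        yield fields, block


archive = build_record("warcinfo", b"software: example-writer\r\n") + build_record(
    "resource", bytes(range(256)), {"WARC-Target-URI": "urn:example:blob"}
)
records = list(read_records(io.BytesIO(archive)))
```

Because the reader relies only on `Content-Length`, the second record can carry every possible byte value, including CR and LF, without confusing the parser. This is one way to read the remark that the container needs only minimal knowledge of the objects it carries. The sketch ignores compression, header continuation lines and case-insensitive field names.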

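The crawler loop outlined in the Introduction (start from a list of URLs, save each page, extract its hyperlinks, add them to the list of URLs to visit) can be sketched as follows. This is a minimal illustration, not a description of any particular crawler. The `crawl` function, the `allowed_hosts` and `max_pages` policies, and the in-memory `FAKE_WEB` dictionary are all hypothetical, and the list of URLs is processed with a queue rather than literal recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect href/src attribute values (links to pages, images, scripts, styles)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def crawl(seeds, fetch, allowed_hosts, max_pages=100):
    """Visit URLs breadth-first and return {url: body} for every page saved."""
    frontier = deque(seeds)  # the list of URLs still to visit
    seen = set(seeds)        # URLs already queued, to avoid revisiting
    saved = {}
    while frontier and len(saved) < max_pages:
        url = frontier.popleft()
        body = fetch(url)
        if body is None:     # page could not be retrieved
            continue
        saved[url] = body
        parser = LinkExtractor()
        parser.feed(body)
        for link in parser.links:
            absolute = urljoin(url, link)
            # Hypothetical policy: stay on an explicit set of hosts.
            if urlparse(absolute).hostname in allowed_hosts and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return saved


# A tiny in-memory stand-in for the web, so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="/about">About</a><img src="logo.png">',
    "http://example.org/about": '<a href="/">Home</a><a href="http://other.example/">Other</a>',
    "http://example.org/logo.png": "",
}

saved = crawl(["http://example.org/"], FAKE_WEB.get, {"example.org"})
```

Passing `fetch` as a parameter keeps the traversal logic separate from retrieval, so a real HTTP client could be substituted for `FAKE_WEB.get` without changing the loop.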

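The Introduction states that tools can unambiguously tell WARC content from legacy ARC content. One simple way this might be done is to inspect the first bytes of an uncompressed file, as in the sketch below. The markers are assumptions from general knowledge of the two formats and are not given in this excerpt: a WARC record is assumed to open with a `WARC/` version line, and an ARC file with a `filedesc://` version block. The function name `sniff_format` and the sample inputs are hypothetical, and compressed files would need to be decompressed first.

```python
def sniff_format(head: bytes) -> str:
    """Classify the first bytes of an uncompressed file as 'warc', 'arc' or 'unknown'."""
    if head.startswith(b"WARC/"):
        return "warc"  # assumed WARC version line, e.g. b"WARC/1.1"
    if head.startswith(b"filedesc://"):
        return "arc"   # assumed start of the legacy ARC version block
    return "unknown"


samples = {
    "warc": b"WARC/1.1\r\nWARC-Type: warcinfo\r\n",
    "arc": b"filedesc://example.arc 0.0.0.0 19960923142103 text/plain 76\n",
    "unknown": b"<html><body>not an archive</body></html>",
}
detected = {label: sniff_format(data) for label, data in samples.items()}
```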