TECHNICAL REPORT ISO/TR 14873
First edition 2013-12-01

Information and documentation - Statistics and quality issues for web archiving
Information et documentation - Statistiques et indicateurs de qualité pour l'archivage du web

Reference number ISO/TR 14873:2013(E)
© ISO 2013

COPYRIGHT PROTECTED DOCUMENT
© ISO 2013
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized otherwise in any form or by any means, electronic or mechanical, including photocopying, or posting on the internet or an intranet,
without prior written permission. Permission can be requested from either ISO at the address below or ISO's member body in the country of the requester.

ISO copyright office
Case postale 56
CH-1211 Geneva 20
Tel. + 41 22 749 01 11
Fax + 41 22 749 09 47
E-mail copyright@iso.org
Web www.iso.org
Published in Switzerland

Contents
Foreword
Introduction
1 Scope
2 Terms and definitions
3 Methods and purposes of Web archiving
  3.1 Collecting methods
  3.2 Access and description methods
  3.3 Preservation methods
  3.4 Legal basis for Web archiving
  3.5 Additional reasons for Web archiving
4 Statistics
  4.1 General
  4.2 Statistics for collection development
  4.3 Collection characterization
  4.4 Collection usage
  4.5 Web archive preservation
  4.6 Measuring the costs of Web archiving
5 Quality indicators
  5.1 General
  5.2 Limitations
  5.3 Description
6 Usage and benefits
  6.1 General
  6.2 Intended usage and readers
  6.3 Benefits for user groups
  6.4 Use of proposed statistics by user groups
  6.5 Web archiving process with related performance indicators
Bibliography

Foreword

ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in
a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization.

The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see www.iso.org/directives).

Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see www.iso.org/patents).

Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement.

For an explanation on the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the WTO principles in the Technical Barriers to Trade (TBT) see the following URL: Foreword - Supplementary information

The committee responsible for this document is ISO/TC 46, Information and documentation, Subcommittee SC 8, Quality - Statistics and performance evaluation.

Introduction

This Technical Report was developed in response
to a worldwide demand for guidelines on the management and evaluation of Web archiving activities and products.

Web archiving refers to the activities of selecting, capturing, storing, preserving and managing access to snapshots of Internet resources over time. It started at the end of the 1990s, based on the vision that an archive of Internet resources would become a vital record for research, commerce and government in the future. Internet resources are regarded as part of the cultural heritage and therefore preserved like printed heritage publications. Many institutions involved in Web archiving see this as an extension of their long-standing mission of preserving their national heritage, and this is endorsed and enabled in many countries by legislative frameworks such as legal deposit.

There is a wide range of resources available on the Internet, including text, image, film, sound and other multimedia formats. In addition to interlinked Web pages, there are newsgroups, newsletters, blogs and interactive services such as games, made available using various transfer and communication protocols.

Web archives bring together copies of Internet resources, collected automatically by
harvesting software, usually at regular intervals. The intention is to replay the resources including the inherent relations, for example by means of hypertext links, as much as possible as they were in their original environment. The primary goal of Web archiving is to preserve a record of the Web in perpetuity, as closely as possible to its original form, for various academic, professional and private purposes.

Web archiving is a recent but expanding activity which continuously requires new approaches and tools in order to stay in sync with rapidly evolving Web technology. Determined by the strategic importance perceived by the archiving institution, means available and sometimes legal requirements, diverse approaches have been taken to archive Internet resources, ranging from capturing individual Web pages to entire top-level domains. From an organisational perspective, Web archiving
is also at different levels of maturity. While it has become a business-as-usual activity in some organisations, others have just initiated experimental programmes to explore the challenge.

Depending on the scale and purpose of collection, a distinction can be made between two broad categories of Web archiving strategy: bulk harvesting and selective harvesting. Large scale bulk harvesting, such as national domain harvesting, is intended to capture a snapshot of an entire domain (or a subset of it). Selective harvesting is performed on a much smaller scale, is more focused and undertaken more frequently, often based on criteria such as theme, event, format (e.g. audio or video files) or agreement with content owners.

A key difference between the two strategies lies in the level of quality control, the evaluation of harvested Websites to determine whether pre-defined quality standards are being attained. The scale of domain harvesting makes it impossible to carry out any manual visual comparison between the harvested and the live version of the resource, which is a common quality assurance method in selective harvesting.

This Technical Report aims to demonstrate how Web archives, as
part of a wider heritage collection, can be measured and managed in a similar and compliant manner based on traditional library workflows. The report addresses collection development, characterization, description, preservation, usage and organisational structure, showing that most aspects of the traditional collection management workflow remain valid in principle for Web archiving, although adjustment is required in practice.

While this Technical Report provides an overview of the current status of Web archiving, its focus is on the definition and use of Web archive statistics and quality indicators. The production of some statistics relies on the use of harvesting, indexing or browsing software, and a different choice of software may lead to variance in the results. This Technical Report, however, neither endorses nor recommends any software in particular. It provides a set of indicators
to help assess the performance and quality of Web archives in general.

This Technical Report should be considered as a work in progress. Some of its contents are expected to be incorporated in the future into ISO 2789 and ISO 11620.

1 Scope

This Technical Report defines statistics, terms and quality criteria for Web archiving. It considers the needs and practices across a wide range of organisations such as libraries, archives, museums, research centres and heritage foundations. The examples mentioned are taken from the library sector, because libraries, especially national libraries, have taken up the new task of Web archiving in the context of legal deposit. This should in no way be taken to undermine the important contributions of institutions which are not libraries. Neither does it reduce the principal applicability of this Technical Report for heritage institutions and archiving professionals.

This Technical Report is intended for professionals directly involved in Web archiving, often in mixed teams consisting of library or archive curators, engineers and managerial staff. It is also useful for Web archiving institutions' funding authorities and external stakeholders. The terminology used in this Technical Report attempts to reflect the wide range of interests and expertise of the audiences, striking a balance between computer science, management and librarianship.

This Technical Report does not consider the management of academic and commercial electronic resources, such as e-journals, e-newspapers or e-books, which are usually stored and processed separately using different management systems. They are regarded as Internet resources and are not addressed
in this Technical Report as distinct streams of content of Web archives.

Some organisations also collect electronic documents, which may be delivered through the Web, through publisher-based electronic deposits and repository systems. These too are out of scope for this Technical Report. The principles and techniques used for this kind of collecting are indeed very different from those of Web archiving; statistics and quality indicators relevant for one kind of method are not necessarily relevant for the other.

Finally, this Technical Report essentially focuses on Web archiving principles and
methods, and does not encompass alternative ways of collecting Internet resources. As a matter of fact, some Internet resources, especially those that are not distributed on the Web (e.g. newsletters distributed as e-mails), are not harvested by Web archiving techniques and are collected by other means that are neither described nor analysed in this Technical Report.

2 Terms and definitions

For the purposes of this document, the following terms and definitions apply.

2.1
access
successful request of a library-provided online service
Note 1 to entry: An access is one cycle of user activities that
typically starts when a user connects to a library-provided online service and ends by a terminating activity that is either explicit (by leaving the database through log-out or exit) or implicit (timeout due to user inactivity).
Note 2 to entry: Accesses to the library website are counted as virtual visits.
Note 3 to entry: Requests of a general entrance or gateway page are excluded.
Note 4 to entry: If possible, requests by search engines are excluded.
SOURCE: ISO 2789:2013, definition 2.2.1

2.2
access tool
specialist software used to find, retrieve and replay archived Internet resources
Note 1 to entry: This may be implemented by a number of separate software packages working together.

2.3
administrative metadata
information necessary to allow the proper management of the digital objects in a repository
Note 1 to entry: Administrative metadata can be divided into the following categories:
- context or provenance metadata: describe the lifecycle of a resource to a point, including the related entities and processes, e.g. configuration and log files;
- technical metadata: describe the technical
characteristics of a digital object, e.g. its format;
- rights metadata: define the ownership and the legally permitted usage of an object.

2.4
archive
Web archive
entire set of resources crawled from the Web over time, comprising one or more collections

2.5
bit stream
series of 0 and 1 digits that constitutes a digital file

2.6
budget (crawl)
limitation associated with a crawl or individual seeds, which can be expressed in e.g. number of files, volume of data, or the time to be spent per crawl as defined in the crawler settings

2.7
bulk crawl
bulk harvest
crawl aimed at collecting the entirety
of one or more top-level domains, or a subset thereof
Note 1 to entry: In comparison with selective crawls, bulk crawls have a wider scope and are typically performed less frequently.
Note 2 to entry: Bulk crawls generally result in large scale Web archives, making it impossible to conduct detailed quality assurance. Quality assurance is therefore often done through sampling.

2.8
capture
instance
copy of a resource crawled at a certain point in time
Note 1 to entry: If a resource has been crawled three times on different dates, there will be three captures.

2.9
collection
Web archive collection
cohesive resources presented as a group
Note 1 to entry: A collection can either be selected specifically prior to harvesting (e.g. an event, a topic) or pulled together retrospectively from available resources in the archive.
Note 2 to entry: A Web archive may consist of one or more collections.

2.10
crawl
harvest
process of browsing and copying resources using a crawler
Note 1 to entry: Crawls can be categorised as bulk or selective crawls.

2.11
crawl settings
crawl parameters
definition of which resources should be collected and the frequency and depth required for each set of seeds
Note 1 to entry: Crawl settings also include crawler politeness (number of requests per second or minute sent to the server hosting the resource), compliance with robots.txt and filters to exclude crawler traps.

2.12
crawler
harvester
archiving crawler
DEPRECATED: spider
software that will successively request URLs and parse the resulting resource for further URLs
Note 1 to entry: Resources may be stored and URLs discarded in accordance with a predefined set of rules; see crawl settings (2.11) and scope (crawl) (2.40).

2.13
crawler trap
Web page (or series thereof) which will cause a crawler to either crash or endlessly follow references to other resources deemed to be of little or no value
Note 1 to entry: Crawler traps could be put in place intentionally to prevent crawlers from harvesting resources. This cou