    Basic WWW Technologies.ppt

    • Resource ID: 378856       File size: 794.50 KB        Pages: 41
    • Format: PPT
    Slide 1: Basic WWW Technologies
      • 2.1 Web Documents
      • 2.2 Resource Identifiers: URI, URL, and URN
      • 2.3 Protocols
      • 2.4 Log Files
      • 2.5 Search Engines

    Slide 2: What Is the World Wide Web?
      The World Wide Web (web) is a network of information resources. The web relies on three mechanisms to make these resources readily available to the widest possible audience:
      1. A uniform naming scheme for locating resources on the web (e.g., URIs).
      2. Protocols for access to named resources over the web (e.g., HTTP).
      3. Hypertext for easy navigation among resources (e.g., HTML).

    Slide 3: Internet vs. Web
      • Internet: the more general term. It includes the physical aspect of the underlying networks and mechanisms such as email, FTP, and HTTP.
      • Web: associated with information stored on the Internet. The word also refers to a broader class of networks, e.g. the Web of English Literature.
      • Both the Internet and the web are networks.

    Slide 4: Essential Components of WWW
      • Resources: conceptual mappings to concrete or abstract entities, which do not change in the short term. Example: the ICS website (web pages and other kinds of files).
      • Resource identifiers (hyperlinks): strings of characters that represent generalized addresses and may contain instructions for accessing the identified resource. Example: http://www.ics.uci.edu identifies the ICS homepage.
      • Transfer protocols: conventions that regulate the communication between a browser (web user agent) and a server.

    Slide 5: Standard Generalized Markup Language (SGML)
      • Based on GML (Generalized Markup Language), developed by IBM in the 1960s.
      • An international standard (ISO 8879:1986) that defines how descriptive markup should be embedded in a document.
      • Gave birth to the Extensible Markup Language (XML), a W3C Recommendation in 1998.

    Slide 6: SGML Components
      SGML documents have three parts:
      • Declaration: specifies which characters and delimiters may appear in the application.
      • DTD / style sheet: defines the syntax of markup constructs.
      • Document instance: the actual text (with the tags) of the document.
      More information: http://www.W3.Org/markup/SGML

    Slide 7: DTD Example One
      <!ELEMENT UL - - (LI)+>
      • ELEMENT is a keyword that introduces a new element type, here the unordered list (UL).
      • The two hyphens indicate that both the start tag and the end tag for this element type are required.
      • Any text between the two tags is treated as a list item (LI).

    Slide 8: DTD Example Two
      <!ELEMENT IMG - O EMPTY>
      • The element type being declared is IMG.
      • The hyphen and the following "O" indicate that the end tag can be omitted.
      • Together with the content model "EMPTY", this is strengthened to the rule that the end tag must be omitted (no closing tag).

    Slide 9: HTML Background
      • HTML was originally developed by Tim Berners-Lee while at CERN, and popularized by the Mosaic browser developed at NCSA.
      • The Web depends on Web page authors and vendors sharing the same conventions for HTML. This has motivated joint work on specifications for HTML.
      • HTML standards are organized by the W3C: http://www.w3.org/MarkUp/

    Slide 10: HTML Functionalities
      HTML gives authors the means to:
      • Publish online documents with headings, text, tables, lists, photos, etc.
      • Include spreadsheets, video clips, sound clips, and other applications directly in their documents.
      • Link information via hypertext links, at the click of a button.
      • Design forms for conducting transactions with remote services, for use in searching for information, making reservations, ordering products, etc.

    Slide 11: HTML Versions
      • HTML 4.01 is a revision of the HTML 4.0 Recommendation first released on 18 December 1997. HTML 4.01 Specification: http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
      • HTML 4.0 was first released as a W3C Recommendation on 18 December 1997.
      • HTML 3.2 was W3C's first Recommendation for HTML, which represented the consensus on HTML features for 1996.
      • HTML 2.0 (RFC 1866) was developed by the IETF's HTML Working Group, which set the standard for core HTML features based upon current practice in 1994.

    Slide 12: Sample Webpage

    Slide 13: Sample Webpage HTML Structure
      <HTML>
      <HEAD>
      <TITLE>The title of the webpage</TITLE>
      </HEAD>
      <BODY>
      <P>Body of the webpage
      </BODY>
      </HTML>

    Slide 14: HTML Structure
      • An HTML document is divided into a head section (here, between <HEAD> and </HEAD>) and a body (here, between <BODY> and </BODY>).
      • The title of the document appears in the head (along with other information about the document).
      • The content of the document appears in the body. The body in this example contains just one paragraph, marked up with <P>.

    Slide 15: HTML Hyperlink
      Example link text: alumni
      • A link is a connection from one Web resource to another.
      • It has two ends, called anchors, and a direction.
      • It starts at the "source" anchor and points to the "destination" anchor, which may be any Web resource (e.g., an image, a video clip, a sound bite, a program, an HTML document).

    Slide 16: Resource Identifiers
      • URI: Uniform Resource Identifiers
      • URL: Uniform Resource Locators
      • URN: Uniform Resource Names

    Slide 17: Introduction to URIs
      Every resource available on the Web has an address that may be encoded by a URI. URIs typically consist of three pieces:
      • The naming scheme of the mechanism used to access the resource (HTTP, FTP).
      • The name of the machine hosting the resource.
      • The name of the resource itself, given as a path.

    Slide 18: URI Example
      http://www.w3.org/TR
      • There is a document available via the HTTP protocol,
      • residing on the machine hosting www.w3.org,
      • accessible via the path "/TR".

    Slide 19: Protocols
      • Describe how messages are encoded and exchanged.
      • Different layering architectures:
        • ISO OSI 7-layer architecture
        • TCP/IP 4-layer architecture

    Slide 20: ISO OSI Layering Architecture

    Slide 21: ISO's Design Principles
      • A layer should be created where a different level of abstraction is needed.
      • Each layer should perform a well-defined function.
      • The layer boundaries should be chosen to minimize information flow across the interfaces.
      • The number of layers should be large enough that distinct functions need not be thrown together in the same layer, and small enough that the architecture does not become unwieldy.

    Slide 22: TCP/IP Layering Architecture

    Slide 23: TCP/IP Layering Architecture
      • A simplified model that provides an end-to-end reliable connection.
      • The network layer:
        • Hosts drop packets into this layer, and the layer routes them towards the destination.
        • It only promises to "try its best" (best-effort delivery).
      • The transport layer:
        • Provides a reliable byte-oriented stream.

    Slide 24: Hypertext Transfer Protocol (HTTP)
      • A protocol used to carry WWW traffic between a browser and a server over a connection-oriented transport (TCP).
      • One of the application-layer protocols supported by the Internet; it runs on top of the transport layer.
      • HTTP communication is established via a TCP connection, by default to server port 80.
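The URI pieces from the "URI Example" slide and the port-80 convention from the HTTP slide can be tied together in a short Python sketch. It is only an illustration: the helper names `split_uri` and `build_get_request` are made up for this example, the `Connection: close` header is an arbitrary choice, and no network connection is opened; the function merely assembles the text of a GET request that a browser could send over TCP.

```python
from urllib.parse import urlsplit


def split_uri(uri: str) -> tuple[str, str, str]:
    """Return the three pieces named on the URI slides: scheme, host, path."""
    parts = urlsplit(uri)
    # An empty path means the root resource, which HTTP requests as "/".
    return parts.scheme, parts.hostname or "", parts.path or "/"


def build_get_request(uri: str) -> tuple[str, int, bytes]:
    """Build (host, port, request bytes) for a plain HTTP GET. Nothing is sent."""
    scheme, host, path = split_uri(uri)
    if scheme != "http":
        raise ValueError("this sketch only handles plain http URIs")
    # Port 80 is the default HTTP server port unless the URI names another one.
    port = urlsplit(uri).port or 80
    host_header = host if port == 80 else f"{host}:{port}"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"  # blank line ends the request headers
    ).encode("ascii")
    return host, port, request


if __name__ == "__main__":
    host, port, request = build_get_request("http://www.w3.org/TR")
    print(host, port)
    print(request.decode("ascii"))
```

To actually fetch the document, the returned bytes would be written to a TCP socket connected to `(host, port)`; that step is left out so the sketch stays runnable offline.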

    Slide 25: GET Method in HTTP

    Slide 26: Domain Name System
      • DNS (Domain Name System): maps domain names to IP addresses.
      • IPv4: initially deployed on January 1, 1983 and still the most commonly used version. Addresses are 32 bits, written as four decimal numbers separated by dots, ranging from 0.0.0.0 to 255.255.255.255.
      • IPv6: a revision of IPv4 with 128-bit addresses.

    Slide 27: Top Level Domains (TLD)
      Top-level domain names include .com, .edu, .gov, and the ISO 3166 country codes. There are three types of top-level domains:
      • Generic domains were created for use by the Internet public.
      • Country code domains were created to be used by individual countries.
      • The .arpa domain (Address and Routing Parameter Area) is designated to be used exclusively for Internet-infrastructure purposes.

    Slide 28: Registrars
      • Domain names ending with .aero, .biz, .com, .coop, .info, .museum, .name, .net, .org, or .pro can be registered through many different companies (known as "registrars") that compete with one another.
      • InterNIC at http:/
      • Registrars Directory: http:/

    Slide 29: Log Files
      • Server Transfer Log: transactions between a browser and a server are logged.
        • IP address and the time of the request
        • Method of the request (GET, HEAD, POST)
        • Status code, a response from the server
        • Size in bytes of the transaction
      • Referrer Log: where the request originated.
      • Agent Log: the browser software making the request (or a spider).
      • Error Log: requests that resulted in errors (e.g., 404).

    Slide 30: Server Log Analysis
      • Most and least visited web pages
      • Entry and exit pages
      • Referrals from other sites or search engines
      • What the searched keywords are
      • How many clicks/page views a page received
      • Error reports, such as broken links

    Slide 31: Server Log Analysis

    Slide 32: Search Engines
      • According to the Pew Internet Project Report (2002), search engines are the most popular way to locate information online.
      • About 33 million U.S. Internet users query search engines on a typical day.
      • More than 80% have used search engines.
      • Search engines are measured by coverage and recency.

    Slide 33: Coverage
      • Overlap analysis is used for estimating the size of the indexable web.
      • $W$: the set of webpages.
      • $W_a$, $W_b$: the pages crawled by two independent engines a and b.

      • $P(W_a)$, $P(W_b)$: the probabilities that a page was crawled by a or by b.
      • $P(W_a) = |W_a| / |W|$
      • $P(W_b) = |W_b| / |W|$

    Slide 34: Overlap Analysis
      • $P(W_a \cap W_b \mid W_b) = P(W_a \cap W_b) / P(W_b) = |W_a \cap W_b| / |W_b|$
      • If a and b are independent: $P(W_a \cap W_b) = P(W_a) \cdot P(W_b)$
      • Then $P(W_a \cap W_b \mid W_b) = P(W_a) \cdot P(W_b) / P(W_b) = P(W_a) = |W_a| / |W|$
      • Equating the two expressions gives $|W_a \cap W_b| / |W_b| = |W_a| / |W|$, so $P(W_a)$ can be estimated from the measurable overlap of the two engines.

    Slide 35: Overlap Analysis
      Using $|W| = |W_a| / P(W_a)$, the researchers found:
      • The web had at least 320 million pages in 1997.
      • 60% of the web was covered by six major engines.
      • The maximum coverage of a single engine was 1/3 of the web.

    Slide 36: How to Improve the Coverage?
      • Meta-search engine: dispatches the user query to several engines at the same time, then collects the results and merges them into one list for the user.
      • Any suggestions?

    Slide 37: Web Crawler
      • A crawler is a program that picks up a page and follows all the links on that page.
      • Crawler = Spider
      • Types of crawler:
        • Breadth first
        • Depth first

    Slide 38: Breadth First Crawlers
      • Use the breadth-first search (BFS) algorithm.
      • Get all links from the starting page, and add them to a queue.
      • Pick the 1st link from the queue, get all links on that page, and add them to the queue.
      • Repeat the above step till the queue is empty.

    Slide 39: Breadth First Crawlers

    Slide 40: Depth First Crawlers
      • Use the depth-first search (DFS) algorithm.
      • Get the 1st non-visited link from the start page.
      • Visit the link and get its 1st non-visited link.
      • Repeat the above step till there are no non-visited links.
      • Go to the next non-visited link in the previous level and repeat the 2nd step.

    Slide 41: Depth First Crawlers
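The two crawl orders described on the crawler slides can be compared with a minimal Python sketch. Everything concrete here is an assumption made for illustration: `LINKS` is a tiny hypothetical link graph standing in for the web, and `get_links` stands in for downloading a page and extracting its links. Both functions also skip pages they have already seen, a detail the slides leave implicit for the breadth-first case.

```python
from collections import deque

# Hypothetical link graph: page -> links in the order they appear on the page.
LINKS = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [],
    "E": ["F"],
    "F": [],
}


def get_links(page):
    """Stand-in for fetching a page and extracting its links."""
    return LINKS.get(page, [])


def crawl_breadth_first(start):
    """Visit pages level by level, using a FIFO queue of pending links."""
    order, seen = [start], {start}
    queue = deque(get_links(start))
    while queue:
        page = queue.popleft()  # pick the 1st link from the queue
        if page in seen:
            continue  # already crawled via another link
        seen.add(page)
        order.append(page)
        queue.extend(get_links(page))  # add that page's links to the queue
    return order


def crawl_depth_first(start, order=None, seen=None):
    """Follow the 1st non-visited link as deep as possible, then back up."""
    if order is None:
        order, seen = [], set()
    seen.add(start)
    order.append(start)
    for link in get_links(start):
        if link not in seen:
            crawl_depth_first(link, order, seen)
    return order


if __name__ == "__main__":
    print("BFS:", crawl_breadth_first("A"))
    print("DFS:", crawl_depth_first("A"))
```

On this graph the breadth-first crawl reaches both of A's links before going deeper, while the depth-first crawl exhausts everything under B before it returns to C.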

