Data Science.ppt
Data Science

Topics
  databases and data architectures
    databases in the real world: scaling, data quality, distributed
  machine learning / data mining / statistics
  information retrieval

Data Science is currently a popular interest of employers
  our Industrial Affiliates Partners say there is high demand for students trained in Data Science
    databases, warehousing, data architectures
    data analytics: statistics, machine learning
  Big Data: gigabytes/day or more
    Examples: Walmart, cable companies (ads linked to content, viewer trends), airlines/Orbitz, HMOs, call centers, Twitter (500M tweets/day), traffic surveillance cameras, detecting fraud, identity theft...
  supports "Business Intelligence"
    quantitative decision-making and control
    finance, inventory, pricing/marketing, advertising
    need data for identifying risks and opportunities, conducting "what-if" analyses

Data Architectures
  traditional databases (CSCE 310/608)
    tables, fields
    tuples = records or rows
    key = field with unique values
      can be used as a reference from one table into another
      important for avoiding redundancy (normalization), since redundant copies risk inconsistency
    join = combining 2 tables using a key
    metadata = data about the data
      names of the fields, types (string, int, real, mpeg...)
      also things like source, date, size, completeness/sampling

Example tables
  Instructors, TeachingAssignments, Courses

SQL: Structured Query Language
  SELECT Name, HomeTown FROM Instructors WHERE PhD ...
  SELECT Course, Title FROM Courses ORDER BY Course;
    CSCE 121  Introduction to Computing in C++
    CSCE 206  Programming in C
    CSCE 314  Programming Languages
    CSCE 411  Design and Analysis of Algorithms
  can also compute sums, counts, means, etc.
  example of JOIN: find all courses taught by someone from CMU:
    SELECT TeachingAssignments.Course
    FROM Instructors JOIN TeachingAssignments
    ON Instructors.Name = TeachingAssignments.Name
    WHERE Instructors.PhD = 'Carnegie Mellon'
  result: CSCE 314 and CSCE 206, because they were both taught by Bill Jones

SQL servers
  centralized database, required for concurrent access by multiple users
  ODBC: Open DataBase Connectivity
    protocol to connect to servers and do queries and updates from languages like Java, C, Python
  Oracle, IBM DB2: industrial-strength SQL databases

Some efficiency issues with real databases
  indexing
    how to efficiently find all songs written by Paul Simon in a database with 10,000,000 entries?
    data structures for representing sorted order on fields
  disk management
    databases are often too big to fit in RAM; leave most of the data on disk and swap in blocks of records as needed
    could be slow
  concurrency
    transaction semantics: either all updates happen as a batch or none do (commit or rollback)
    like deleting one record and simultaneously adding another, with a guarantee not to leave the database in an inconsistent state
    other users might be blocked until the transaction is done
  query optimization
    the order in which you JOIN tables can drastically affect the size of the intermediate tables

Unstructured data: raw text
  documents, digital libraries
  grep, substring indexing, regular expressions
    like find all instances of "[aA]g+ies", including "agggggies"
  Information Retrieval (CSCE 470)
    look for synonyms, similar words (like "car" and "auto")
    tf-idf (term frequency / inverse document frequency): weighting for important words
    LSI (latent semantic indexing)
      e.g. "dogs" is similar to "canines" because they are used similarly (both appear near "bark" and "bite")
  Natural Language parsing
    extracting requirements from job postings

Unstructured data: images, video, streams, XML
  images, video (BLOBs = binary large objects)
    how to extract features? index them? search them?
    color histograms
    convolutions/transforms for pattern matching
    looking for ICBM missiles in aerial photos of Cuba
  streams
    sports ticker, radio, stock quotes...
  XML files
    tags indicating field names
    example field values: CSCE 411, Design and Analysis of Algorithms

Object databases
  CHEM 102, Intro to Chemistry, TR 3:00-4:00, prereq: CHEM 101
  Texas A&M, College Station, TX, Div 1A, 53,299 students
  Dr. Frank Smith 302
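
The JOIN query from the SQL slide can be tried out with Python's built-in sqlite3 module. This is only a sketch: the slide's actual tables were not preserved, so every row below is hypothetical except that Bill Jones (PhD from Carnegie Mellon) teaches CSCE 314 and CSCE 206, which is what the slide's stated result implies. The function names `build_demo_db` and `courses_taught_by_phd` are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical rows for the two tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Instructors (Name TEXT PRIMARY KEY, HomeTown TEXT, PhD TEXT);
        CREATE TABLE TeachingAssignments (Name TEXT, Course TEXT);

        -- Hypothetical data; only Bill Jones / Carnegie Mellon follows the slide.
        INSERT INTO Instructors VALUES ('Bill Jones', 'Pittsburgh', 'Carnegie Mellon');
        INSERT INTO Instructors VALUES ('Ann Lee', 'Austin', 'Stanford');

        INSERT INTO TeachingAssignments VALUES ('Bill Jones', 'CSCE 314');
        INSERT INTO TeachingAssignments VALUES ('Bill Jones', 'CSCE 206');
        INSERT INTO TeachingAssignments VALUES ('Ann Lee', 'CSCE 411');
        """
    )
    return conn


def courses_taught_by_phd(conn, school):
    """Return the courses taught by instructors whose PhD is from `school`."""
    query = """
        SELECT TeachingAssignments.Course
        FROM Instructors
        JOIN TeachingAssignments ON Instructors.Name = TeachingAssignments.Name
        WHERE Instructors.PhD = ?
        ORDER BY TeachingAssignments.Course
    """
    # The ? placeholder lets sqlite3 quote the value safely.
    return [row[0] for row in conn.execute(query, (school,))]


if __name__ == "__main__":
    conn = build_demo_db()
    print(courses_taught_by_phd(conn, "Carnegie Mellon"))
```

Name is the key that links the two tables, so the instructor's PhD school is stored once in Instructors rather than repeated on every teaching assignment.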
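
For the indexing question (finding all songs by one writer among millions of entries), one possible data structure for "sorted order on fields" is a sorted list of keys searched by binary search. The sketch below is an assumption about how such an index could look, not how any particular database implements it; real systems typically use B-trees on disk. The names `build_index` and `lookup` and the sample songs are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(rows, field):
    """Return (keys, row_ids): field values in sorted order with matching row positions."""
    pairs = sorted((row[field], row_id) for row_id, row in enumerate(rows))
    keys = [key for key, _ in pairs]
    row_ids = [row_id for _, row_id in pairs]
    return keys, row_ids


def lookup(index, value):
    """Find all row ids whose indexed field equals `value` in O(log n + matches)."""
    keys, row_ids = index
    lo = bisect_left(keys, value)   # first position with key >= value
    hi = bisect_right(keys, value)  # first position with key > value
    return sorted(row_ids[lo:hi])


if __name__ == "__main__":
    # Hypothetical sample rows standing in for a 10,000,000-entry table.
    songs = [
        {"title": "The Boxer", "writer": "Paul Simon"},
        {"title": "Yesterday", "writer": "Paul McCartney"},
        {"title": "Graceland", "writer": "Paul Simon"},
    ]
    writer_index = build_index(songs, "writer")
    print([songs[i]["title"] for i in lookup(writer_index, "Paul Simon")])
```

Building the index costs one sort, after which each lookup avoids scanning the whole table.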
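
The commit-or-rollback behavior described under concurrency can be illustrated with sqlite3, whose connection object works as a context manager that commits if the block succeeds and rolls back if an exception escapes. This is a minimal sketch under assumed names: the one-column Instructors table and the helpers `make_table` and `replace_record` are hypothetical.

```python
import sqlite3


def make_table(names):
    """Create a hypothetical one-column table where Name must be unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Instructors (Name TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO Instructors VALUES (?)", [(n,) for n in names])
    conn.commit()
    return conn


def replace_record(conn, old_name, new_name):
    """Delete one record and add another; either both happen or neither does."""
    try:
        with conn:  # commit on success, rollback if an exception escapes
            conn.execute("DELETE FROM Instructors WHERE Name = ?", (old_name,))
            conn.execute("INSERT INTO Instructors VALUES (?)", (new_name,))
        return True
    except sqlite3.IntegrityError:
        # The insert violated the unique key, so the delete was rolled back too.
        return False


def all_names(conn):
    return [row[0] for row in conn.execute("SELECT Name FROM Instructors ORDER BY Name")]


if __name__ == "__main__":
    conn = make_table(["Bill Jones", "Frank Smith"])
    print(replace_record(conn, "Bill Jones", "Frank Smith"), all_names(conn))
```

In the usage above the insert fails because "Frank Smith" already exists, and the earlier delete of "Bill Jones" is undone, so the table is never left half-updated.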
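
The regular-expression example from the raw-text slide can be checked with Python's re module. The sketch assumes the intended pattern is `[aA]g+ies` (an "a" in either case, one or more "g", then "ies"); the brackets appear to have been lost in the extracted slide text. The function name `find_aggies` is made up for illustration.

```python
import re

# Assumed pattern: [aA] = either case of "a", g+ = one or more "g".
AGGIES = re.compile(r"[aA]g+ies")


def find_aggies(text):
    """Return every substring of `text` that matches the pattern, in order."""
    return AGGIES.findall(text)


if __name__ == "__main__":
    print(find_aggies("Aggies fans cheered as the agggggies scored; 'aies' has no g."))
```

Because `g+` requires at least one "g", a string like "aies" does not match, while any number of repeated "g" characters does.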
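
The tf-idf weighting mentioned under Information Retrieval can be sketched in a few lines. The slides do not give a formula, so this assumes one common variant: tf is the term's count divided by the document length, and idf is the natural log of (number of documents / number of documents containing the term). The function name `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append(
            {
                term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


if __name__ == "__main__":
    docs = ["the dog barks", "the car stops", "the auto stops"]
    for doc, w in zip(docs, tf_idf(docs)):
        print(doc, {t: round(v, 3) for t, v in w.items()})
```

A word that occurs in every document (here "the") gets weight zero, while a word found in only one document gets the highest weight, which is the sense in which tf-idf picks out important words.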
