The MCDC Data Archive.ppt
The MCDC Data Archive
John Blodgett
Office of Social & Economic Data Analysis
University of Missouri
Rev. May 2007
http://mcdc.missouri.edu/tutorials/mcdc_data_archive.ppt

A Brief History of the Archive
- Started by the Urban Information Center (UIC) at UM St. Louis (UMSL), circa 1981.
- Accessing census data files (“STFs”: huge sequential summary files on tape) was very tedious and error-prone.
- The idea was to standardize the data and make it easier, cheaper and more reliable to access.
- The SAS software package was becoming the tool for accessing the data.

Brief History (cont)
- The idea was to create an organized collection of datasets with certain standardization.
- E.g., a FIPS county code field would always be converted to a SAS variable named County and would be stored as a 3-character field (NOT a numeric) with leading 0s.
- STFs with thousands of records would be partitioned into smaller datasets based on geographic summary units (counties, tracts, places, etc.).

Brief History (cont)
- Very informal “database” concept.
- Users were 3 SAS programmers at UIC using MVS (IBM mainframe).
- No web access and no end-user access to worry about.
- A database designed for easy and efficient analysis and ad-hoc queries.
- The data was almost entirely (decennial) Census data.
- We developed SCADS (SAS Census Access and Display System). Sold 8 copies. It only ran on IBM mainframe systems (MVS) with SAS.

Brief History: 1988
- In 1988 the UIC and OSEDA (UM-Extension at Columbia) team up to become data support for the Missouri Census Data Center.
- OSEDA has a wider variety of data that is to be added to the collection (archive).
- OSEDA has data analysts who are not SAS programmers. Lotus 1-2-3 is very big.
- Storing metadata (documentation) in a Pendaflex-based system is no longer as viable as when it was just “us guys”.

Brief History: 1991-1992
- The 1990 Census results are flowing.
- The UIC is converting all the files to SAS datasets, mostly on tape. Data on disk is very expensive on the MVS system.
- The Census Bureau is releasing the data on CDs along with some extraction software. These are the DOS ages.
- Accessing an STF3 table for Poplar Bluff requires mounting a tape and reading it sequentially to find the relevant data, paying for the tape I/Os required to get there. Slow, expensive, and hard to estimate the cost of a query.

Brief History: 1993
- Breakthrough year.
- COIN (Columbia Online Information Network) and Gopher become important elements of the MSCDC.
- The UIC's standard extract reports based on STF3 are turned into very simple but very popular 1 or 2-page demographic profile reports, delivered via the Internet using the Gopher protocol.
- This required copying the report files to a Unix system at OSEDA. But the data and most processing are still on the MVS mainframe.

Brief History: 1994-1996
- Transition years.
- (Most) archive data are copied to an AIX (IBM Unix) system. This was the Great Leap Forward for the archive.
- The web takes off. Windows 95 appears. Suddenly it seems like everybody has MS-Office with Excel.
- First version of Uexplore debuts in 1996 with “sub-applications” xtract, hypercon and tabrgen. It allows users to explore the data archive and do extractions.
- Targeted for use by the state data center core group & affiliates.

Brief History: 2001-2003
- Archive moves to a new hardware system with the storage and processing speed to handle the 2000 decennial census.
- Dexter replaces the old xtract modules. Hypercon & tabrgen are retired.
- Metadata system based on a “datasets dataset” is developed, with Datasets.html index pages.
- Enhancements designed to make the archive more “self service” oriented.

Relevance of History to DA
- It was not until the mid-90s that the data archive was made end-user-accessible via the web. Even then it was for a more sophisticated user, not a casual 1-time user.
- The advent of the WWW resulted in much more emphasis on making datasets easier to use and on creating metadata.
- The widespread use of Excel led us to concentrate on creating extracts that could be easily loaded into spreadsheets.
- There are still “filetypes” in the archive that pre-date the web, and these are generally not as accessible as those created after we started worrying about web-access issues.

What Is the Data Archive?
- A loosely organized collection of data files (data sets, data tables, SAS data sets: these are all terms for the same thing).
- Related supporting files in html, pdf, csv, xls and other standard web formats. Such files may contain metadata, extracts, raw input data, reports, etc.
- A reasonably rigorous set of naming and organizational conventions that make accessing the data easier.
- A network of MCDC people who will assist you with accessing the data.

Data Archive Directories
- The archive is really just a very large Unix directory. It is named /pub/data .
- The 1st level subdirectories represent data categories that we call “filetypes”.
- All filetypes have a subdirectory named Tools where we keep the SAS programs that created the data sets in the filetype directory.
- Occasionally we have subdirectories of filetype directories that contain data files. We do this to avoid having too many data sets in 1 directory.

Uexplore and Directories
- The Uexplore navigation utility displays the contents of a single directory. It lists subdirectories, data files and other files.
- Subdirectories (identified via folder icons) are listed before most files (special files like Datasets.html & Readme.html are the only ones that appear before subdirectories).
- Clicking on a subdirectory invokes Uexplore to display the contents of that subdirectory.

Files and Data Files
- The directories are simply containers for organizing the content of the DA, which consists of files.
- “Data Files” is the term we use to reference the special files that can be accessed via the Dexter extraction utility. AKA “data sets” & “SAS data sets”.
- Uexplore displays a listing of all the files within a directory in alphabetical order, with the filenames serving as hyperlinks.
- In Unix, case matters and uppercase letters sort before lowercase.

File Naming Conventions
- File extensions determine what happens when you select (click on) a file on the Uexplore-generated web page.
- Extensions sas7bdat and sas7bvew indicate data files. Clicking invokes Dexter to extract from that data set.
- Extension sas indicates a SAS code file. It will display as a text file in your browser.
- Most other extensions (html, pdf, csv, txt, etc.) will be displayed as usual by your browser. E.g., for most users clicking on a file with a “.csv” extension will cause Excel to be invoked.

File Naming Conventions
- Many data sets pertain to a specific geographic universe. In these cases we commonly use a filename that identifies this universe, such as “mo” (for Missouri) or “us” (for United States).
- A file name that ends with 2 digits usually indicates data pertaining to a year. So file mocom06.sas7bdat contains data for 2006.

File Naming Conventions (cont)
- We sometimes use geographic levels as part of file names to indicate the level(s) of geography being summarized in the data set.
- E.g., mostcnty is a file containing summaries for Missouri state and counties. uszips04 would indicate ZIP code level summaries for the entire U.S. for 2004.

Datasets.htm
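Returning to the file naming conventions described above (extension decides what a click does, a leading “mo” or “us” names the geographic universe, two trailing digits usually name a year), here is a minimal Python sketch of how a script might interpret an archive file name. It is not part of Uexplore or Dexter. The function names, the action labels ("dexter", "text", "browser") and the two-digit-year pivot of 50 are all assumptions made for illustration, and only the two universe codes mentioned in the slides are recognized.

```python
import os
import re

# Extensions the slides identify as data files (opened with Dexter).
DATA_EXTENSIONS = {"sas7bdat", "sas7bvew"}

# Only the universe codes named in the slides; anything else is unknown here.
KNOWN_UNIVERSES = {"mo": "Missouri", "us": "United States"}


def click_action(filename):
    """Return a hypothetical label for what selecting the file would do."""
    # Case is left untouched because case matters on Unix.
    ext = os.path.splitext(filename)[1].lstrip(".")
    if ext in DATA_EXTENSIONS:
        return "dexter"   # data set: extraction via Dexter
    if ext == "sas":
        return "text"     # SAS code: shown as plain text
    return "browser"      # html, pdf, csv, ...: handled by the browser


def parse_name(filename, pivot=50):
    """Guess universe and year from a file name (heuristic, may be wrong).

    pivot is an assumed cutoff: two-digit values below it are read as
    20xx, the rest as 19xx. The slides do not specify such a rule.
    """
    stem = os.path.splitext(filename)[0]
    universe = KNOWN_UNIVERSES.get(stem[:2])
    # Exactly two trailing digits, not the tail of a longer number.
    match = re.search(r"(?<!\d)(\d{2})$", stem)
    year = None
    if match:
        yy = int(match.group(1))
        year = (2000 if yy < pivot else 1900) + yy
    return {"universe": universe, "year": year}
```

For example, `parse_name("mocom06.sas7bdat")` yields Missouri and 2006, while `parse_name("mostcnty.sas7bdat")` yields Missouri with no year. Because the slides say a two-digit suffix only “usually” indicates a year, the result is best treated as a hint to be checked against the Datasets.html metadata rather than as a fact.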
