Building a Terminological Database from Heterogeneous Definitional Sources
Smaranda Muresan, Peter T. Davis, Samuel D. Popper, Judith L. Klavans
Columbia University
May 21, 2003

Why Is Terminology Important?
- Each agency and each department might define the same concept in a different way.
- Working with multiple databases requires understanding the data across multiple agencies and domains.

What's an Employee?
- US Department of Agriculture: An appointed officer or employee of USDA including special Government employees (collaborators, consultants and panel members). The term excludes independent contractors.
- Federal Railroad Administration: An individual who is engaged or compensated by a railroad or by a contractor to a railroad, who is authorized by a railroad to use its wireless communications in connection with railroad operations.
- Mine Safety and Health Administration: A person who works for wages or salary in the service of an employer.
- US SEC: The term "employee" does not include a director, trustee, or officer.

Desiderata for Terminological Resources
- Capture the ongoing evolution of language.
- Provide consistency, ease of sharing and integration across agencies.

Architecture
- Collection: GetGloss and Definder build the Heterogeneous Definitional Corpus (dynamic sources).
- Semantic Analysis: ParseGloss and Database Building produce the Terminological Database (relations among concepts and their attributes).
- Use: database query (fast access, flexibility, sharing).

Building the Terminological DB: Collection
Motivation
- Definitions are rich in terminological knowledge.
- On-line dictionaries are static and generally incomplete.
- Need to capture the evolution of language.
Solution (Acquisition of Heterogeneous Definitional Corpus)
- GetGloss: identification and extraction of glossaries.
- Definder: extraction of definitions from online free text.

Building the Terminological DB: Semantic Analysis
Motivation
- Need to identify relationships among concepts, e.g. synonyms, hypernyms, cross-references.
- Need to store this conceptual information for easy and fast access and integration.
Example from the Definitional Corpus
- Gasoline: See Motor Gasoline (Finished).
- Motor Gasoline (Finished): A complex mixture of relatively volatile hydrocarbons with or without small quantities of additives, blended to form a fuel suitable for use in spark-ignition engines. Motor gasoline, as defined in ASTM Specification D 4814 or Federal Specification VV-G-1690C, is characterized as having a boiling range of 122 to 158 degrees Fahrenheit at the 10 percent recovery point to 365 to 374 degrees Fahrenheit at the 90 percent recovery point. "Motor Gasoline" includes conventional gasoline; all types of oxygenated gasoline, including gasohol; and reformulated gasoline, but excludes aviation gasoline.
Parsed entry
- Term: Motor Gasoline (Finished)
- Source: (source (agency "EIA") (resource "Gasoline Glossary") (url ))
- Paren-modifier: Finished
- Full Definition: A complex mixture ... for use in ...
- Core Definition: A complex mixture ... for use in spark-ignition engines
- Genus Phrase: A complex mixture of relatively volatile hydrocarbons
- Head Genus Word: hydrocarbons
- Properties:
  - UsedIn: spark-ignition engines
  - Excludes-Includes: includes conventional gasoline; includes gasohol; excludes aviation gasoline
Solution
- Transform definitional text into conceptual data.
- ParseGloss: partial semantic analysis of definitions to identify relations between concepts.
- Store data in a relational database.

Database Use
Motivation
- Enable the user to access the richness of terminological knowledge.
- Ensure easy and fast access to data.
- Enable data sharing and integration across agencies.
- Enable dynamic update of data.
Example: SQL query on the Terminological Database for "inflammation"
- 1. Redness, swelling, heat and pain resulting from injury to tissue (parts of the body underneath the skin). Also known as swelling.
- 2. A characteristic reaction of tissues to disease or injury; it is marked by four signs: swelling, redness, heat, and pain.
- 3. The reaction of tissue to injury.
- 4. A response to irritation, infection, or injury, resulting in pain, redness, and swelling.
Solution
- Query module for the relational database (SQL).

Putting It All Together
- Collection (GetGloss, Definder), Semantic Analysis (ParseGloss, Database Building), Use (database query on the Terminological Database).

GetGloss: Automatic Glossary Extraction
- DGRC project.
- Given a URL, find the glossary file.
- Challenges:
  - Glossaries can be small parts of a web page, embedded within it.
  - There is no standard HTML tag formatting for marking term-definition pairs.
  - A web page can contain term-information pairs where the information is not a definition.
- [Figure: examples of a true positive and a false positive]

Algorithm
Two-step algorithm
- Identification Component: find candidate glossaries.
  - Keyword + rule-based algorithm (6 rules).
  - "glossary", "dictionary" in HTML tags.
  - Terms in alphabetical order.
- Classification Component: filter out false positives.
  - Rule-based method (9 rules), e.g. filter if the term is a Named Entity (e.g. California).
  - Statistical method using SVM.

Evaluating the Identification Component
- 10,000+ pages from 5 different sites.
- 1,000 page sample: no glossaries (n=13).
- 286,579 page sample from 268 domains: P=53%.
- Estimating recall is hard.
- Precision and recall are both very sensitive to perturbations (p=0 vs. p=53%).
- Klavans et al. (dg.o 2002)

Evaluating the Classification Components
- The GetGloss Categorizer assigns a score to each candidate based on a linear combination of weighted features.
- Corpus: 2400 glossary candidates.
- Test: 300 randomly chosen, manually categorized glossary candidates.

Classification Component Performance
Score categories:
- 0: Score below -100
- 1: Score between -100 and -50
- 2: Score between -50 and 0
- 3: Score between 0 and 50
- 4: Score between 50 and 100
- 5: Score above 100

Definder: Automatic Extraction of Definitions from Text
- Part of an NSF-funded digital library project.
- Medical domain.
- Extract definitions from consumer-oriented medical text.
- Corpus:
  - Medical articles written by doctors for a lay audience.
  - Different genres (articles, manual chapters).

Algorithm
- Shallow parsing
  - Simple definitions (e.g. NP-NP pairs for synonyms).
  - Candidate complex definitions.
- Full parsing (Charniak00 parser)
  - apposit
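
The slides describe storing ParseGloss output in a relational database and querying it with SQL, but do not show the schema. The following is a minimal sketch of that idea using Python's built-in sqlite3 module. The table names, column names, relation kinds and helper functions are all assumptions made for illustration; only the Gasoline and Motor Gasoline (Finished) data comes from the slides.

```python
import sqlite3

# Hypothetical schema (not from the slides): one row per definition, plus one
# row per relation (includes, excludes, used_in, see) parsed from it.
SCHEMA = """
CREATE TABLE definition (
    id              INTEGER PRIMARY KEY,
    term            TEXT NOT NULL,
    agency          TEXT,
    definition      TEXT NOT NULL,
    head_genus_word TEXT
);
CREATE TABLE relation (
    definition_id INTEGER NOT NULL REFERENCES definition(id),
    kind          TEXT NOT NULL,
    target        TEXT NOT NULL
);
"""


def open_term_db(path=":memory:"):
    """Create a database with the sketch schema and return the connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_definition(conn, term, agency, definition, head_genus_word=None, relations=()):
    """Insert one definition and its (kind, target) relations; return its id."""
    cur = conn.execute(
        "INSERT INTO definition (term, agency, definition, head_genus_word) "
        "VALUES (?, ?, ?, ?)",
        (term, agency, definition, head_genus_word),
    )
    def_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO relation (definition_id, kind, target) VALUES (?, ?, ?)",
        [(def_id, kind, target) for kind, target in relations],
    )
    conn.commit()
    return def_id


def definitions_for(conn, term):
    """Return (agency, definition) rows for a term, ignoring ASCII case."""
    return conn.execute(
        "SELECT agency, definition FROM definition "
        "WHERE term = ? COLLATE NOCASE ORDER BY id",
        (term,),
    ).fetchall()


def related_terms(conn, term, kind):
    """Return relation targets of one kind for a term, in insertion order."""
    rows = conn.execute(
        "SELECT r.target FROM relation AS r "
        "JOIN definition AS d ON d.id = r.definition_id "
        "WHERE d.term = ? COLLATE NOCASE AND r.kind = ? ORDER BY r.rowid",
        (term, kind),
    ).fetchall()
    return [row[0] for row in rows]


# Load the two glossary entries shown in the slides.
conn = open_term_db()
add_definition(
    conn, "Gasoline", "EIA", "See Motor Gasoline (Finished).",
    relations=[("see", "Motor Gasoline (Finished)")],
)
add_definition(
    conn, "Motor Gasoline (Finished)", "EIA",
    "A complex mixture of relatively volatile hydrocarbons with or without "
    "small quantities of additives, blended to form a fuel suitable for use "
    "in spark-ignition engines.",
    head_genus_word="hydrocarbons",
    relations=[
        ("used_in", "spark-ignition engines"),
        ("includes", "conventional gasoline"),
        ("includes", "gasohol"),
        ("excludes", "aviation gasoline"),
    ],
)

print(related_terms(conn, "motor gasoline (finished)", "includes"))
```

Keeping relations in their own table is one way to let a cross-reference such as "Gasoline: See Motor Gasoline (Finished)" be followed with a second query; a real system could just as well use a different layout.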
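
The GetGloss categorizer is described in the slides as scoring each glossary candidate with a linear combination of weighted features and then mapping the score to a category from 0 to 5. A minimal sketch of those two steps follows. It is not the authors' implementation: the feature names and weights are hypothetical, and because the comparison operators did not survive in the slide text, treating each lower boundary as inclusive is an assumption.

```python
from bisect import bisect_right
from typing import Mapping

# Category boundaries from the slide. Which side of each boundary is
# inclusive is an assumption: here a score equal to a boundary falls into
# the higher category.
BOUNDARIES = [-100, -50, 0, 50, 100]

# Hypothetical feature weights, chosen only to make the example concrete.
WEIGHTS = {
    "glossary_keyword_in_tags": 80.0,
    "terms_alphabetical": 40.0,
    "term_is_named_entity": -60.0,
}


def linear_score(features: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of feature values; features without a weight count as 0."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def score_to_category(score: float) -> int:
    """Map a score to a category 0..5 using the slide's boundaries."""
    return bisect_right(BOUNDARIES, score)


candidate = {
    "glossary_keyword_in_tags": 1.0,
    "terms_alphabetical": 1.0,
    "term_is_named_entity": 0.0,
}
score = linear_score(candidate, WEIGHTS)
print(score, score_to_category(score))
```

Using a sorted boundary list with `bisect_right` keeps the six ranges in one place, so changing the thresholds or the inclusive side only touches `BOUNDARIES` and the choice of bisect function.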