Supporting Annotation Layers for Natural Language Processing
Archana Ganapathi, Preslav Nakov, Ariel Schwartz, and Marti Hearst
Computer Science Division and SIMS, University of California, Berkeley

Motivation
- Most natural language processing (NLP) algorithms make use of the results of previous processing steps, e.g.:
  - Tokenizer
  - Part-of-speech tagger
  - Phrase boundary recognizer
  - Syntactic parser
  - Semantic tagger
- No standard way to represent, store and retrieve text annotations efficiently.
- MEDLINE has close to 13 million abstracts. Full text is starting to become available as well.

Text Annotation Framework
- Annotations are stored independently of text in an RDBMS
- Declarative query language for annotation retrieval
- Indexing structure designed for efficient query processing
- Object-oriented API for annotations: insertion, deletion and modification

Key Contributions
- Support for hierarchical and overlapping layers of annotation
- Querying multiple levels of annotations simultaneously
- First to evaluate different physical database designs
- Focused on scaling annotation-based queries to very large corpora with many layers of annotations
- We propose a query language and demonstrate its power and the efficiency of the indexing architecture on a wide variety of query types that have been published in the NLP literature.

Outline
- Related Work
- Layered Query Language
- Database Design
- API
- Evaluation
- Conclusions

Related Work
- Annotation graphs (AG): directed acyclic graph; nodes can have time stamps or are constrained via paths to labeled parents and children. (Bird and Liberman, 2001)
- Emu system: sequential levels of annotations. Hierarchical relations may exist between different levels, but must be explicitly defined for each pair. (Cassidy [...]
- [...] supports set operations. (Nenadic et al., 2002)

Layers of Annotations
[Figures] Full parse, sentence and section layers are not shown.

Layers of Annotation (cont.)
- Each annotation represents an interval spanning a sequence of characters
  - absolute start and end positions
- Each layer corresponds to a conceptually different kind of annotation
  - e.g., word, gene/protein, shallow parse
  - can have several layers with the same semantics
- Layers can be
  - sequential
  - overlapping, e.g., two multiple-word concepts sharing a word
  - hierarchical
    - spanning, when the intervals are nested as in a parse tree, or
    - ontologically, when the token itself is derived from a hierarchical ontology

Layer Type Properties
- One-to-one correspondence between the Word and the Part-of-speech (POS) layers.
- The Word, POS and Shallow parse layers are sequential.
- The Full parse layer is spanning hierarchical.
- The Gene/protein layer assigns IDs from the LocusLink database of gene names
  - many-to-one in the case of multiple species
- The Ontology layer assigns terms from the hierarchical medical ontology MeSH (Medical Subject Headings)
  - Overlapping (share the word cell) and hierarchical: both spanning, since blood cell (with MeSH ID D001773) spans cell (which is also in MeSH), and ontologically, since blood cell is a kind of cell and cell death (D016923) is a type of Biological Phenomena.

Layered Query Language
- Requirements for the query language on layers of annotations:
  - Intuitive
  - Compact
  - Declarative
  - Expressive power for real world queries
  - Support for hierarchical and overlapping annotations
  - Compatible with SQL
- LQL (Layered Query Language)
  - XML-like
  - Can be translated to SQL to run against an RDBMS
  - Tested on real world bioscience NLP applications

LQL by Example
A01 A07 limb:vein shoulder: artery

LQL Syntax
- “” Defines an arbitrary range over text.
- A range is typically restricted to a specific layer type using [...].
- All layers have a lex (the text spanned by the range) and a tag_type attribute.
- Predicates on attribute values are enclosed in square brackets, i.e. “ | = | ”.
- The language supports the boolean operators conjunction ([...] can be used to descend an ontological hierarchy.

Additional LQL Features
- For spanning hierarchical layers we can have hierarchical queries with several nested references to the same layer. The following query finds a PP of the form preposition+NP and prints that NP:
  [...] print $[...]
- The keyword noorder allows an arbitrary order for the tokens within a range, e.g.:
  [...] print sentence
- The language allows for a combination of ordered and unordered constraints. For example,
  ( ) [...] print sentence
- LQL currently does not support a range overlap operator.

LQL and SQL
- LQL can be automatically translated into SQL (although this is not yet implemented), as:
  - a user-defined function, or
  - a macro
- The result of an LQL query is a relation, which allows the use of standard SQL syntax such as GROUP BY, COUNT, DISTINCT, ORDER BY, UNION etc.
- An added advantage of LQL over SQL is that LQL queries do not need to be modified if the underlying logical design is changed.
- LQL is still a work in progress. We plan to assess it via usability studies with computational linguistics researchers, modifying it as necessary. However, we feel it is more intuitive and easier to use for text processing than the existing languages.

LQL Versus SQL

Database Design
- We evaluated 5 different logical and physical database designs.
- The basic model is similar to the one of TIPSTER (Grishman, 1996). Each annotation is stored as a record in a relation.
- Architecture 1 contains the following columns:
  - docid: document ID;
  - section: title, abstract or body text;
  - layer_id: a unique identifier of the annotation layer;
  - start_char_pos: starting character position, relative to the particular section and docid;
  - end_char_pos: end character position, relative to the particular section and docid;
  - tag_type: a layer-specific token unique identifier.
- There is a separate table mapping token IDs to entities (the string in the case of a word, the MeSH label(s) in the case of a MeSH term, etc.)

Database Design (cont.)
- Architecture 2 introduces one additional column, sequence_pos, thus defining an ordering for each layer.
- Simplifies some SQL queries, as there is no need for “NOT EXISTS” self joins, which are required under Architecture 1 in cases where tokens from the same layer must follow each [...]
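
As a rough illustration of the difference between the two architectures, the sketch below builds the annotation relation in SQLite and finds adjacent tokens of one layer in both ways. It is not the authors' implementation: the column names come from the slides, but the table name annotation, the layer ID, the tag_type values, the half-open character offsets, and the assumption that sequence_pos restarts for each docid, section and layer are all hypothetical.

```python
import sqlite3

WORD = 1  # hypothetical layer_id for the Word layer

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE annotation (      -- table name is an assumption
        docid          INTEGER,
        section        TEXT,
        layer_id       INTEGER,
        start_char_pos INTEGER,
        end_char_pos   INTEGER,
        tag_type       INTEGER,
        sequence_pos   INTEGER     -- present in Architecture 2 only
    )
""")

# Toy text "blood cell death"; the tag_type IDs 101-103 are made up.
conn.executemany(
    "INSERT INTO annotation VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "abstract", WORD, 0, 5, 101, 1),    # blood
        (1, "abstract", WORD, 6, 10, 102, 2),   # cell
        (1, "abstract", WORD, 11, 16, 103, 3),  # death
    ],
)

def adjacent_pairs_arch1(conn, layer_id):
    """Architecture 1: b follows a if no token of the layer lies between them."""
    return conn.execute("""
        SELECT a.tag_type, b.tag_type
        FROM annotation a
        JOIN annotation b
          ON a.docid = b.docid AND a.section = b.section
         AND a.layer_id = b.layer_id
         AND b.start_char_pos >= a.end_char_pos
        WHERE a.layer_id = ?
          AND NOT EXISTS (
              SELECT 1 FROM annotation c
              WHERE c.docid = a.docid AND c.section = a.section
                AND c.layer_id = a.layer_id
                AND c.start_char_pos >= a.end_char_pos
                AND c.end_char_pos <= b.start_char_pos)
        ORDER BY a.start_char_pos
    """, (layer_id,)).fetchall()

def adjacent_pairs_arch2(conn, layer_id):
    """Architecture 2: adjacency is a simple equality on sequence_pos."""
    return conn.execute("""
        SELECT a.tag_type, b.tag_type
        FROM annotation a
        JOIN annotation b
          ON a.docid = b.docid AND a.section = b.section
         AND a.layer_id = b.layer_id
         AND b.sequence_pos = a.sequence_pos + 1
        WHERE a.layer_id = ?
        ORDER BY a.sequence_pos
    """, (layer_id,)).fetchall()

print(adjacent_pairs_arch1(conn, WORD))
print(adjacent_pairs_arch2(conn, WORD))
```

Both functions return the same pairs on this toy data; the second avoids the correlated NOT EXISTS subquery at the cost of maintaining sequence_pos whenever annotations are inserted or deleted.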