BioLINK Talks.ppt
BioLINK Talks

BioLINK, Detroit, June 24 (Edinburgh July 11)
Linking Literature, Information and Knowledge for Biology

- Corpora and Corpus design (2)
- NER and Term Normalisation (3)
- Annotation and Zoning (2)
- Relation Extraction (2)
- Other

Corpora and corpus design

Corpus Design for Biomedical Natural Language Processing
K. Bretonnel Cohen et al (U of Colorado)

Main Question: why are some (bio-)corpora more used than others? What makes them attractive?

Crucial points:
- format: XML
- code several layers of information
- publicity: write specific papers about corpus, publicise its availability

Take home message: if you want people to use your corpus, use XML, publish annotation guidelines, publicise corpus with dedicated papers, use it for competitions

MedTag: a collection of biomedical annotations
L. Smith et al. (National Center for Biotechnology Information, Bethesda, Maryland)

Main Point: MedTag is a database that combines three corpora:
- MedPost (modified to include 1000 extra sentences)
- ABGene
- GENETAG (modified to reflect new defs of genes and prots)

The data is available in flat files + software to facilitate loading data into SQL database

Take home message: integrated data, more accessible, you should try it.

MedPost
- 6700 sentences
- annotated for POS and gerund arguments
- POS tagger trained on it (97.4% accuracy)

GENETAG
- 15000 sentences currently released
- tagged for gene/protein identification
- used in Biocreative

ABGene
- over 4000 sentences
- annotated for gene/protein names
- NER tagger trained on it (lower 70s)

GOOD
Recommended Uses:
- training and evaluating POS taggers
- training and evaluating NER taggers
- developing and evaluating a chunker (for PubMed phrase indexing)
- analysis of grammatical usage in medical text
- feature extraction for ML
- entity annotation guidelines

BAD
- tokenisation! (white spaces were deleted)

NER and TN

Weakly Supervised Learning Methods for Improving the Quality of Gene Name Normalization
Ben Wellner (MITRE)

Main points:
1. presenting method of improving quality of training data from BioCreative task 1b. System performance on improved data is better than on original data
2. weakly supervised methods can be successfully applied for re-labeling noisy training data

(next week)

Unsupervised gene/protein normalization using automatically extracted dictionaries
A. Cohen (Oregon Health & Science U., Portland, Oregon)

Main point: dictionary-based gene and protein NER and normalisation system; no supervised training; no human intervention.
- what curated databases are the best collections of names?
- are simple rules sufficient for generating orthographic variants?
- can common English words be used to decrease false positives?
- what is the normalization performance of a dictionary-based approach?

Results: near state-of-the-art; saving on annotation

METHOD
1. Building the dictionary: automatically extracted from 5 databases (official symbol, unique identifiers, name, symbol, synonym, alias fields)
2. Generating orthographic variants: set of 7 simple rules applied iteratively
3. Separating common English words: dictionary split in two parts, confusion and main dictionary
4. Screening out most common English words
5. Searching the text
6. Disambiguation. Note: 5% ambiguous intra-species; 85% across species. Exploit non-ambiguous synonyms; exploit context

A machine learning approach to acronym generation
Tsuruoka et al (Tokyo (Tsujii group), Japan and Salford, UK)

Task: system generates possible acronyms from a given expanded form
Method: ML approach (MaxEnt Markov Model)
Main point: acronym generation as sequence tagging problem

Experiments:
- 1901 definition/acronym pairs
- several ranked options as output
- 75.4% coverage when including top 5 candidates
- baseline: take first letters and capitalise them

Classes (tags)
1. SKIP (generator skips the letter)
2. UPPER (g
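
The baseline named in the acronym generation experiments (take first letters and capitalise them) and the top-5 coverage figure can be illustrated with a short Python sketch. This is not the authors' code: the function names `baseline_acronym` and `coverage_at_k` are made up for illustration, and reading "coverage" as the fraction of pairs whose reference acronym appears among the top k candidates is an assumption.

```python
from typing import List, Sequence


def baseline_acronym(expanded_form: str) -> str:
    """Baseline: first letter of each whitespace-separated word, upper-cased."""
    return "".join(word[0].upper() for word in expanded_form.split())


def coverage_at_k(ranked_candidates: Sequence[List[str]],
                  gold_acronyms: Sequence[str],
                  k: int = 5) -> float:
    """Fraction of items whose gold acronym is among the top-k candidates.

    Assumed reading of "coverage"; the slides do not define it precisely.
    """
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(ranked_candidates, gold_acronyms)
    )
    return hits / len(gold_acronyms)


if __name__ == "__main__":
    print(baseline_acronym("magnetic resonance imaging"))
    # Toy ranked outputs for two expanded forms (hypothetical data).
    ranked = [["MRI", "MR"], ["NL", "NLPR"]]
    print(coverage_at_k(ranked, ["MRI", "NLP"], k=5))
```

The baseline treats every word alike, so it cannot skip function words or keep a lower-case letter; that is the gap the sequence-tagging formulation is meant to close.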
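
For the dictionary-based normalisation talk (A. Cohen), the METHOD steps can be sketched roughly as follows. Everything concrete here is an assumption made for illustration: the toy dictionary, the tiny common-word list, and the variant rules (the slides mention seven rules applied iteratively but do not list them; this sketch applies a few hyphen and space substitutions once). Ambiguous forms are simply skipped, which only loosely mirrors "exploit non-ambiguous synonyms".

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary; the real system extracts names from 5 databases.
GENE_DICTIONARY: Dict[str, List[str]] = {
    "TP53": ["p53", "tumor protein p53"],
    "IL2": ["IL-2", "interleukin 2"],
    "WAS": ["WASP"],
}
# Hypothetical stand-in for a list of common English words.
COMMON_ENGLISH: Set[str] = {"was", "and", "the", "not"}


def variants(name: str) -> Set[str]:
    """Illustrative orthographic variants (not the talk's seven rules)."""
    forms = {
        name,
        name.replace("-", " "),
        name.replace("-", ""),
        name.replace(" ", "-"),
        name.replace(" ", ""),
    }
    return {form.lower() for form in forms}


def build_lookup(dictionary: Dict[str, List[str]],
                 common_words: Set[str]) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Split all name variants into a main and a confusion dictionary."""
    main: Dict[str, Set[str]] = {}
    confusion: Dict[str, Set[str]] = {}
    for gene_id, names in dictionary.items():
        for name in [gene_id, *names]:
            for form in variants(name):
                target = confusion if form in common_words else main
                target.setdefault(form, set()).add(gene_id)
    return main, confusion


def normalise(text: str, main: Dict[str, Set[str]]) -> Set[str]:
    """Return identifiers whose unambiguous variants occur in the text."""
    lowered = text.lower()
    found: Set[str] = set()
    for form, ids in main.items():
        pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
        if len(ids) == 1 and re.search(pattern, lowered):
            found |= ids
    return found


if __name__ == "__main__":
    main_dict, confusion_dict = build_lookup(GENE_DICTIONARY, COMMON_ENGLISH)
    print(normalise("Expression of IL 2 and p53 was measured.", main_dict))
```

In this toy run the word "was" lands in the confusion dictionary, so the sentence does not produce a false hit for the WAS entry, while "IL 2" still matches through a generated variant of "IL-2".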
