Dr. Dobb's Journal - February 2008 - (Page 38)

Core Technology
BIBPORT: CREATING BIBLIOGRAPHIC REFERENCES

continued from page 36

the Word object to obtain the visual cues used to assist in disambiguating a reference. Using the WOM removes the need to understand file formats, but introduces a second problem: Because Word defines how words and characters are tokenized in documents, several categories of tokens are not split into the form that BibPort's parsers need. One instance of this is the distribution of punctuation among tokens. The Words array was designed with English words as its logical unit, rather than more discrete units in which punctuation is independent. Hence, the WOM tends to group punctuation marks with words instead of representing them as separate tokens.

Listing One is a sample of the BibPort parsing code. This listing is invoked to determine whether the current word is the beginning of the title of a book. Several features of this snippet are worth noting. First, because BibPort is programmed as a Word AddIn, the various WOM entities are inserted directly into a namespace called ThisAddIn. The Application object exposes a single Document reference, ActiveDocument, that indicates the current working document in Word.

A second observation concerns how the code inspects the token. In the APA style, the only means of detecting a book title is whether it's underlined. If it's not, the reference certainly is not a book, so BibPort stops parsing. As you can see, this is tested directly by comparing the Underline attribute of a word against the WOM value that represents a single underline, wdUnderlineSingle.

Figure 7: Nontextual cues in the WOM.

Bibliography Management

In Word 2007, the WOM has been expanded to allow programmer access to the new Source Manager. However, the interface provided by WOM is not native; inserting a source involves passing a string containing the XML representation of the source. Unfortunately, this schema was not available on the Internet at the time of writing, so we have reverse engineered the tags used to describe a source; see Table 1. The references in the Source Manager can also be found in the XML file in the user's Application Data directory under Microsoft\Bibliography\Sources.xml.

Adding a new reference involves a simple call to the VSTO method Globals.ThisAddIn.Application.Bibliography.Sources.Add(xml), but this addition can have some level of complexity because the source's Tag field must be unique. Due to the COM facilities used in VSTO, exceptions are not trivial to handle: They are typically mapped to an integer and returned as an error code.

Conclusion

There are many challenges that make text mining difficult [1]. The wide variety of file formats (RTF, .doc, .wpd, and the like) requires a text-mining application to parse multiple document types, while also being able to output to multiple formats. BibPort hides this complexity behind an abstraction layer to help you write text-mining applications in a more generalized manner. The combination of VSTO with VS 2008 removed many of the accidental complexities in processing Word files to extract bibliographic information. Furthermore, Word 2007's Source Manager provided a convenient repository for this information, and VSTO provided the mechanisms necessary to use this new resource.

There are several projects that share similar goals to BibPort. There is a strong need to mine bibliographic citations from documents available from web searches and digital libraries [2]. Perhaps the most well-known example is CiteSeer [3], a popular website that understands citations in different formats to allow cross-referencing of research papers.

References

[1] Andrew McCallum. “Information extraction: Distilling structured data from unstructured text.” ACM Queue, 3(9):48–57, November 2005.
[2] Steve Lawrence, C. Lee Giles, and Kurt Bollacker. “Digital libraries and Autonomous Citation Indexing.” IEEE Computer, 32(6):67–71, June 1999.
[3] C. Lee Giles, Kurt Bollacker, and Steve Lawrence. “CiteSeer: An automatic citation indexing system.” Third ACM Conference on Digital Libraries, pp. 89–98, Pittsburgh, PA, June 1998.

DDJ

Listing One

    Word.Document doc = Globals.ThisAddIn.Application.ActiveDocument;
    // Accumulate words for as long as they carry a single underline.
    while (doc.Words[index].Underline == Word.WdUnderline.wdUnderlineSingle)
    {
        // A period followed by a non-underlined word marks the end of the title.
        if (doc.Words[index].Text.Trim() == "." &&
            doc.Words[index + 1].Underline != Word.WdUnderline.wdUnderlineSingle)
            break;
        title += doc.Words[index++].Text;
    }
    // No underlined text was found, so this reference is not a book.
    if (title == "")
        return false;

Table 1: XML tags used to describe a source.
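The article notes that the WOM tends to return tokens with punctuation attached, which is not the form BibPort's parsers need. One way such tokens might be post-processed is sketched below. This is a minimal illustration and not BibPort's actual code: the TokenSplitter class, its regular expression, and the sample strings are hypothetical, and the sketch operates on plain strings so that it runs without Word installed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: splits the text of one Word token into
// word pieces and standalone punctuation marks.
static class TokenSplitter
{
    // Either a run of letters/digits (allowing internal apostrophes or
    // hyphens, as in "text-mining"), or a single non-space symbol.
    private static readonly Regex Piece = new Regex(
        @"[\p{L}\p{N}]+(?:['’\-][\p{L}\p{N}]+)*|[^\s\p{L}\p{N}]",
        RegexOptions.Compiled);

    public static List<string> Split(string wordText)
    {
        var pieces = new List<string>();
        foreach (Match m in Piece.Matches(wordText))
            pieces.Add(m.Value);   // whitespace is never matched, so it is dropped
        return pieces;
    }
}

class Program
{
    static void Main()
    {
        // Assumed examples of token text with punctuation and trailing spaces attached.
        foreach (string raw in new[] { "Pittsburgh, ", "(1998). ", "text-mining " })
            Console.WriteLine(string.Join(" | ", TokenSplitter.Split(raw)));
    }
}
```

In an add-in, the argument to Split would presumably be the Text of an entry in doc.Words, as used in Listing One. Formatting cues such as Underline belong to the original Word token, so a parser using this approach would need to carry them over to each piece.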