Dr. Dobb's Journal - May 2008 - (Page 45)

increasing correlation values, but were not necessarily signs of plagiarism. Had these results been known up front, some of them could have been eliminated before CodeMatch was run. However, given the number of files and the number of source-code elements, it was impractical to find these elements before performing the correlation. Also, the correlation itself pointed out many of these superfluous elements.

My experience using filtering with CodeMatch can be generalized to any kind of information retrieval process. To understand how filtering can be used, it is important to first understand the different kinds of information retrieval processes and information display methods.

Information Retrieval

Since the best-known information retrieval process is probably web searching, I use it as an example. Information retrieval can be classified into two types: exact match and best match. The exact-match type is represented by the Boolean retrieval method. Here, users enter Boolean equations of keywords, and all objects in the information domain that meet the criteria are retrieved. Most search engines use exact matching; the information domain is the Web. Even the more sophisticated search engines that let users input natural-language queries typically parse the language to extract keywords and Boolean equations.

Best-match information retrieval uses vector-space and probabilistic retrieval methods that essentially try to understand what information a user wants, sometimes based on past searches or other stored user parameters, and then present the information that seems closest to what the user wants. An example is the book suggestions that Amazon.com presents to customers, based on each customer's search criteria and past searches.
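The difference between the two retrieval types can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the three-document domain, the function names, and the choice of cosine similarity over raw term counts as a stand-in for a vector-space method. It is not how any particular search engine works.

```python
import math
from collections import Counter

# Hypothetical three-document information domain, for illustration only.
DOCS = {
    "doc1": "source code plagiarism detection tools",
    "doc2": "web search engines and keyword queries",
    "doc3": "detecting copied source code in large projects",
}


def tokenize(text):
    """Lowercase a string and split it on whitespace."""
    return text.lower().split()


def boolean_and(query_terms, docs):
    """Exact match: return ids of documents containing every query term."""
    wanted = set(query_terms)
    return sorted(doc_id for doc_id, text in docs.items()
                  if wanted <= set(tokenize(text)))


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, docs):
    """Best match: rank documents with nonzero similarity, highest first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id)
              for doc_id, text in docs.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True)
            if score > 0]
```

With this toy domain, boolean_and(["source", "search"], DOCS) retrieves nothing because no document contains both terms, while best_match("source code search", DOCS) still returns all three documents, ranked by how closely each resembles the query.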
Figure 1 is a graphical representation of information retrieval, where D is the information domain, Q is the user's query, and O is the object retrieved by the query. D_U is the subset of the domain that meets the user's information need based on the retrieval process. Each arrow from an object to the query represents the relationship R_i between the query and object O_i. For all retrieval methods, D_U is the set of all objects such that R_i > 0:

D_U = {O_i : R_i > 0} for all i

CodeMatch Post-Process Filtering

To make examining the correlation results more useful and let users focus on the kinds of correlation that are most important, I added the ability to filter the results. After CodeMatch produces a database of results, this filtering can be performed on the database:

• Statement filtering. Users create a list of statements. Any correlation due to a statement on this list is eliminated.
• Comment filtering. Users create a list of comments. Any correlation due to a comment on this list is eliminated.
• Identifier filtering. Users create a list of identifiers. Any correlation due to an identifier on this list is eliminated.
• General file filtering. Users create a list of file names. Any correlation between any file whose name appears on the list and any other file is removed from the results database.
• Specific file filtering. Users create a list of file names with specific paths. Any correlation between a specific file on the list and any other file is removed from the results database.
• Folder filtering. Users create a list of folders. Any correlation between a file in a folder on the list and any other file is removed from the results database.
• Threshold filtering. Users can change threshold parameters, reducing the number of correlated file pairs that are displayed. Users can set minimum and maximum correlation scores to display and can set a maximum number of correlated files to display.
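The filters listed above can be sketched as a single pass over a results set. The following Python is a minimal illustration under stated assumptions, not CodeMatch's implementation: the Match and FilePair record layouts, the apply_filters function, and the rule that a pair's score is the sum of its match weights are all hypothetical. Only element filtering, general file filtering, and threshold filtering are shown; specific-file and folder filtering would follow the same pattern using full paths.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class Match:
    """One matching element between two files (hypothetical layout)."""
    kind: str      # "statement", "comment", or "identifier"
    text: str
    weight: float  # assumed contribution of this match to the pair's score


@dataclass
class FilePair:
    """One correlated file pair in the results set (hypothetical layout)."""
    file_a: str
    file_b: str
    matches: list = field(default_factory=list)

    @property
    def score(self):
        # Assumption: a pair's score is simply the sum of its match weights.
        return sum(m.weight for m in self.matches)


def apply_filters(pairs, ignored, ignored_names=(), min_score=0.0,
                  max_pairs=None):
    """Return filtered copies of pairs, highest adjusted score first.

    ignored: dict mapping a match kind to a set of texts to discard.
    ignored_names: bare file names; any pair touching one is dropped.
    """
    blocked = set(ignored_names)
    kept = []
    for pair in pairs:
        names = {PurePosixPath(pair.file_a).name,
                 PurePosixPath(pair.file_b).name}
        if names & blocked:
            continue  # general file filtering
        # Statement, comment, and identifier filtering.
        remaining = [m for m in pair.matches
                     if m.text not in ignored.get(m.kind, set())]
        adjusted = FilePair(pair.file_a, pair.file_b, remaining)
        if adjusted.score >= min_score:  # threshold filtering
            kept.append(adjusted)
    kept.sort(key=lambda p: p.score, reverse=True)
    return kept[:max_pairs]


# Example data, invented for illustration.
pairs = [
    FilePair("a/main.c", "b/main.c",
             [Match("statement", "i++;", 1.0),
              Match("identifier", "parse_header", 4.0)]),
    FilePair("a/util.c", "b/stdio.h",
             [Match("comment", "Copyright notice", 2.0)]),
    FilePair("a/io.c", "b/io.c",
             [Match("statement", "return 0;", 1.0)]),
]
result = apply_filters(
    pairs,
    ignored={"statement": {"i++;", "return 0;"}},
    ignored_names={"stdio.h"},
    min_score=1.0,
)
print([(p.file_a, p.file_b, p.score) for p in result])
```

In this example only the main.c pair survives. The stdio.h pair is dropped by name, and the io.c pair falls below the minimum score once its only match, a common statement, is filtered out. Recomputing each score from the remaining matches is one simple way to model scores being adjusted after filtering.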
After the filtering is performed on the database, the correlation scores between file pairs are adjusted accordingly. I found that for large file sets, this filtering reduced the manual review needed to find plagiarized source-code files from days to hours or even minutes.

However, the relationship can be further refined depending on whether the retrieval