Dr. Dobb's Journal - May 2008 - (Page 44)

State of the Art

Getting Better Search Results
Human-aided filtering can make the difference

by Bob Zeidman

Bob is the president of Zeidman Consulting. He can be reached at Bob@ZeidmanConsulting.com.

Search engines are great. Put in keywords and out pop hundreds, thousands, sometimes millions of web pages. But then what? How can you effectively look at all of those pages? Maybe it’s time to put people back in the equation. After all, we can still do a few things better than computers, like quickly filtering out irrelevant information. With the right tools, the computer can even help us do this more efficiently.

In this article (which is based on a more detailed paper that was presented at the 11th World MultiConference on Systemics, Cybernetics and Informatics), I use human-aided filtering to focus in on useful information. I have incorporated human-aided filtering into a tool for finding software plagiarism. After the tool finds similar sections of code in two programs, the human and the computer work together to pinpoint the results that are most relevant.

CodeMatch

For the past decade, I’ve been an expert witness in intellectual property cases, asked to examine software source code from a plaintiff or defendant to determine whether one has plagiarized code from the other. Over time, I’ve found that the few existing tools for plagiarism detection were too inaccurate for situations where hundreds of millions of dollars could be at stake. Consequently, I developed my own tool called “CodeMatch” (www.safe-corp.biz/products_codesuite.htm).

CodeMatch uses four algorithms to determine the correlation between source-code files for different programs:

• Statement Correlation. A measure of the number of identical statements.
• Comment Correlation. A measure of the number of identical comments.
• Identifier Correlation. A measure of the number of identical and nearly identical identifiers.
• Instruction Sequence Correlation. A measure of the longest sequence of identical instructions.

After using CodeMatch on a number of cases, I found that although it had great accuracy, it shared one deficiency with the other tools: too much output. After examining the results, I often found information that was not relevant to the particular case on which I was working. Because a large comparison could take a week to produce results, it was impractical to rerun the comparison using new settings. I began to spend time manually filtering the results to obtain a more manageable and more relevant set.

The main purpose of CodeMatch was to reduce the time I spent looking at lines of code. While it did reduce my time by at least an order of magnitude compared with manually examining code files, I now wanted to reduce the time I spent poring over the results. (My wife thinks this is a bit crazy since I get paid per hour.)

Superfluous Results

In reviewing the results of a comparison, I often found that some specific files or specific source-code elements would show up throughout the results, skewing them and hiding important correlation information. For example, open-source files may have been used in one or both sets of files. In searching for plagiarized code, the open-source files would be highly correlated with each other, but these correlations were not important. Pieces of these files would show up throughout both sets of files and be flagged as highly correlated.

Similarly, there were specific statements, comments, and identifiers that showed up in many places, but were not relevant to finding plagiarized code. Users searching for plagiarized code may find that two programs running on Linux both use the same system calls. Thus, files with these system calls will have a higher correlation.

Common identifier names like “index,” “count,” and “result” showed up in many files,
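To make the four correlation measures and the common-identifier problem concrete, here is a rough Python sketch. It is not CodeMatch's actual implementation, which the article does not spell out. The names `split_source`, `correlate`, and `COMMON` are hypothetical, and the sketch makes several simplifying assumptions: one statement per line, `//` comments only, language keywords counted as identifiers unless filtered, no "nearly identical" identifier matching, and whole statements standing in for instructions in the sequence measure.

```python
import re
from difflib import SequenceMatcher

IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def split_source(text):
    """Split source text into (statements, comments).

    Assumption: one statement per line, and only '//' comments.
    Whitespace is normalized so formatting differences do not matter.
    """
    statements, comments = [], []
    for line in text.splitlines():
        code, _, comment = line.partition("//")
        if code.strip():
            statements.append(" ".join(code.split()))
        if comment.strip():
            comments.append(" ".join(comment.split()))
    return statements, comments


def correlate(text_a, text_b, ignore=frozenset()):
    """Return simple counts loosely modeled on the four measures.

    `ignore` is a human-supplied set of identifiers to leave out,
    which is one crude form of human-aided filtering.
    """
    stmts_a, comments_a = split_source(text_a)
    stmts_b, comments_b = split_source(text_b)

    # Identifiers are taken from code only; keywords are included
    # unless the caller lists them in `ignore`.
    ids_a = set(IDENTIFIER.findall(" ".join(stmts_a))) - ignore
    ids_b = set(IDENTIFIER.findall(" ".join(stmts_b))) - ignore

    # Longest run of consecutive identical statements in both files.
    matcher = SequenceMatcher(None, stmts_a, stmts_b, autojunk=False)
    run = matcher.find_longest_match(0, len(stmts_a), 0, len(stmts_b))

    return {
        "statements": len(set(stmts_a) & set(stmts_b)),
        "comments": len(set(comments_a) & set(comments_b)),
        "identifiers": sorted(ids_a & ids_b),
        "longest_sequence": run.size,
    }


# Two made-up fragments used only for illustration.
FILE_A = """\
int count = 0;  // running total
for (index = 0; index < n; index++)
    count += acme_price[index];  // TODO fix rounding
return count;
"""

FILE_B = """\
int count = 0;  // number of items
for (index = 0; index < n; index++)
    result = acme_price[index] * 2;  // TODO fix rounding
return result;
"""

# Keywords and generic names a reviewer might choose to filter out.
COMMON = frozenset({"int", "for", "return", "index", "count", "result", "n"})

print(correlate(FILE_A, FILE_B))
print(correlate(FILE_A, FILE_B, ignore=COMMON))
```

Without the `ignore` set, the shared identifiers are dominated by keywords and generic names such as `index` and `count`. With it, only the distinctive `acme_price` remains, which is the kind of match a reviewer would actually want to look at.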