Project home

Summary: Source code content search engine
Categories: None
License: GNU General Public License
Owner(s): ghaffar2, jkawah, kmeyer, kojuharo, mailloux, peerenboom_dave

CodeCrawler will be a system, geared toward developers, that makes searching source code as easy as searching the Internet and as powerful as using a grep tool. It will provide a web interface that lets users search with regular expressions, boolean queries, and programming-specific extensions. Results will be ranked by relevance, taking into account source code semantics (class, method, variable, etc.), will include code snippets, and will link to the full source code. The system should be easy to use, configurable, and responsive.


Large-scale software development and maintenance present many challenges to developers. As the code base grows, so do the difficulties of staying up to date with the code, finding the right place to implement new requirements, and fixing existing bugs. In addition, as code size increases, documentation sometimes goes stale or is missing altogether, which makes training new developers difficult and time-consuming.

When wandering through the code, developers often use grep utilities to find a particular piece of code. These utilities let developers search source files for matches to a regular expression, and while powerful, they have three disadvantages. First, writing a regular expression for a search requires at least some knowledge about what you are searching for, for example the prefix of a variable name or the fact that it ends with a number. Second, grep returns every match to the given regular expression and treats all of them as equally relevant. This means a developer might get hundreds of results and have no idea which one to look at first. Third, grep utilities are part of the operating system or are integrated within an IDE, and their results cannot be viewed from the web. These days, when so many open source projects are developed over the Internet, searching source code directly from a web browser seems natural.

With the advance of the Internet, web search engines have become an irreplaceable tool for developers. They are not as precise or as powerful as grep utilities, but they address the disadvantages of grep: they allow inexact matches, rank results by relevance, and display them in a web-viewable form. However, these search engines are usually geared toward text searches and do not work well for source code.

Search engines have drawbacks of their own. In programming languages, identifiers usually combine several words (e.g., ListArray or basic_string), but to a search engine a word is just a word. It would be useful if search engines could split an identifier into its component words. Search engines also compute a relevance score for a particular result based, in part, on how many times the search keywords occur in it. In source code, however, there is an implied understanding that some occurrences are more important than others. For example, when searching for "Foo", a document with several occurrences of a local variable "Foo" is usually less relevant than a document declaring a class "Foo". Text search engines, of course, don't have the knowledge to perform this analysis. This lack of knowledge also means that developers searching for a keyword with particular semantics, say function "Foo", might get results for class "Foo" or variable "Foo" or any other "Foo". A nice addition to the search engine query syntax would be a way to specify the semantics of the keyword being searched, for example, "function: Foo".
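
As an illustration of the identifier splitting described above, here is a minimal sketch in Python. The function name split_identifier and the splitting rules (camel case, underscores, digit runs, acronyms kept whole) are assumptions made for this example; they are not part of any CodeCrawler design.

```python
import re

# Hypothetical tokenizer for illustration only. It recognizes, in order:
#   - a run of capitals not followed by a lowercase letter (an acronym),
#   - an optionally capitalized lowercase word,
#   - a run of digits.
# Underscores and other separators are simply skipped.
_WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(identifier):
    """Return the lowercase component words of an identifier."""
    return [word.lower() for word in _WORD.findall(identifier)]
```

With these rules, split_identifier("ListArray") yields ["list", "array"] and split_identifier("basic_string") yields ["basic", "string"], so a search for "string" could match either naming style.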

A successful solution would combine the best of web search engines and grep tools, and extend them with knowledge about programming language syntax and source code semantics, allowing more intelligent searches that determine the relevance of search results more accurately.
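
To make the idea of semantics-aware ranking concrete, the following Python sketch scores files by the kind of each matching occurrence and accepts a "function: Foo" style query. Everything here is hypothetical: the Occurrence record, the weight values, and the functions parse_query and rank are assumptions chosen for illustration, and a real system would obtain occurrences from a language-aware parser.

```python
from dataclasses import dataclass

# Assumed weights: a declaration counts for more than a plain variable use.
KIND_WEIGHTS = {"class": 5.0, "function": 4.0, "variable": 1.0}


@dataclass(frozen=True)
class Occurrence:
    path: str   # file containing the identifier
    name: str   # identifier text, e.g. "Foo"
    kind: str   # "class", "function", or "variable"


def parse_query(query):
    """Parse 'function: Foo' into ('function', 'Foo'); 'Foo' into (None, 'Foo')."""
    kind, sep, name = query.partition(":")
    if sep and kind.strip().lower() in KIND_WEIGHTS:
        return kind.strip().lower(), name.strip()
    return None, query.strip()


def rank(occurrences, query):
    """Return (path, score) pairs, highest score first, ties broken by path."""
    kind, name = parse_query(query)
    scores = {}
    for occ in occurrences:
        if occ.name != name or (kind is not None and occ.kind != kind):
            continue
        weight = KIND_WEIGHTS.get(occ.kind, 1.0)
        scores[occ.path] = scores.get(occ.path, 0.0) + weight
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

Under these assumed weights, a file that declares class "Foo" once outranks a file that uses a variable "Foo" three times, and the query "function: Foo" returns only files where "Foo" occurs as a function.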

