CodeCrawler will be a system, geared toward developers, which makes searching
source code as easy as searching the Internet and as powerful as using a grep
tool. It will provide a web interface that lets users search with regular
expressions, boolean queries, and programming-specific extensions. Results
will be ranked by relevance, taking into account source code semantics (class,
method, variable, etc.), and will be shown as code snippets with access to
the full source code. The system should be easy to use, configurable, and responsive.

Large-scale software development and maintenance present many challenges to
developers. As the code base grows, so do the difficulties of staying up to
date with the code, finding the right place for implementing new requirements,
or trying to fix existing bugs. In addition, as code size increases, documentation
sometimes falls out of date or is missing altogether, which makes training
new developers difficult and time-consuming.

When navigating through the code, developers often use grep utilities to find a
particular piece of code. These utilities let developers search source files
for a match with a regular expression, and while powerful, they have three disadvantages.
First, writing a regular expression for a search requires at least some
knowledge of what you are searching for, such as the prefix of a variable
name or the fact that it ends with a number. Second, the results returned from
a grep are all matches to the given regular expression and in that respect are
all equally relevant. This means a developer might get hundreds of results and
have no idea which one to look at first. Third, grep utilities are part of the
operating system or are integrated within an IDE, and their results cannot be viewed
from the web. These days, when so many open-source projects are developed over the
Internet, searching source code directly from a web browser seems natural.
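
To make the second disadvantage concrete, here is a minimal Python sketch of a
grep-style search. The function name grep_lines and the in-memory dict that
stands in for files on disk are assumptions made for illustration; this is not
part of CodeCrawler or of any real grep tool.

```python
import re


def grep_lines(pattern, sources):
    """Return (file_name, line_number, line) for every line matching the regex.

    `sources` maps a file name to its text. A plain dict is used instead of
    real files so that the sketch is self-contained.
    """
    regex = re.compile(pattern)
    hits = []
    for name, text in sources.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((name, number, line.strip()))
    return hits


# Hypothetical sample code base.
sources = {
    "foo.py": "class Foo:\n    pass\n",
    "util.py": "def helper():\n    Foo = 1\n    return Foo\n",
}

hits = grep_lines(r"\bFoo\b", sources)
for hit in hits:
    print(hit)
```

The class declaration and the two local-variable uses come back as plain matches
in file order. Nothing in the result says which one a developer should look at first.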

With the rise of the Internet, web search engines have become an indispensable
tool for developers. They are not as precise or as powerful as grep utilities,
but they address grep's disadvantages. Web search engines allow inexact matches, rank
the results by relevance, and display them in a web-viewable form. However,
these search engines are usually geared toward text searches and do not work
well for source code.

Search engines have drawbacks of their own. In programming languages, identifiers
usually combine several words (e.g., ListArray or basic_string), but to a search
engine a word is just a word. It would be useful if search engines could split
an identifier into its component words. Search engines also compute a relevance
score for a particular result based, in part, on how many occurrences of the
search keywords appear in it. In source code, however, there is an implied understanding
that some occurrences are more important than others. For example, when searching
for "Foo", a document with several occurrences of a local variable
"Foo" is usually less relevant than a document declaring a class "Foo".
Text search engines, of course, don't have the knowledge to perform this analysis.
This lack of knowledge also means that developers searching for a keyword with
particular semantics, say function "Foo", might get results for class "Foo",
variable "Foo", or any other "Foo". A useful addition to
the search engine query syntax would be a way to specify the semantics of the keyword
being searched, for example, "function: Foo".
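
As an illustration of such a query syntax, the following Python sketch parses
terms of the form "function: Foo" into (kind, keyword) pairs. The set of
qualifiers (class, function, variable), the name parse_query, and the choice to
reject unknown qualifiers are all assumptions; the actual syntax is not defined here.

```python
import re

# Assumed set of semantic qualifiers; the real list is not specified.
KNOWN_KINDS = {"class", "function", "variable"}

# An optional "kind:" prefix (with optional spaces after the colon),
# followed by the keyword itself.
TERM = re.compile(r"(?:(\w+):\s*)?(\w+)")


def parse_query(query):
    """Split a query into (kind, keyword) pairs; kind is None when unqualified."""
    terms = []
    for kind, word in TERM.findall(query):
        kind = kind.lower()
        if kind and kind not in KNOWN_KINDS:
            raise ValueError(f"unknown qualifier: {kind}")
        terms.append((kind or None, word))
    return terms


print(parse_query("function: Foo"))
print(parse_query("class:ListArray basic_string"))
```

A qualified term restricts the search to one kind of identifier, while an
unqualified term keeps the ordinary "any Foo" behavior.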

A successful solution would combine the best of web search engines and grep
tools, and extend them with knowledge of programming language syntax and source
code semantics to allow more intelligent searches that more accurately determine
the relevance of search results.
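
The two source-code-specific ideas above, splitting identifiers into words and
weighting occurrences by their semantics, can be sketched in a few lines of
Python. Everything here is hypothetical: the weights, the function names, and
the assumption that some parser has already reduced each file to a list of
(identifier, kind) occurrences. It is a sketch of the idea, not a ranking design.

```python
import re

# Matches acronyms ("HTTP"), capitalized or lowercase words, and digit runs.
# Underscores match nothing, so they act as separators.
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

# Assumed weights: a declaration of a class counts more than a variable.
KIND_WEIGHTS = {"class": 5.0, "function": 3.0, "variable": 1.0}


def split_identifier(identifier):
    """Split ListArray or basic_string into lowercase component words."""
    return [word.lower() for word in WORD.findall(identifier)]


def score(occurrences, keyword, kind=None):
    """Sum the weights of the occurrences whose words include the keyword's words."""
    wanted = set(split_identifier(keyword))
    total = 0.0
    for identifier, occurrence_kind in occurrences:
        if kind is not None and occurrence_kind != kind:
            continue  # the query asked for a different kind of identifier
        if wanted and wanted <= set(split_identifier(identifier)):
            total += KIND_WEIGHTS.get(occurrence_kind, 1.0)
    return total


def rank(documents, keyword, kind=None):
    """Return the names of matching documents, highest score first."""
    scored = [(score(occs, keyword, kind), name) for name, occs in documents.items()]
    scored.sort(key=lambda pair: -pair[0])
    return [name for value, name in scored if value > 0]


# Hypothetical index: util.py uses a local variable Foo three times,
# foo.py declares the class Foo once.
documents = {
    "util.py": [("Foo", "variable")] * 3,
    "foo.py": [("Foo", "class")],
}

print(split_identifier("ListArray"))
print(rank(documents, "Foo"))
print(rank(documents, "Foo", kind="variable"))
```

With these assumed weights, the file declaring class Foo outranks the file that
merely uses a variable Foo three times, and restricting the query to variables
returns only the latter.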