NLBean version 5: A Natural Language Interface for Databases

Copyright 2001, Mark Watson. All rights reserved.

The NLBean was written to "scratch an itch" that I had in 1977. I was experimenting with the Java JDBC APIs, and it occurred to me that a relatively simple text-skimming NLP system could convert limited-domain English queries to SQL automatically. This problem is made easier by the limited vocabulary: many of the words used in English queries would be terms in the database (e.g., table column names). The original version was approximately 9000 lines of Java code, much of that to support a client-server architecture (I was also having fun with RMI). The current version (version 5) has all of the client server code removed and most of the experimental code that I have played with in the last 4 years. The code base is now approximately 1600 lines of Java code, and is a bit more readable.

1. Design and implementation of the NLBean

The NLBean was a hack from the first day that I started writing it. Most of the initial design effort supported the now trashed RMI client-server architecture. In this section, we will look at a road map of the software. The following list summarizes the responsibilities for all NLBean classes:

1.1 DBInfo Class

The DBInfo class is used to manage the data associated with a database. One instance of the DBInfo class manages all tales in a single database. The following class data is used:

Note that we are storing some information redundantly here: the user name and password are specific to an entire database, but we store this information indexed by table number. This is a programming convenience that costs a small amount of additional storage. Also, the maximum number of tables that can be loaded into the NLBean is set to a maximum of ten because of the use of static arrays instead of Java vectors.

The following methods supply the behavior of the DBInfo class:

1.2 DBInterface Class

The DBInterface class encapsulates all database access in one class so that you can add support for alternative database products, etc., by modifying a single small piece of code. All class data and methods are static, so you never create an instance of the DBInterface class. A static Boolean variable needToInit is used to ensure that the database access setup calls are only executed one time.

The following static methods are used to implement the class behavior (only the methods query, update, and getColumnNames are public):

1.3 Help Class

The Help class is derived from java.awt.Dialog class. This class contains text explaining the use of the NLBean.

1.4 MakeTestDB Class

The class MakeTestDB contains a static main method so it can be run as a standalone program. Running MakeTestDB creates a test database containing three tables: NameTable, Products, and Employees. This class uses the DBInterface utility class to access a local InstantDB database.

1.5 NLBean Class

The NLBean class is the main application class for this demo system. It is derived from the class java.awt.Panel and provides both a user interface and natural language processing behavior by using instances of classes NLEngine and NLP. The original NLBean system could be used as a JavaBean component or a standalone application. In order to make the NLBean a simpler example for this book, I removed code that allowed the NLBean to function as a full-featured JavaBean; currently the NLBean can only be run as a standalone demo application.

The NLBean class contains several internal helper class definitions that support the user interface:

Most of the code in the NLBean class is user interface specific and was written for the original 1997 version of the NLBean.

1.6 NLEngine Class

The class NLEngine is used by the NLBean user interface code to perform natural language queries against either the test database, or any other database if DBInterface is modified to support the new database system, if required. The NLEngine class stores information for all loaded databases and tables. This class converts natural language queries to SQL statements. This class uses the generated SQL statement for a natural language query to query the database using the DBInterface class.

The public API for the NLEngine class is:

1.7 NLP Class

The NLP class is the top-level class responsible for parsing natural language queries. The NLP class maintains an array of strings currentWords and an index into this array currentWordIndex while parsing a natural language query.

The class methods are:

This class is really a "text skimmer", rather than a parser. The following list shows the types of patterns that are recognized:

In addition to the simple form in these two examples, any "where" clause can be compound with clauses separated by "and" or "or" keywords.

1.8 SmartDate Class

The SmartDate class is used to detect the presence of legal date string in a natural language query. This class recognizes many possible date formats by using the Java Calendar and SimpleDateFormat classes to attempt to parse any test string.

2 Running the NLBean NLP System

The ZIP file creates a directory NLBean and the source code is in the subdirectory nlbean (which is the Java package name). If you are running Windows, use the build.bat and run.bat files to build the system (you only need to do this once) and to run it.

If you are using Linux or UNIX, use the following commands to build the system:

javac -classpath idb.jar:. -d . nlbean/*.java
jar cvf nlbean.jar nlbean/*.class
rm nlbean\*.class

Then use the following to tun the NLBean:

java -classpath nlbean.jar:idb.jar:. nlbean.NLBean

Return to: www.markwatson.com