NLBean version 5: A Natural Language Interface for Databases
Copyright 2001, Mark Watson. All rights reserved.
The NLBean was written to "scratch an itch" that I had in 1977. I was experimenting with the Java
JDBC APIs, and it occurred to me that a relatively simple text-skimming NLP system could convert limited-domain
English queries to SQL automatically. This problem is made easier by the limited vocabulary: many of the words
used in English queries would be terms in the database (e.g., table column names). The original version was approximately
9000 lines of Java code, much of that to support a client-server architecture (I was also having fun with RMI).
The current version (version 5) has all of the client server code removed and most of the experimental code that
I have played with in the last 4 years. The code base is now approximately 1600 lines of Java code, and is a bit
more readable.
1. Design and implementation of the NLBean
The NLBean was a hack from the first day that I started writing it. Most of the initial design effort supported
the now trashed RMI client-server architecture. In this section, we will look at a road map of the software. The
following list summarizes the responsibilities for all NLBean classes:
- DBInfo ñ this class encapsulates the data required to keep track of the information for a single database resource;
all information for the tables in a database are stored in the same DBInfo instance
- DBInterface ñ this class contains the static methods Query and Update used for all database access; this class
is set up to use both InstantDB and IBM's DB2, defaulting to InstantDB. Supporting other databases is usually a
simple as setting the URL for the database and login information.
- Help ñ this class is derived from the standard Dialog class and is used to show help information
- MakeTestDB ñ this classes creates test database tables for running the NLBean as a standalone demo
- NLBean ñ the main class for the NLBean system. This class uses an instance of NLEngine to perform NLP operations
- NLEngine ñ the top-level class for NLP operations, including adding database tables to the system, parsing
natural language queries, etc.
- NLP ñ a helper class for performing NLP operations, including translating SQL queries back into a natural language
form for display
- SmartDate ñ a utility for parsing and recognizing dates in a variety of formats
1.1 DBInfo Class
The DBInfo class is used to manage the data associated with a database. One instance of the DBInfo class manages
all tales in a single database. The following class data is used:
- columnNames ñ a two-dimensional array of strings used to store the names of all column names in every table
in a database. The first array index is the table number in the database and the second index is the column number
index. This data will be added to the static word dictionary (or lexicon) and be used in parsing natural language
queries.
- databaseNames ñ an array of strings containing the names of databases that have been loaded into the NLBean.
Indexed by table number.
- numTables ñ the total number of tables loaded into the system
- password ñ Indexed by table number.
- userNames ñ the user name for each table. Indexed by table number.
Note that we are storing some information redundantly here: the user name and password are specific to an entire
database, but we store this information indexed by table number. This is a programming convenience that costs a
small amount of additional storage. Also, the maximum number of tables that can be loaded into the NLBean is set
to a maximum of ten because of the use of static arrays instead of Java vectors.
The following methods supply the behavior of the DBInfo class:
- DBInfo ñ class constructor that statically allocates the arrays for holding table information.
- addTable ñ adds data for table name, database login information, and column names for a specific table.
- clearTables ñ removes all tables from the NLBean system
- debug ñ prints out information for all tables that have been loaded
- findColumnName ñ given a column name, this method returns an array of all tables that contain that column name
- isColumn ñ returns Boolean true if a string is a valid database column name, otherwise returns a Boolean false
value.
- isTable ñ returns Boolean true if a string is a valid database table name, otherwise returns a Boolean false
value.
1.2 DBInterface Class
The DBInterface class encapsulates all database access in one class so that you can add support for alternative
database products, etc., by modifying a single small piece of code. All class data and methods are static, so you
never create an instance of the DBInterface class. A static Boolean variable needToInit is used to ensure that
the database access setup calls are only executed one time.
The following static methods are used to implement the class behavior (only the methods query, update, and getColumnNames
are public):
- checkConnection ñ this method is passed an instance of the class java.sql.SQLwarning and determines if the
current database connection is OK
- doInit ñ if required, this method loads the drivers for the current database product (set up for InstantDB)
and connects to the selected database
- getColumnNames ñ returns an array of strings containing the column names of the specified table
- query ñ used to do SQL queries against a connected database
- resultSetToString ñ a private utility method for converting a java.sql.ResultSet object to a string
- update ñ used to do SQL updates (i.e., to modify a connected database). This method is not used in the NLBean,
but it is used in the utility class MakeTestDB for creating a test database.
1.3 Help Class
The Help class is derived from java.awt.Dialog class. This class contains text explaining the use of the NLBean.
1.4 MakeTestDB Class
The class MakeTestDB contains a static main method so it can be run as a standalone program. Running MakeTestDB
creates a test database containing three tables: NameTable, Products, and Employees. This class uses the DBInterface
utility class to access a local InstantDB database.
1.5 NLBean Class
The NLBean class is the main application class for this demo system. It is derived from the class java.awt.Panel
and provides both a user interface and natural language processing behavior by using instances of classes NLEngine
and NLP. The original NLBean system could be used as a JavaBean component or a standalone application. In order
to make the NLBean a simpler example for this book, I removed code that allowed the NLBean to function as a full-featured
JavaBean; currently the NLBean can only be run as a standalone demo application.
The NLBean class contains several internal helper class definitions that support the user interface:
- MouseHelp ñ an adapter class to handle events from the "help" button. The method mouseReleased causes
the help window to be visible.
- ChoiceListener ñ derived from java.awt.ItemListener. The method itemStateChanged is called when the example
choice control is changed.
- MouseSelect1 ñ derived from the adapter class java.awt.MouseAdapter to handle events in the top left database
selection list.
- MouseSelect2 ñ derived from the adapter class java.awt.MouseAdapter to handle events in the top middle database
table selection list.
- MouseSelect3 ñ derived from the adapter class java.awt.MouseAdapter to handle events in the top left database
table column name selection list.
- MouseQuery ñ derived from the adapter class java.awt.MouseAdapter. The method mouseReleased starts the database
query process when the "query" button is clicked.
Most of the code in the NLBean class is user interface specific and was written for the original 1997 version
of the NLBean.
1.6 NLEngine Class
The class NLEngine is used by the NLBean user interface code to perform natural language queries against either
the test database, or any other database if DBInterface is modified to support the new database system, if required.
The NLEngine class stores information for all loaded databases and tables. This class converts natural language
queries to SQL statements. This class uses the generated SQL statement for a natural language query to query the
database using the DBInterface class.
The public API for the NLEngine class is:
- NLEngine ñ class constructor that creates instances of classes DBInfo and NLP.
- addDB ñ used to add database information to the system
- addSynonym ñ used to define a new parsing synonym
- breaklines ñ utility to convert a single Java string tht contains multiple lines into an array of strings
- clearDB ñ removes all loaded database information
- clearSynonyms ñ removes all loaded synonym information
- createResultSet ñ used to make a database query from a generated SQL statement
- getColumnNames ñ used return the column names generated for a SQL query
- getRows ñ used to execute a SQL query and return all lines as a single Java string
- getSQL ñ calls the NLP class getSQL method to get the last generated SQL statement
- initDB ñ initializes database data and connections
- parse ñ performs some preprocessing and cleanup of natural language queries and then calls the NLP class parse
method.
- toEnglish ñ calls the NLP class toEnglish method to convert SQL statements back into a natural language representation
for display
1.7 NLP Class
The NLP class is the top-level class responsible for parsing natural language queries. The NLP class maintains
an array of strings currentWords and an index into this array currentWordIndex while parsing a natural language
query.
The class methods are:
- NLP ñ class constructor that requires an instance of DBInfo
- eatColumnName ñ processes and removes a column name from a query.
- eatWord ñ a utility method that is passed an array of strings; any words in this array at the current word
index are processed and removed.
- getSQL ñ returns the SQL for the last processed query as a Java string
- parse ñ top level parsing method. There are three parsing modes: a new query, processing an "and clause",
and processing an "and <condition>" clause.
- quoteLiteral ñ adds single quote marks, if required, around a literal before insertion into a generated SQL
query
- toEnglish ñ converts an SQL query back into natural language
This class is really a "text skimmer", rather than a parser. The following list shows the types of
patterns that are recognized:
- List <name of any column in a database> where <any other column name> equals <a string>
- List <name of any column in a database> where <any other column name> is after <any date expression>
In addition to the simple form in these two examples, any "where" clause can be compound with clauses
separated by "and" or "or" keywords.
1.8 SmartDate Class
The SmartDate class is used to detect the presence of legal date string in a natural language query. This class
recognizes many possible date formats by using the Java Calendar and SimpleDateFormat classes to attempt to parse
any test string.
2 Running the NLBean NLP System
The ZIP file creates a directory NLBean and the source code is in the subdirectory nlbean (which is the Java
package name). If you are running Windows, use the build.bat and run.bat files to build the system (you only need
to do this once) and to run it.
If you are using Linux or UNIX, use the following commands to build the system:
javac -classpath idb.jar:. -d . nlbean/*.java
jar cvf nlbean.jar nlbean/*.class
rm nlbean\*.class
Then use the following to tun the NLBean:
java -classpath nlbean.jar:idb.jar:. nlbean.NLBean