XML/RDB version 1.0
====================
Automatically create relational DB schema & then populate those
schemas directly from an XML file! Works with either a 'plain'
XML file or an XML Schema. Once your RDB is populated with data
from your XML file(s) - you can then 'unpopulate' back into XML!
All you need is an XML file or XML Schema, and MySQL or PostgreSQL.
Needs MySQL or PostgreSQL as well as their DBD's and a database
named 'test' - for the 'make test' right out of the box the DB
needs to be running on localhost & have no username/password.
You can edit mysql_test_config or pg_test_config to match your
configuration otherwise in the 't' directory.
INSTALLATION
To install this module type the following:
perl Makefile.PL
make
make test
make install
DEPENDENCIES
This module requires these other modules and libraries:
'XML::DOM' => '1.29',
'DBIx::DBSchema' => '.16',
'DBIx::Recordset' => '.23',
'DBIx::Sequence' => '.04',
'URI::Escape' => '3.16',
'IO::File' => '1.08',
COPYRIGHT AND LICENCE
#####
#
# COPYRIGHT AND LICENSE
# Copyright (c) 2003, Juniper Networks, Inc.
# All rights reserved.
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
# 1. Redistributions of source code must retain the above
# copyright notice, this list of conditions and the following
# disclaimer.
# 2. Redistributions in binary form must reproduce the above
# copyright notice, this list of conditions and the following disclaimer
# in the documentation and/or other materials provided with the
# distribution.
# 3. The name of the copyright owner may not be used to
# endorse or promote products derived from this software without specific
# prior written permission.
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR
# IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT,
# INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
# SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
# STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
# IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
# POSSIBILITY OF SUCH DAMAGE.
#
#####
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
A long-arse how-to & explanation:
An XML document is an ideal transport for data between heterogeneous
systems. XML documents are also an ideal way to represent hierarchical
data generically. Yet when it comes time to store, query, compare, edit,
delete, and even create these data-centric documents, few mature XML tools
exist.
Fortunately, there is an older technology that has been successfully
handling these tasks for years. It also has (fairly) standard syntax
and a standard query language (in fact it has _the_ Standard Query Language
(1)). Everyone's got one - your pal and mine - the Relational Database
Management System (RDBMS).
While native XML Databases are best suited for storing document-centric
data like XHTML files, data-centric documents, like a Juniper Network's(tm)
router configuration, are best stored within a RDBMS, a system which is tuned
for data storage and manipulation (2).
By bringing these two technologies together, we can leverage each
system's strengths, minimize their weaknesses, and learn that by using Perl,
two seemingly opposed technologies can become friends.
An example of hierarchical data is a router's configuration.
Using a Perl module, an XML-ified version of a Juniper router's
configuration can be retrieved easily. Almost any bit of router information
can be requested and received in XML using Juniper Network's JUNOScript (3)
technology. Using a standard set of XML libraries, these XML documents are
easily manipulated at a low level. Unfortunately, higher-level tools to
manipulate XML are either immature, unproven, unknown, or simply non-existent.
But by putting XML documents into a RDBMS we have access to all of the robust,
mature data manipulation tools we need.
The ability to map XML to RDBMSs (and then back into XML) plays to
both system's strengths: XML provides self-describing data transport,
and the RDBMS provides data managing and manipulating tools.
But how should we put an XML document into a RDBMS? The quick and dirty
answer is to store the entire XML document in a RDBMS as a Binary
Large Object (BLOB) or Character Large Object (CLOB) - but these approaches
solve very little and certainly do not take full advantage of the RDBMS. Other
existing tools force you to pre-create your RDBMS schema to match your XML
Documents, or require you to create either an XML or proprietary template to
define the mapping between your XML documents and your RDBMS table sets. But
I thought XML was self-describing - why should we have to describe our data
twice? Fortunately we don't have to. Using the self-describing nature of
XML documents, and Perl, there is a better way.
Relationships
The key to transforming XML into a RDBMS is analyzing the relationships
in an XML document and then mapping those relationships into a RDBMS.
Let's examine the kinds of relationships utilized by a RDBMS - there are
three:
1. 1 to 1 relationship (1:1)
We are only interested in the simplest case - the primary entity
must participate in the relationship but the secondary entity may not.
e.g. I own 1 car but my 1 car does not own me (or does it????)
This relationship is modeled by storing the secondary entity's
primary key as a foreign key in the primary entity's table.
2. 1 to N relationship (1:N)
There is only one case for our purposes - the primary entity may
possess multiple secondary entities.
e.g. I own zero or more books.
This relationship is modeled by storing the primary entity's (the '1')
primary key as a foreign key in the secondary entity's (the 'N')
table.
3. N to N relationship (N:N)
For the purposes of transforming XML we do not need these!
e.g. the relationship between students and classes - each student
can have multiple classes and each class can have multiple
students.
This relationship is modeled by creating a new table whose rows hold
the primary key from each foreign table.
XML documents can be distilled into just the first two kinds of RDBMS
relationships. Let's look at some XML:
My Address Book
Mark
Perl Place
Bob
Heck Ave.
FL
Here is the 'root' entity in this XML fragment and has
two sub-entities, and . and form a
1:1 relationship and and form a 1:N relationship.
Similarly there are 1:1 relationships between and , ,
and . That's all we need to know!
Without further ado, let's put Perl to work.
Module #1 - MakeTables
Our first Perl script does exactly what we just did - it analyzes the
relationships between the entities in an XML document and outputs those
relationships as a set of RDB tables. It takes one command-line argument -
the XML file you want to analyze. Here's the generated table that
corresponds to the entity:
CREATE TABLE gen_address_book (
gen_name_id integer NULL,
id integer NOT NULL,
PRIMARY KEY (id)
);
Lots to note here - first, all table names are prefixed by a user-supplied
string - in this case 'gen' (for 'generated'). Also, some characters that
offend RDBMSs are transformed into underscores (don't worry, the real names
are also stored in the database for exporting back to XML). Finally, a
generated primary key column is added to each table (named 'id').
One to one relationships
So what we've got is a table that contains a reference to a row in
the 'gen_name' table to model our 1:1 relationship between
and . The primary key of a 'gen_name' row (the 'id' value) becomes
a foreign key in a 'gen_address_book' row (the 'gen_name_id' value).
One to many relationships
But what about the 1:N relationship between and ?
As good RDBMS design tells us, it is modeled by placing the primary key
of the '1' part of the relationship as a foreign key in the 'N' part of the
relationship. Let's look at the table generated for the entity:
CREATE TABLE gen_entry (
gen_address_book_fk integer NOT NULL,
gen_name_id integer NULL,
gen_state_id integer NULL,
gen_street_id integer NULL,
id integer NOT NULL,
PRIMARY KEY (id)
);
The interesting bit is the 'gen_address_book_fk' column. This column
will contain the primary key of a gen_address_book row that contains this
. The other columns you will recognize as parts of a 1:1
relationship between and , , and . And of
course the generated primary key column 'id'.
Plain old text
Let's now look at the gen_name table:
CREATE TABLE gen_name (
gen_name_type_attribute text NULL,
gen_name_value text NULL,
id integer NOT NULL,
PRIMARY KEY (id)
);
The column 'gen_name_value' holds the text associated with this entity.
The column 'gen_name_type_attribute' holds the text associated with the
attribute 'type' in the entity . Again, there is a generated table
that contains the mappings between RDBMS column and table names and XML
names.
The other tables gen_state and gen_street simply consist of a '_value'
column and a generated primary key. Note we did not have to do anything
to generate these tables! We simply fed our XML document to MakeTables.
Meta tables
To help keep track of everything, MakeTables creates four extra
tables to hold meta-information about this XML document. These tables are
only used internally, so you do not have to worry about them.
Two of the tables are used to create primary keys in a generic,
database-independent fashion and are not very interesting for our purposes.
The table 'gen_element_names' contains the mappings between table and
column names to XML names - this is table we alluded to earlier. It looks
like this:
CREATE TABLE gen_element_names (
db_name text NOT NULL ,
xml_name text NOT NULL
);
Every time MakeTables has to generate a RDBMS equivalent name for an
XML tag (every tag and attribute name must be converted), another row gets
added to this table. Here are the rows that get inserted into this table
from our example (generated by MakeTables of course):
INSERT INTO gen_element_names VALUES ('gen_street','street');
INSERT INTO gen_element_names VALUES ('gen_address_book','address-book');
INSERT INTO gen_element_names VALUES ('gen_name_type_attribute','type');
INSERT INTO gen_element_names VALUES ('gen_entry','entry');
INSERT INTO gen_element_names VALUES ('gen_name','name');
INSERT INTO gen_element_names VALUES ('gen_state','state');
Using this table we can accurately re-create our XML document.
The final generated meta table is called 'gen_link_tables'. This table
contains a list of all of the 1:N relationships in the XML document. Like
the 'gen_element_names' table it is only used internally for bookkeeping.
Here's what it looks like:
CREATE TABLE gen_link_tables (
one_table text NOT NULL ,
many_table text NOT NULL
);
And here's the row that gets inserted into it using our example:
INSERT INTO gen_link_tables VALUES ('gen_address_book','entry');
There is only one 1:N relationship in our XML document, so there's only
one row in this table. This table is used by later scripts to
populate and unpopulate the data into and out of our RDBMS.
Now that our tables have been generated, it's time to import them into
our database and populate them. The output of MakeTables is a bunch of
'CREATE TABLE' and 'INSERT' statements. Each RDBMS has its own way to
import these statements - check your documentation. Later we'll see an
real live example using MySQL (10).
Module #2 - PopulateTables
Once our tables exist in our database we need to actually pull the
data out of our XML document and put it into our RDBMS. Remember,
MakeTables only analyzes the relationships between elements in an XML
document - the actual data is ignored. The actual data parsing is the
job of module #2, PopulateTables. It takes only one argument,
the name of the XML document that was passed to MakeTables.
The data contained within that XML document will be transformed and stored
into your RDBMS. Let's take a look at our RDBMS after we've run
PopulateTables. (using MySQL) using our example:
mysql> select * from gen_address_book;
+-------------+----+
| gen_name_id | id |
+-------------+----+
| 1 | 1 |
+-------------+----+
Ok, not too exciting. Let's see the 'gen_name' table:
mysql> select * from gen_name;
+-------------------------+-----------------+----+
| gen_name_type_attribute | gen_name_value | id |
+-------------------------+-----------------+----+
| NULL | My Address Book | 1 |
| Person | Mark | 2 |
| NULL | Bob | 3 |
+-------------------------+-----------------+----+
Now things get a little more interesting! We see our 1:1 relationship
between and via the 'gen_name_id' in the gen_address_book
table matching the 'id' in the gen_name table, and sure enough its value is
our 's name, 'My Address Book'. You'll notice that the
'gen_name_type_attribute' column is null for the two 's that don't
possess this attribute and is set to 'Person' for the that does.
Let's see the 1:N relationships in the 'gen_entry' table:
mysql> select * from gen_entry;
+---------------------+-------------+--------------+---------------+----+
| gen_address_book_fk | gen_name_id | gen_state_id | gen_street_id | id |
+---------------------+-------------+--------------+---------------+----+
| 1 | 2 | NULL | 1 | 1 |
| 1 | 3 | 1 | 2 | 2 |
+---------------------+-------------+--------------+---------------+----+
The two 's associated with this are linked by
the foreign key column. The 'gen_address_book_fk' column matches the
'id' column in our gen_address_book row. You can also see the 1:1
relationship between each entry and its name (via the 'gen_name_id' column).
The state and street 1:1 relationships are similar.
It's in! We've transformed our XML document into our RDBMS. At
this point we can take a break, sip some coffee, and rest assured that our
RDBMS has our data under its watchful eye. We can use any
of the mature RDBMS utilities and tools to massage, view, change, add, backup,
and delete our information. For some people this could indeed be the end
of the line - but not for us!
Modules #3 - UnpopulateTables
Our XML is in our RDBMS - now we want to get it out! We've put our data
through the RDBMS tool's ringer, doing all the zany things to it we wanted -
but now we want our XML back. Say hello to module #3, UnpopulateTables. But
before we get too acquainted there is one piece about script #2,
PopulateTables that I have not yet mentioned, that makes our life easier.
After PopulateTables finishes populating our tables, it returns a handy 2
element array containing the root table name & a primary key uniquely
identifying a row in that table that corresponds to the XML file we just
inserted into our DB.
It has told us all we need to know to re-create the XML we just RDBMS-ized.
UnpopulateTables takes that very same two element array - the name of the 'root'
table and the generated primary key for the row we're interested in. I'll
discuss what that second bit of output is later. In our example the root
table is 'gen_address_book', and since it's the first one we added to our
database its primary key is one. This tells UnpopulateTables where to start
unwinding from our RDBMS back into XML. If you knew the table name and
primary key of any other row you could create just a fragment of your XML
by specifying those values to UnpopulateTables (all easily gleaned from your
RDBMS). And out goes your XML.
Bonus Module #4 - UnpopulateSchema
We've come full circle. What a not-so-long-but-definitely strange trip
it's been. We've gone from an XML document into a RDBMS and back out again.
Yet something is still missing. The problem lies with not-fully-specified XML
documents. What if another document had a tag within
the tag? Two different table sets would be generated. That
makes it impossible to put two different-yet-related XML documents into the
same set of tables. What is needed, when generating RDBMS tables from
MakeTables, is a 'fully-specified' XML document containing all possible
tags and attributes in all possible configurations. Then all 1:1 and 1:N
relationships could be correctly identified and all attributes could be
accounted for. But how can we get a 'fully-specified' XML document?
We can generate one using XML Schema (4). Written in XML themselves,
XML Schemas fully specify what may be contained within a conforming XML
document. The bonus fourth script - UnpopulateSchema - will 'unpopulate' an
XML Schema stored in your RDBMS as a fully-specified XML document. This XML
document can then be fed to MakeTables to generate fully-specified RDBMS
tables.
Now that's quite a mouthful, but if you have an XML Schema for your
documents, you can use that, and not a specific instantiation of
that Schema (which might not have all of the allowed entities in all possible
configurations) to create your RDBMS tables. Then all conforming XML
documents can be fed to PopulateTables to populate your RDBMS without worrying
about table mismatch.
The secret is that XML Schemas are well-formed XML documents themselves.
Running MakeTables on your Schema itself and then importing those tables
into your RDBMS gets the ball rolling. Then you populate your RDBMS the
usual way by running PopulateTables with the XML Schema as the supplied XML
Document. Finally running UnpopulateSchema instead of UnpopulateTables against
that data will output a fully-specified XML document, instead of just your
XML Schema back again. Since all XML Schemas must follow strict guidelines,
UnpopulateSchema only needs to know the primary key of the XML Schema in your
RDBMS. This is the second bit of information output by PopulateTables.
XML Schema is a very complicated specification. Not all of the nooks
and crannies of the specification are supported by UnpopulateSchema - which
is by far the longest and most complicated for the four scripts.
Here's what's supported - the numbers in parentheses correspond to sections
in the XML Schema Primer (5):
Named Simple and Complex types (2.2 & 2.3)
Simple type restrictions and enumerations (2.3)
List types (2.3.1)
Unions types (2.3.2)
Anonymous Type Definitions and choices (2.4)
Complex Types from Simple Types (simpleContent) (2.5.1)
Mixed content (2.5.2)
Empty content (2.5.3)
Choice and Sequence groups (including xsd:group) (2.7)
'All' group (2.7)
Attribute groups (2.8)
Nil Values (2.9)
Deriving Types by Extension (4.2)
Deriving Complex Types by Restriction (4.4)
Abstract Elements & Types (partially) (4.7)
Here's what's not:
anyType (2.5.4) (not applicable)
Target Namespaces & Unqualified locals (3.1)
Qualified locals (3.2)
Importing & Multiple documents (4.1)
Redefining Types & Groups (4.5)
Substitution Groups (4.6)
Abstract Elements & Types (partially) (4.7)
Controlling the Creation & Use of Derived Types (4.8) (not applicable)
Specifying Uniqueness (5.1) (not applicable)
Defining Keys & their References (5.2) (not applicable)
Importing Types (5.4)
Any Element, Any Attribute (5.5) (not applicable)
Schema Location (5.6)
The namespace and importing can be handled by collecting all referenced
Schemas by hand and creating one large document from them. This list is
subject to change - especially the namespace and importing functions.
Now Things Get Interesting!
Let your mind go! Not only can _any_ XML document be stored in your
RDBMS, but you don't even have to have a XML document to start with. All
you need is an XML Schema. Use that to create your table set. You don't
have to use PopulateTables to populate the database - use whatever tool you
want. When you're ready, use UnpopulateTables and you've magically got
well-formed, valid XML to pass on to whomever you choose. Any XML document--
SOAP (6), SVG (7), XSLT (8), whatever, can easily be intelligently imported
into your RDBMS. Once XML Schema really gets going and all XML documents
are defined using it, you've got a ready-made RDBMS system just waiting for
conforming XML documents.
What You Need To Make It Work
In your constructor call to XML::RDB you provide the path to your
configuration file. The format of the file is key/values pairs - one per
line - delimited by '='. See 'config.test' in the base directory for
all of the options.
the most important (& only) variable you must set is 'DSN':
DSN=DBI:mysql:database=TEST
This is a MySQL DSN - alter it to fit your needs. You can also change the
prefix of each generated table by altering this line:
# All tables names will being with this string
TABLE_PREFIX=gen
The _only_ thing you need to change is the DSN, after that you are ready
to rumble.
Module Dependencies
As Sir Isaac Newton stood on the shoulders of giants, so have I. These
scripts could not function without these great modules available from the
CPAN (9):
DBI and DBD::
You need these to talk to your RDBMS at a low level.
DBIx::Recordset
The scripts use this awesome module to talk to your RDBMS as a
higher level.
DBIx::Sequence
This module provides RDBMS-independent unique primary key generation.
DBIx::DBSchema
This module provides MakeTables with a RDBMS-independent way to
generate tables.
XML::DOM
The workhorse - parses all the XML so I don't have to!
URI::Escape
Only used by UnpopulateTables to keep the XML clean.
Tested platforms
The scripts were developed using Perl 5.6.0 and also run under 5.6.1.
There's no reason why the scripts should not run on any recent Perl5
distribution. Both MySQL and PostgreSQL(11) have been tested - MySQL on
FreeBSD 4.2 and Linux (RedHat 7.1) and PostgreSQL on Linux (RedHat 7.1).
Note column lengths can get long, and PostgreSQL is by default limited
to 31-character long columns. I had to recompile PostgreSQL with a
more reasonable 64-character length limit to keep everything happy.
A Sample Run using MySQL
use XML::RDB;
# Give our DB's DSN & username/password
my $rdb = new XML::RDB(config_file => 'db_config');
# Generate RDB Schema
$rdb->make_tables("my_xml_file.xml", "db_schema_output_file");
#
# Now import the generated 'db_schema_output_file' into your DB
# (see t/1.t for an automated way to do this)
#
# Now populate our RDB
my($root_table_name, $primary_key) = $rdb->populate_tables("my_xml_file.xml");
#
# Your XML file is now in your RDB!!!! Play as desired & when ready....
#
$rdb->unpopulate_tables($root_table_name, $primary_key, 'new_xml_file.xml');
#
# That's all fine & dandy but what if you've got an XML Schema???
#
# the first 2 calls are the same:
$rdb->make_tables("my_xsd_file.xsd", "db_schema_output_file");
#
# don't forget to put 'db_schema_output_file' into your DB!
# then:
my($root_table_name, $primary_key) = $rdb->populate_tables("my_xsd_file.xsd");
# note we only need the primary key for this next call
$rdb->unpopulate_schema($primary_key, 'fully_formed.xml');
#
# Now you've got 'fully_formed.xml' - pass THAT to make_tables & yer golden:
#
$rdb->make_tables("fully_formed.xml", "REAL_RDB_schema");
#
# Now insert REAL_RDB_schema into yer DB & now any XML documents conforming
# to your original XML Schema ('my_xsd_file.xsd') can be imported into
# your schema:
my ($rt, $pk)=$rdb->populate_tables("xml_doc_conforming_to_my_xsd_file.xml");
Limitations and Future Directions
There is a program written in Java that generates XML Schemas from DTDs.
This provides a clear migration path. However, DTDs do not provide as much
information as XML Schemas do, so it would be wise not to count on automated
tools to do the complete conversion for you.
Also, once XML Schema parsers become readily available, UnpopulateSchema
should take advantage of them, since the current 'parsing' it does is pretty
basic. Finally, all XML Schemas must pass through your RDBMS, which is not
optimal.
Some XML documents rely on the order of the entities, but after under-
going into an RDBMS and back out again the order is lost. A 'nice' way to
preserve entity order would be a grand addition.
Both UnpopulateTables and UnpopulateSchema utilize the same intermediary
format from going from a RDBMS to XML. The modularization of these scripts
allow XML stored in a RDBMS to be extracted to any other format, such as
HTML. Future work in this area will produce very interesting transformations.
Also, using something like XML::Writer (10) to output XML would probably be
cleaner and lead to further benefits down the road.
Finally, both MakeTables and PopulateTables use XML::DOM, which loads the
entire XML document tree into memory. Investigation into using the Simple API
for XML (SAX) (12) to reduce memory consumption could prove very fruitful.
Also both UnpopulateTables and UnpopulateSchema load the entire RDB into memory
before outputting their transformations, so investigations into lowering the
memory footprint of these scripts will yield beneficial results.
Acknowledgments
Rob Enns and Phil Shafer for having the foresight to use XML in our
routers that eventually led to these scripts, and Cynthia Tham, fellow
member of the Juniper JUNOScript team!
References
1. Standard SQL ISO/IEC 9075:1992, "Information Technology --- Database Languages --- SQL"
2. For a discussion of XML and RDBMSs see "XML Database Products" by Ronald Bourret, http://www.rpbourret.com/xml/XMLDatabaseProds.htm
3. JUNOScript Guide http://www.juniper.net/techpubs/software/junos44/junoscript44-guide/html/junoscript44-guideTOC.html
4. XML Schema http://www.w3.org/XML/Schema
5. XML Schema Primer http://www.w3.org/TR/xmlschema-0
6. SOAP http://www.w3.org/TR/soap12/
7. SVG http://www.w3.org/Graphics/SVG/Overview.htm8
8. XSL and XSLT http://www.w3.org/Style/XSL/
9. CPAN http://www.cpan.org
10. MySQL http://www.mysql.com
11. PostgreSQL http://www.postgresql.org/
12. SAX http://www.megginson.com/SAX/
Juniper Networks is a registered trademark of Juniper Networks, Inc. Internet Processor, Internet Processor II, JUNOS, JUNOScript, M5, M10, M20, M40, and M160 are trademarks of Juniper Networks, Inc. All other trademarks, service marks, registered trademarks, or registered service marks may be the property of their respective owners. All specifications are subject to change without notice. Copyright © 2001, Juniper Networks, Inc. All rights reserved.