HowTo - How to load XML files into a database with Tabulify
Howto
This database howto will show you how to load Xml file and document into a database.
Steps
Checking that Postgres is up
Normally our standard database howto are created with Sqlite but because Sqlite does not support any SQL XML function, nor native XML data type, we have created this howto with the Postgres database.
The next steps will be using the postgres howto connection.
- You need to start the postgres docker image.
- Ping your connection to check that the database is up.
tabli connection ping postgres
The connection (postgres) has been pinged successfully
The XML file to load
We will load the XML file relation.xml
- located in the XML subdirectory of the howto directory connection
- that contains information over the relation wikipedia page
- that have been created by the mediawiki API
<?xml version="1.0"?>
<api batchcomplete="">
  <query>
    <normalized>
      <n from="Relation_(database)" to="Relation (database)"/>
    </normalized>
    <pages>
      <page _idx="23720058" pageid="23720058" ns="0" title="Relation (database)" description="The base data structure of relational databases">
        <categories>
          <cl ns="14" title="Category:Articles with example SQL code"/>
          <cl ns="14" title="Category:Database management systems"/>
          <cl ns="14" title="Category:Relational model"/>
        </categories>
      </page>
    </pages>
  </query>
</api>
The structure of an XML file
Default
An XML resource is seen by Tabulify by default as a tabular resource that have:
- one column with the name xml and the xml data type
- one row.
tabli data struct xml/relation.xml@howto
Structure of the resource xml\relation.xml@howto
position   name   data_type   precision   scale   primary_key   nullable   autoincrement   comment   
--------   ----   ---------   ---------   -----   -----------   --------   -------------   -------   
1          xml    sqlxml                                        x          false
Changing the XML column name
You can change the name of the xml column with the columnName attribute
For instance, to name the column data, you would set the --attribute option with the value column_name=data
tabli data struct --attribute column_name=data xml/relation.xml@howto
Structure of the resource xml\relation.xml@howto
position   name   data_type   precision   scale   primary_key   nullable   autoincrement   comment   
--------   ----   ---------   ---------   -----   -----------   --------   -------------   -------   
1          data   sqlxml                                        x          false
Adding context with virtual columns
You can add extra information about the XML resource loaded via the virtual column.
For instance, if you want to add the logical name.
tabli data struct --virtual-column filename=logical_name xml/relation.xml@howto
Structure of the resource xml\relation.xml@howto
position   name       data_type   precision   scale   primary_key   nullable   autoincrement   comment   
--------   --------   ---------   ---------   -----   -----------   --------   -------------   -------   
1          xml        sqlxml                                        x          false                     
2          filename   varchar                                       x          false
Loading the XML file into the Postgres database
To load an XML file, you can use any transfer command.
We will be using the insert command that will:
tabli data insert --virtual-column name=logicalName xml/relation.xml@howto wikipedia@postgres
Source                   Target               Latency (ms)   Row Count   Error   Message   
----------------------   ------------------   ------------   ---------   -----   -------   
xml\relation.xml@howto   wikipedia@postgres   68             1
We can see that there is one row loaded in the wikipedia table with the list command
tabli data list -a count wikipedia@postgres
count   
-----   
    1
Loading from a Web API
Tabulify supports also to load XML data (even JSON) from a URL.
Let's add the wikipedia XML data for the page SQL wikipedia page
This command line will work only with a bash based terminal. More … see Why Bash
tabli data insert "https://en.wikipedia.org/w/api.php?action=query&titles=SQL&format=xml&prop=description|categories" wikipedia@postgres
Source                                                                                                                           Target               Latency (ms)   Row Count   Error   Message   
------------------------------------------------------------------------------------------------------------------------------   ------------------   ------------   ---------   -----   -------   
https://en.wikipedia.org/w/api.php?action=query&titles=SQL&format=xml&prop=description%7Ccategories@https:--en.wikipedia.org-w   wikipedia@postgres   690            1
And we have now 2 rows
tabli data list -a count wikipedia@postgres
count   
-----   
    2
Analyzing the XML with the SQL XML functions
Now that our XML have been loaded, we can analyze them with SQL XML function. Postgres documents all this XML function on this page.
For instance, to extract the title and description, we can use the xpath function
SELECT
  CAST((xpath('/api/query/pages/page/@title', xml))[1] AS text) AS title,
  CAST((xpath('/api/query/pages/page/@description', xml))[1] AS text) AS description
FROM wikipedia;
Let's run this Query SQL select
tabli data print (postgres/xml_wikipedia.sql@howto)@postgres
title                 description                                               
-------------------   -------------------------------------------------------   
Relation (database)   The base data structure of relational databases           
SQL                   Language for management and use of relational databases
Next
You can explore further, the XML Postgres functions
select 
    xml, -- Because XML is an XML object, you will normally not see the content in a database IDE
    cast(xml as text) -- By casting it as text, you will see the content
    xml is document,  -- You can also check that this is a XML document
    xml_is_well_formed_document(cast(xml as text)), -- And that it's well formed
from 
    wikipedia;
or you can also try to load JSON document.