xml_module

The XML module enhances graph database capabilities by providing support for loading and parsing XML data.

Trait           Value
Module type     util
Implementation  Python
Parallelism     sequential

Functions

parse()

Parses an XML string or file into a map.

In the parsed output, every XML element is represented as a map. An element's child elements are stored in that map as a single key-value pair: the key is _children and the value is an array of the child elements. When simple is true, the key is not _children but the name of the parent element prefixed with an underscore (_).

Consider a root element named catalog. With simple set to false, its children are stored as _children: [child_element_1, child_element_2, ...]. With simple set to true, they are stored as _catalog: [child_element_1, child_element_2, ...]. Simple mode therefore makes nested XML elements accessible through the element name prefixed with _.
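
The mapping described above can be sketched in Python. This is only an illustration built on the standard xml.etree.ElementTree library, not the module's actual source; the helper name element_to_map and the handling of whitespace-only text are assumptions.

```python
import xml.etree.ElementTree as ET


def element_to_map(element, simple=False):
    """Convert an ElementTree element into a nested dict (illustrative only)."""
    result = {"_type": element.tag}
    # XML attributes become plain key-value pairs, for example "id": "1".
    result.update(element.attrib)
    children = [element_to_map(child, simple) for child in element]
    if children:
        # Normal mode uses "_children"; simple mode uses "_" + element name.
        key = "_" + element.tag if simple else "_children"
        result[key] = children
    # Assumption: whitespace-only text (indentation) is not reported.
    text = (element.text or "").strip()
    if text:
        result["_text"] = text
    return result


xml_string = '<catalog><book id="1"><title>Book 1</title></book></catalog>'
root = ET.fromstring(xml_string)
normal = element_to_map(root)
simple = element_to_map(root, simple=True)
```

In this sketch, normal["_children"][0]["_children"][0]["_text"] and simple["_catalog"][0]["_book"][0]["_text"] both reach the title text; only the keys along the way differ.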

Input:

  • subgraph: Graph (OPTIONAL) ➡ A specific subgraph, which is an object of type Graph returned by the project() function, on which the algorithm is run. If subgraph is not specified, the algorithm is computed on the entire graph by default.

  • xml_input: string ➡ input XML string.

  • simple: bool (default = false) ➡ specifies whether simple mode is used. Simple mode is explained in the description of this function above.

  • path: string (default = "") ➡ path to the XML file that needs to be parsed. If the path is not empty, the xml_input string is ignored, and only the file is parsed.

Output:

The output of this function is a parsed XML map.
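
As a rough sketch of the precedence between path and xml_input, the hypothetical helper read_root below prefers the file whenever path is non-empty. It assumes the standard xml.etree.ElementTree library and is not taken from the module's source.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def read_root(xml_input="", path=""):
    """Return the root element, preferring the file when path is non-empty."""
    if path:
        # A non-empty path wins; xml_input is ignored entirely.
        return ET.parse(path).getroot()
    return ET.fromstring(xml_input)


# Write a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write("<catalog/>")
    file_path = handle.name

from_string = read_root("<note/>")            # no path: the string is parsed
from_file = read_root("<note/>", file_path)   # path given: the string is ignored
os.remove(file_path)
```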

Usage:

Parsing XML from string

WITH '<catalog><book id="1"><title>Book 1</title><author>Author 1</author><publication><year>2022</year><publisher>Publisher A</publisher></publication></book><book id="2"><title>Book 2</title><author>Author 2</author><publication><year>2023</year><publisher>Publisher B</publisher></publication></book></catalog>' AS xmlString
RETURN xml_module.parse(xmlString) AS value;

Output:

{
   "_children": [
      {
         "_children": [
            {
               "_text": "Book 1",
               "_type": "title"
            },
            {
               "_text": "Author 1",
               "_type": "author"
            },
            {
               "_children": [
                  {
                     "_text": "2022",
                     "_type": "year"
                  },
                  {
                     "_text": "Publisher A",
                     "_type": "publisher"
                  }
               ],
               "_type": "publication"
            }
         ],
         "_type": "book",
         "id": "1"
      },
      {
         "_children": [
            {
               "_text": "Book 2",
               "_type": "title"
            },
            {
               "_text": "Author 2",
               "_type": "author"
            },
            {
               "_children": [
                  {
                     "_text": "2023",
                     "_type": "year"
                  },
                  {
                     "_text": "Publisher B",
                     "_type": "publisher"
                  }
               ],
               "_type": "publication"
            }
         ],
         "_type": "book",
         "id": "2"
      }
   ],
   "_type": "catalog"
}

Parsing with simple configuration

WITH '<catalog><book id="1"><title>Book 1</title><author>Author 1</author><publication><year>2022</year><publisher>Publisher A</publisher></publication></book><book id="2"><title>Book 2</title><author>Author 2</author><publication><year>2023</year><publisher>Publisher B</publisher></publication></book></catalog>' AS xmlString
RETURN xml_module.parse(xmlString, true) AS value;

Output:

{
   "_catalog": [
      {
         "_book": [
            {
               "_text": "Book 1",
               "_type": "title"
            },
            {
               "_text": "Author 1",
               "_type": "author"
            },
            {
               "_publication": [
                  {
                     "_text": "2022",
                     "_type": "year"
                  },
                  {
                     "_text": "Publisher A",
                     "_type": "publisher"
                  }
               ],
               "_type": "publication"
            }
         ],
         "_type": "book",
         "id": "1"
      },
      {
         "_book": [
            {
               "_text": "Book 2",
               "_type": "title"
            },
            {
               "_text": "Author 2",
               "_type": "author"
            },
            {
               "_publication": [
                  {
                     "_text": "2023",
                     "_type": "year"
                  },
                  {
                     "_text": "Publisher B",
                     "_type": "publisher"
                  }
               ],
               "_type": "publication"
            }
         ],
         "_type": "book",
         "id": "2"
      }
   ],
   "_type": "catalog"
}

Parsing from a file, with simple configuration

The following example shows how to parse this file:


file.xml

<catalog>
  <book id="1">
    <title>Book 1</title>
    <author>Author 1</author>
    <publication>
      <year>2022</year>
      <publisher>Publisher A</publisher>
    </publication>
  </book>
  <book id="2">
    <title>Book 2</title>
    <author>Author 2</author>
    <publication>
      <year>2023</year>
      <publisher>Publisher B</publisher>
    </publication>
  </book>
</catalog>

Cypher:

RETURN xml_module.parse("", true, "/home/demonstration/Documents/file.xml") AS value;

Output:

{
   "_catalog": [
      {
         "_book": [
            {
               "_text": "Book 1",
               "_type": "title"
            },
            {
               "_text": "Author 1",
               "_type": "author"
            },
            {
               "_publication": [
                  {
                     "_text": "2022",
                     "_type": "year"
                  },
                  {
                     "_text": "Publisher A",
                     "_type": "publisher"
                  }
               ],
               "_type": "publication"
            }
         ],
         "_type": "book",
         "id": "1"
      },
      {
         "_book": [
            {
               "_text": "Book 2",
               "_type": "title"
            },
            {
               "_text": "Author 2",
               "_type": "author"
            },
            {
               "_publication": [
                  {
                     "_text": "2023",
                     "_type": "year"
                  },
                  {
                     "_text": "Publisher B",
                     "_type": "publisher"
                  }
               ],
               "_type": "publication"
            }
         ],
         "_type": "book",
         "id": "2"
      }
   ],
   "_type": "catalog"
}

Procedures

load()

Loads and parses an XML file from a URL or a local file. The procedure supports simple mode, which is explained in the parse() section, and XPath expressions.

Because the module is implemented in Python, XPath expressions must follow, and are limited to, the XPath syntax described in the Python XML documentation. XPath implemented this way cannot use absolute paths, so an expression must start with one of these three prefixes to avoid errors: ., .. or *. The current node is the root node.
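
To illustrate the prefix rule, the following sketch assumes the procedure relies on Python's standard xml.etree.ElementTree library, which the text above does not state explicitly. The sample XML and the variable names are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<catalog><book id="1"><title>Book 1</title></book>'
    '<book id="2"><title>Book 2</title></book></catalog>'
)

# Relative expressions are evaluated from the current (root) element.
direct_children = root.findall("./book")   # children of the root named book
any_depth = root.findall(".//title")       # title elements at any depth
all_children = root.findall("*")           # every direct child of the root

# An absolute path is rejected when searching from an element.
try:
    root.findall("/catalog/book")
    absolute_error = None
except SyntaxError as error:
    absolute_error = str(error)
```

Rewriting an absolute expression such as /catalog/book as ./book, relative to the root element, avoids the error in this sketch.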

Input:

  • subgraph: Graph (OPTIONAL) ➡ A specific subgraph, which is an object of type Graph returned by the project() function, on which the algorithm is run. If subgraph is not specified, the algorithm is computed on the entire graph by default.

  • xml_url: string ➡ The input URL where the XML file is located.

  • simple: bool (default = false) ➡ A bool used for specifying whether simple mode should be used.

  • path: string (default = "") ➡ A path to the XML file that needs to be parsed. If the path is not empty, the xml_url string is ignored, and only the file is parsed.

  • xpath: string (default = "") ➡ XPath expression. If the expression is not empty, the result of the procedure is all elements satisfying the expression.

  • headers: Map (default = {}) ➡ A map of additional HTTP headers used when fetching a file from URL.

Output:

  • output_map: Map ➡ parsed XML map.

If the XPath expression is not empty, the output is all elements that satisfy the expression.
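
The headers map can be pictured as the extra HTTP headers attached to the request for the file. The sketch below uses Python's urllib.request to build such a request without sending it; whether the module itself uses urllib is an assumption, and build_request is a hypothetical helper.

```python
from urllib.request import Request


def build_request(xml_url, headers=None):
    """Build (but do not send) a request for an XML file with extra headers."""
    return Request(xml_url, headers=headers or {})


request = build_request(
    "https://www.w3schools.com/xml/note.xml",
    {"Accept": "application/xml"},
)
accept_header = request.get_header("Accept")
```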

Usage:

The first two examples in this section run the procedure on the XML file at https://www.w3schools.com/xml/note.xml.

Parse an XML file from URL

WITH "https://www.w3schools.com/xml/note.xml" AS xmlUrl
CALL xml_module.load(xmlUrl, false, "", "", {}) YIELD output_map RETURN output_map;

Output:

{
   "_children": [
      {
         "_text": "Tove",
         "_type": "to"
      },
      {
         "_text": "Jani",
         "_type": "from"
      },
      {
         "_text": "Reminder",
         "_type": "heading"
      },
      {
         "_text": "Don't forget me this weekend!",
         "_type": "body"
      }
   ],
   "_type": "note"
}

Parse with simple configuration

WITH "https://www.w3schools.com/xml/note.xml" AS xmlUrl
CALL xml_module.load(xmlUrl, true, "", "", {}) YIELD output_map RETURN output_map;

Output:

{
   "_note": [
      {
         "_text": "Tove",
         "_type": "to"
      },
      {
         "_text": "Jani",
         "_type": "from"
      },
      {
         "_text": "Reminder",
         "_type": "heading"
      },
      {
         "_text": "Don't forget me this weekend!",
         "_type": "body"
      }
   ],
   "_type": "note"
}

Parse XML from a file

Example of the file:


file.xml

<catalog>
  <book id="1">
    <title>Book 1</title>
    <author>Author 1</author>
    <publication>
      <year>2022</year>
      <publisher>Publisher A</publisher>
    </publication>
  </book>
  <book id="2">
    <title>Book 2</title>
    <author>Author 2</author>
    <publication>
      <year>2023</year>
      <publisher>Publisher B</publisher>
    </publication>
  </book>
</catalog>

Cypher:

CALL xml_module.load("", true, "/home/demonstration/Documents/file.xml", "", {}) YIELD output_map RETURN output_map;

Output:

{
   "_catalog": [
      {
         "_book": [
            {
               "_text": "Book 1",
               "_type": "title"
            },
            {
               "_text": "Author 1",
               "_type": "author"
            },
            {
               "_publication": [
                  {
                     "_text": "2022",
                     "_type": "year"
                  },
                  {
                     "_text": "Publisher A",
                     "_type": "publisher"
                  }
               ],
               "_type": "publication"
            }
         ],
         "_type": "book",
         "id": "1"
      },
      {
         "_book": [
            {
               "_text": "Book 2",
               "_type": "title"
            },
            {
               "_text": "Author 2",
               "_type": "author"
            },
            {
               "_publication": [
                  {
                     "_text": "2023",
                     "_type": "year"
                  },
                  {
                     "_text": "Publisher B",
                     "_type": "publisher"
                  }
               ],
               "_type": "publication"
            }
         ],
         "_type": "book",
         "id": "2"
      }
   ],
   "_type": "catalog"
}

Use XPath

The XPath demonstration uses the cd_catalog.xml file.

The XPath expression is ".//CD[YEAR='1988']", which returns all CD elements that have a YEAR child element whose text equals 1988. Note that XPath expressions cannot be absolute paths because of the Python implementation of XPath, so . is used as the XPath prefix in this example, meaning the search starts from the current (root) element.

WITH "https://www.w3schools.com/xml/cd_catalog.xml" AS xmlUrl
CALL xml_module.load(xmlUrl, false, "", ".//CD[YEAR='1988']", {}) YIELD output_map RETURN output_map;

Result:


{
   "_children": [
      {
         "_text": "Hide your heart",
         "_type": "TITLE"
      },
      {
         "_text": "Bonnie Tyler",
         "_type": "ARTIST"
      },
      {
         "_text": "UK",
         "_type": "COUNTRY"
      },
      {
         "_text": "CBS Records",
         "_type": "COMPANY"
      },
      {
         "_text": "9.90",
         "_type": "PRICE"
      },
      {
         "_text": "1988",
         "_type": "YEAR"
      }
   ],
   "_type": "CD"
}

{
   "_children": [
      {
         "_text": "Stop",
         "_type": "TITLE"
      },
      {
         "_text": "Sam Brown",
         "_type": "ARTIST"
      },
      {
         "_text": "UK",
         "_type": "COUNTRY"
      },
      {
         "_text": "A and M",
         "_type": "COMPANY"
      },
      {
         "_text": "8.90",
         "_type": "PRICE"
      },
      {
         "_text": "1988",
         "_type": "YEAR"
      }
   ],
   "_type": "CD"
}
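
The predicate in the expression above matches on the text of a child element; matching on an attribute would use @ instead. The sketch below contrasts the two forms, assuming Python's xml.etree.ElementTree and a small made-up catalog whose id attributes do not exist in cd_catalog.xml.

```python
import xml.etree.ElementTree as ET

# Made-up sample data; the id attributes are hypothetical.
root = ET.fromstring(
    "<CATALOG>"
    '<CD id="a"><TITLE>Stop</TITLE><YEAR>1988</YEAR></CD>'
    '<CD id="b"><TITLE>Unchain my heart</TITLE><YEAR>1987</YEAR></CD>'
    "</CATALOG>"
)

# [YEAR='1988'] keeps CD elements with a YEAR child whose text is 1988.
by_child_text = root.findall(".//CD[YEAR='1988']")

# [@id='b'] keeps CD elements whose id attribute equals b.
by_attribute = root.findall(".//CD[@id='b']")

titles_1988 = [cd.findtext("TITLE") for cd in by_child_text]
titles_id_b = [cd.findtext("TITLE") for cd in by_attribute]
```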