python - Parsing hierarchical information from XML to sqlite -


the scripts post long beg patience. believe easy solve people familiar complexity of xml structure. have highly hierarchical xml file. love make sqlite database based on it. 2 files have far extract sibling information xml file structure (note - frequency of work varies 0 4 or 5; nodes can empty):

<program>     <id>28798031</id>     <programid>12345</programid>     <orchestra>new york philarmonic</orchestra>     <season>1842-43</season>     <concertinfo>         <eventtype>subscription season</eventtype>         <location>manhattan, ny</location>         <venue>apollo rooms</venue>         <date>1842-12-07t05:00:00z</date>         <time>8:00pm</time>     </concertinfo>     <worksinfo>         <work id="52446*">             <composername>beethoven,  ludwig  van</composername>             <worktitle>symphony no. 5 in c minor, op.67</worktitle>             <conductorname>hill, ureli corelli</conductorname>         </work>         <work id="8834*4">             <composername>weber,  carl  maria von</composername>             <worktitle>oberon</worktitle>             <movement>"ozean, du ungeheuer" (ocean, thou mighty monster), reiza (scene , aria), act ii</movement>             <conductorname>timm, henry c.</conductorname>             <soloists>                 <soloist>                     <soloistname>otto, antoinette</soloistname>                     <soloistinstrument>soprano</soloistinstrument>                     <soloistroles>s</soloistroles>                 </soloist>             </soloists>         </work>     </worksinfo> </program> 

i have 2 scripts extract data sibling nodes (the same hierarchical level).

first

import sqlite3 import xml.etree.celementtree  def insert_program(db_conn, id, programid, orchestra, season, concertinfo, worksinfo): curs = db_conn.cursor() curs.execute("insert program values (?,?,?,?,?,?)", (id, programid, orchestra, season, concertinfo, worksinfo)) db_conn.commit()  def program_data_from_element(element): id = element.find("id").text programid = element.find("programid").text orchestra = element.find("orchestra").text season = element.find("season").text concertinfo = element.find("concertinfo").text worksinfo = element.find("worksinfo").text return id, programid, orchestra, season, concertinfo, worksinfo  ## add main loop programs xml file if __name__ == "__main__": conn = sqlite3.connect("program.sqlite3") program = xml.etree.celementtree.parse("complete.xml") program = program.findall("program") index, element in enumerate(program):     id, programid, orchestra, season, concertinfo, worksinfo  = program_data_from_element(element)     insert_program(conn, id, programid, orchestra, season, concertinfo, worksinfo) 

second

import sqlite3 import xml.etree.celementtree  def insert_work(db_conn, workid, composername, worktitle, movement, conductorname): curs = db_conn.cursor() curs.execute("insert work values (?,?,?,?,?)", (workid, composername, worktitle, movement, conductorname)) db_conn.commit()  def work_data_from_element(element): workid = element.get("id") if workid != none:     workid = workid else:     workid = '' composername = element.find("composername") if composername != none:     composername = composername.text else:     composername = '' worktitle = element.find("worktitle") if worktitle != none:     worktitle = worktitle.text else:     worktitle = '' movement = element.find("movement") if movement != none:     movement = movement.text else:     movement = '' conductorname = element.find("conductorname") if conductorname != none:     conductorname = conductorname.text else:     conductorname = '' return workid, composername, worktitle, movement, conductorname   ## add main loop work information xml file if __name__ == "__main__": conn = sqlite3.connect("work.sqlite3") programs = xml.etree.celementtree.parse("complete.xml") work = programs.findall("program/worksinfo/work") index, element in enumerate(work):     workid, composername, worktitle, movement, conductorname = work_data_from_element(element)     insert_work(conn, workid, composername, worktitle, movement, conductorname) 

my issue -- how connect 2 can know programid, season, workid , conductorname, etc., together? thank you!!

whenever working nested, hierarchical xml files need flattened two-dimensional formats database tables, consider xslt solution. information, xslt special-purpose programming language designed transform xml files , general purpose languages, python can process xslt 1.0 scripts, using lxml module.

so both program , work sqlite tables, consider running following xslt scripts transform source xml file , parse xpath (sibling xslt) database import in 1 python script. way, xslt well-formed xml file can loaded file or string other xml. beauty of xslt can hard-code elements may/may not exist returns empty text missing nodes.

a few changes structure:

  1. concatenated children of <concerntinfo> (maybe use separate table?)
  2. removed <worksinfo> program parsing since reflects in work table
  3. added corresponding program <id> work parsing use foreign key in table

program xslt (save .xsl file called in python)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/xsl/transform" version="1.0"> <xsl:output version="1.0" encoding="utf-8" indent="yes" /> <xsl:strip-space elements="*"/>    <xsl:template match="/">     <program>       <xsl:apply-templates select="program"/>     </program>   </xsl:template>    <xsl:template match="program">     <id><xsl:value-of select="id"/></id>     <programid><xsl:value-of select="programid"/></programid>     <orchestra><xsl:value-of select="orchestra"/></orchestra>     <season><xsl:value-of select="season"/></season>     <concertinfo><xsl:value-of select="concat(concertinfo/eventtype, ' ',                    concertinfo/location, ' ', concertinfo/venue, ' ',                     concertinfo/date, ' ', concertinfo/time)"/></concertinfo>   </xsl:template>  </xsl:transform> 

work xslt (save .xsl called in python)

<xsl:transform xmlns:xsl="http://www.w3.org/1999/xsl/transform" version="1.0"> <xsl:output version="1.0" encoding="utf-8" indent="yes" /> <xsl:strip-space elements="*"/>    <xsl:template match="program">     <data>       <xsl:apply-templates select="worksinfo"/>     </data>   </xsl:template>    <xsl:template match="worksinfo">           <xsl:apply-templates select="work"/>       </xsl:template>    <xsl:template match="work">     <xsl:copy>       <programid><xsl:value-of select="ancestor::program/id"/></programid>       <workid><xsl:value-of select="@workid"/></workid>       <composername><xsl:value-of select="composername"/></composername>       <worktitle><xsl:value-of select="worktitle"/></worktitle>       <movement><xsl:value-of select="movement"/></movement>       <conductorname><xsl:value-of select="conductorname"/></conductorname>     </xsl:copy>   </xsl:template>  </xsl:transform> 

python script

import lxml.etree et import sqlite3  def insert_program(db_conn, id, programid, orchestra, season, concertinfo):     curs = db_conn.cursor()     curs.execute("insert program values (?,?,?,?,?,?)", (id, programid, orchestra, season, concertinfo))     db_conn.commit()  def program_data_from_element(element):     id = element.find("id").text     programid = element.find("programid").text     orchestra = element.find("orchestra").text     season = element.find("season").text     concertinfo = element.find("concertinfo").text      return id, programid, orchestra, season, concertinfo  def insert_work(db_conn, programid, workid, composername, worktitle, movement, conductorname):     curs = db_conn.cursor()     curs.execute("insert work values (?,?,?,?,?)", (programid, workid, composername, worktitle, movement, conductorname))     db_conn.commit()  def work_data_from_element(element):     programid = element.find("programid").text     workid = element.find("workid").text     composername = element.find("composername").text     worktitle = element.find("worktitle").text     movement = element.find("movement").text     conductorname = element.find("conductorname").text      return programid, workid, composername, worktitle, conductorname, movement      if __name__ == "__main__":         conn = sqlite3.connect("program.sqlite3")     xml = et.parse("complete.xml")      # program parse     xslt = et.parse("program.xsl")     transform = et.xslt(xslt)     newdom = transform(xml)     program = newdom.xpath("//program")          index, element in enumerate(program):         id, programid, orchestra, season, concertinfo = program_data_from_element(element)         insert_program(conn, id, programid, orchestra, season, concertinfo)      # work parse     xslt = et.parse("work.xsl")     transform = et.xslt(xslt)     newdom = transform(xml)     work = newdom.xpath("//work")          index, element in enumerate(work):         programid, workid, composername, worktitle, conductorname, movement = work_data_from_element(element)         insert_work(conn, programid, workid, composername, worktitle, conductorname, movement) 

Comments

Popular posts from this blog

matlab - error with cyclic autocorrelation function -

django - (fields.E300) Field defines a relation with model 'AbstractEmailUser' which is either not installed, or is abstract -

c# - What is a good .Net RefEdit control to use with ExcelDna? -