python - Parsing hierarchical information from XML to sqlite -
the scripts post long beg patience. believe easy solve people familiar complexity of xml structure. have highly hierarchical xml file. love make sqlite database based on it. 2 files have far extract sibling information xml file structure (note - frequency of work varies 0 4 or 5; nodes can empty):
<program> <id>28798031</id> <programid>12345</programid> <orchestra>new york philarmonic</orchestra> <season>1842-43</season> <concertinfo> <eventtype>subscription season</eventtype> <location>manhattan, ny</location> <venue>apollo rooms</venue> <date>1842-12-07t05:00:00z</date> <time>8:00pm</time> </concertinfo> <worksinfo> <work id="52446*"> <composername>beethoven, ludwig van</composername> <worktitle>symphony no. 5 in c minor, op.67</worktitle> <conductorname>hill, ureli corelli</conductorname> </work> <work id="8834*4"> <composername>weber, carl maria von</composername> <worktitle>oberon</worktitle> <movement>"ozean, du ungeheuer" (ocean, thou mighty monster), reiza (scene , aria), act ii</movement> <conductorname>timm, henry c.</conductorname> <soloists> <soloist> <soloistname>otto, antoinette</soloistname> <soloistinstrument>soprano</soloistinstrument> <soloistroles>s</soloistroles> </soloist> </soloists> </work> </worksinfo> </program>
i have 2 scripts extract data sibling nodes (the same hierarchical level).
first
import sqlite3 import xml.etree.celementtree def insert_program(db_conn, id, programid, orchestra, season, concertinfo, worksinfo): curs = db_conn.cursor() curs.execute("insert program values (?,?,?,?,?,?)", (id, programid, orchestra, season, concertinfo, worksinfo)) db_conn.commit() def program_data_from_element(element): id = element.find("id").text programid = element.find("programid").text orchestra = element.find("orchestra").text season = element.find("season").text concertinfo = element.find("concertinfo").text worksinfo = element.find("worksinfo").text return id, programid, orchestra, season, concertinfo, worksinfo ## add main loop programs xml file if __name__ == "__main__": conn = sqlite3.connect("program.sqlite3") program = xml.etree.celementtree.parse("complete.xml") program = program.findall("program") index, element in enumerate(program): id, programid, orchestra, season, concertinfo, worksinfo = program_data_from_element(element) insert_program(conn, id, programid, orchestra, season, concertinfo, worksinfo)
second
import sqlite3 import xml.etree.celementtree def insert_work(db_conn, workid, composername, worktitle, movement, conductorname): curs = db_conn.cursor() curs.execute("insert work values (?,?,?,?,?)", (workid, composername, worktitle, movement, conductorname)) db_conn.commit() def work_data_from_element(element): workid = element.get("id") if workid != none: workid = workid else: workid = '' composername = element.find("composername") if composername != none: composername = composername.text else: composername = '' worktitle = element.find("worktitle") if worktitle != none: worktitle = worktitle.text else: worktitle = '' movement = element.find("movement") if movement != none: movement = movement.text else: movement = '' conductorname = element.find("conductorname") if conductorname != none: conductorname = conductorname.text else: conductorname = '' return workid, composername, worktitle, movement, conductorname ## add main loop work information xml file if __name__ == "__main__": conn = sqlite3.connect("work.sqlite3") programs = xml.etree.celementtree.parse("complete.xml") work = programs.findall("program/worksinfo/work") index, element in enumerate(work): workid, composername, worktitle, movement, conductorname = work_data_from_element(element) insert_work(conn, workid, composername, worktitle, movement, conductorname)
my issue -- how connect 2 can know programid, season, workid , conductorname, etc., together? thank you!!
whenever working nested, hierarchical xml files need flattened two-dimensional formats database tables, consider xslt solution. information, xslt special-purpose programming language designed transform xml files , general purpose languages, python can process xslt 1.0 scripts, using lxml module.
so both program , work sqlite tables, consider running following xslt scripts transform source xml file , parse xpath (sibling xslt) database import in 1 python script. way, xslt well-formed xml file can loaded file or string other xml. beauty of xslt can hard-code elements may/may not exist returns empty text missing nodes.
a few changes structure:
- concatenated children of
<concerntinfo>
(maybe use separate table?) - removed
<worksinfo>
program parsing since reflects in work table - added corresponding program
<id>
work parsing use foreign key in table
program xslt (save .xsl file called in python)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/xsl/transform" version="1.0"> <xsl:output version="1.0" encoding="utf-8" indent="yes" /> <xsl:strip-space elements="*"/> <xsl:template match="/"> <program> <xsl:apply-templates select="program"/> </program> </xsl:template> <xsl:template match="program"> <id><xsl:value-of select="id"/></id> <programid><xsl:value-of select="programid"/></programid> <orchestra><xsl:value-of select="orchestra"/></orchestra> <season><xsl:value-of select="season"/></season> <concertinfo><xsl:value-of select="concat(concertinfo/eventtype, ' ', concertinfo/location, ' ', concertinfo/venue, ' ', concertinfo/date, ' ', concertinfo/time)"/></concertinfo> </xsl:template> </xsl:transform>
work xslt (save .xsl called in python)
<xsl:transform xmlns:xsl="http://www.w3.org/1999/xsl/transform" version="1.0"> <xsl:output version="1.0" encoding="utf-8" indent="yes" /> <xsl:strip-space elements="*"/> <xsl:template match="program"> <data> <xsl:apply-templates select="worksinfo"/> </data> </xsl:template> <xsl:template match="worksinfo"> <xsl:apply-templates select="work"/> </xsl:template> <xsl:template match="work"> <xsl:copy> <programid><xsl:value-of select="ancestor::program/id"/></programid> <workid><xsl:value-of select="@workid"/></workid> <composername><xsl:value-of select="composername"/></composername> <worktitle><xsl:value-of select="worktitle"/></worktitle> <movement><xsl:value-of select="movement"/></movement> <conductorname><xsl:value-of select="conductorname"/></conductorname> </xsl:copy> </xsl:template> </xsl:transform>
python script
import lxml.etree et import sqlite3 def insert_program(db_conn, id, programid, orchestra, season, concertinfo): curs = db_conn.cursor() curs.execute("insert program values (?,?,?,?,?,?)", (id, programid, orchestra, season, concertinfo)) db_conn.commit() def program_data_from_element(element): id = element.find("id").text programid = element.find("programid").text orchestra = element.find("orchestra").text season = element.find("season").text concertinfo = element.find("concertinfo").text return id, programid, orchestra, season, concertinfo def insert_work(db_conn, programid, workid, composername, worktitle, movement, conductorname): curs = db_conn.cursor() curs.execute("insert work values (?,?,?,?,?)", (programid, workid, composername, worktitle, movement, conductorname)) db_conn.commit() def work_data_from_element(element): programid = element.find("programid").text workid = element.find("workid").text composername = element.find("composername").text worktitle = element.find("worktitle").text movement = element.find("movement").text conductorname = element.find("conductorname").text return programid, workid, composername, worktitle, conductorname, movement if __name__ == "__main__": conn = sqlite3.connect("program.sqlite3") xml = et.parse("complete.xml") # program parse xslt = et.parse("program.xsl") transform = et.xslt(xslt) newdom = transform(xml) program = newdom.xpath("//program") index, element in enumerate(program): id, programid, orchestra, season, concertinfo = program_data_from_element(element) insert_program(conn, id, programid, orchestra, season, concertinfo) # work parse xslt = et.parse("work.xsl") transform = et.xslt(xslt) newdom = transform(xml) work = newdom.xpath("//work") index, element in enumerate(work): programid, workid, composername, worktitle, conductorname, movement = work_data_from_element(element) insert_work(conn, programid, workid, composername, worktitle, conductorname, movement)
Comments
Post a Comment