Python 2.x XML Parsing
XML stands for eXtensible Markup Language. For background, see the XML Tutorial on our website.
XML is designed to transport and store data.
XML is a set of rules for defining semantic markup: tags that divide a document into parts and identify those parts.
It is also a meta-markup language: it defines the syntax used to create other domain-specific, semantic, structured markup languages.
Python XML Parsing
Common XML programming interfaces include DOM and SAX. These two interfaces handle XML files differently and are suitable for different occasions.
Python offers three ways to parse XML: SAX, DOM, and ElementTree:
1. SAX (Simple API for XML)
The Python standard library includes a SAX parser. SAX uses an event-driven model: as it reads the XML, the parser fires events and calls user-defined callback functions to process the data.
2. DOM (Document Object Model)
DOM parses the XML data into a tree held in memory; you work with the XML by manipulating that tree.
3. ElementTree
ElementTree is like a lightweight DOM with a convenient, friendly API. It is easy to use, fast, and light on memory.
Note: Because DOM must map the whole XML document to a tree in memory, it is comparatively slow and memory-intensive. SAX reads the XML file as a stream, which is faster and uses less memory, but requires you to implement callback functions (handlers).
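As a rough illustration of the ElementTree style, the sketch below parses a small inline sample of movie records. The helper list_movies and the constant XML_TEXT are names made up for this example; xml.etree.ElementTree itself ships with the standard library.

```python
import xml.etree.ElementTree as ET

# Illustrative sample data only; not the full movies.xml file.
XML_TEXT = """<collection>
  <movie title="Enemy Behind">
    <type>War, Thriller</type>
    <format>DVD</format>
  </movie>
  <movie title="Ishtar">
    <type>Comedy</type>
    <format>VHS</format>
  </movie>
</collection>"""


def list_movies(xml_text):
    """Return (title, type, format) tuples for every <movie> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for movie in root.findall("movie"):
        rows.append((movie.get("title"),
                     movie.findtext("type"),
                     movie.findtext("format")))
    return rows


for title, kind, fmt in list_movies(XML_TEXT):
    print("%s | %s | %s" % (title, kind, fmt))
```

Like DOM, this builds the whole tree in memory, so it suits documents that fit comfortably in RAM.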
The XML example file movies.xml used in this chapter contains the following content:
movies.xml
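<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>
Python Uses SAX to Parse XML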
SAX is an event-driven API.
Parsing an XML document with SAX involves two parts: a parser and an event handler.
The parser is responsible for reading the XML document and sending events to the event handler, such as element start and element end events.
The event handler is responsible for responding to events and processing the passed XML data.
SAX is a good fit in the following situations:
- 1. Processing large files.
- 2. Needing only part of a file's content, or extracting specific information from it.
- 3. Building your own object model.
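The split between parser and event handler described above can be sketched as follows. EventLogger is a hypothetical handler written for this example; it only records the start and end events the parser sends. The input is passed as a byte string so the snippet should behave the same on Python 2 and Python 3.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records every element event it receives."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


logger = EventLogger()
# The parser reads the XML and pushes events into the handler.
xml.sax.parseString(b"<movie><type>Comedy</type></movie>", logger)
for event in logger.events:
    print(event)
```

The handler never sees the document as a whole, only one event at a time, which is why SAX memory use stays low on large files.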
In Python, using the SAX method to process XML requires importing the parse function from xml.sax and ContentHandler from xml.sax.handler.
Introduction to ContentHandler Class Methods
characters(content) method
Called by the parser to deliver character data, in these cases:
- From the start of a line up to the next tag: if there is any text, content holds that text.
- From one tag up to the next tag: if there is any text, content holds that text.
- From a tag up to the next line break: if there is any text, content holds that text.
In each case the tag can be a start tag or an end tag.
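Because a SAX parser may deliver the text of one element in several characters() calls, a common pattern is to buffer the pieces and join them when the element ends. The sketch below assumes a hypothetical TextCollector handler and a made-up sample string; how many chunks the parser produces is parser-dependent, so the code only counts them.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that joins character chunks per element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.buffer = []      # chunks seen since the last tag
        self.texts = {}       # element name -> joined text
        self.chunk_count = 0  # number of characters() calls

    def startElement(self, name, attrs):
        self.buffer = []

    def characters(self, content):
        self.chunk_count += 1
        self.buffer.append(content)

    def endElement(self, name):
        # Join whatever arrived since the last tag, then reset the buffer.
        self.texts[name] = "".join(self.buffer).strip()
        self.buffer = []


collector = TextCollector()
xml.sax.parseString(
    b"<movie><description>War &amp; Thriller</description></movie>",
    collector)
print(collector.texts["description"])
print("characters() calls: %d" % collector.chunk_count)
```

Assigning content directly to an attribute, instead of appending to a buffer, keeps only the last chunk and can silently truncate text.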
startDocument() method
Called when the document starts.
endDocument() method
Called when the parser reaches the end of the document.
startElement(name, attrs) method
Called when the parser encounters an XML start tag. name is the tag name, and attrs is a dictionary-like object holding the tag's attributes.
endElement(name) method
Called when encountering an XML end tag.
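To see these callbacks together, here is a minimal sketch that logs the document events and reads an attribute in startElement. LifecycleHandler and the inline sample are hypothetical; attrs.get is used so that a missing attribute falls back to a placeholder instead of raising KeyError.

```python
import xml.sax


class LifecycleHandler(xml.sax.ContentHandler):
    """Hypothetical handler that logs document events and movie titles."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.log = []

    def startDocument(self):
        self.log.append("document start")

    def endDocument(self):
        self.log.append("document end")

    def startElement(self, name, attrs):
        # attrs behaves like a read-only dictionary of attribute values.
        if name == "movie":
            self.log.append("movie: %s" % attrs.get("title", "(untitled)"))


handler = LifecycleHandler()
xml.sax.parseString(
    b'<collection><movie title="Trigun"/><movie/></collection>', handler)
for entry in handler.log:
    print(entry)
```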
make_parser Method
The following method creates a new parser object and returns it.
xml.sax.make_parser([parser_list])
Parameter Description:
- parser_list - Optional parameter, a list of parser module names to try; the first one found is used
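A minimal sketch of using the returned parser object directly: attach a handler, then feed it a source. ElementCounter is a hypothetical handler, and an in-memory io.BytesIO stream stands in for a real file here.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces


class ElementCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts how often each tag name appears."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


parser = xml.sax.make_parser()
# Namespace processing off, so startElement() is called with plain names.
parser.setFeature(feature_namespaces, 0)
counter = ElementCounter()
parser.setContentHandler(counter)
# parse() accepts a filename or a file-like object.
parser.parse(io.BytesIO(b"<collection><movie/><movie/></collection>"))
print(counter.counts)
```

Creating the parser yourself is useful when you need to set features or reuse one configuration for several documents.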
parse Method
The following method creates a SAX parser and uses it to parse an XML document:
xml.sax.parse(xmlfile, contenthandler[, errorhandler])
Parameter Description:
- xmlfile - XML filename or file object
- contenthandler - Must be a ContentHandler object
- errorhandler - If this parameter is specified, errorhandler must be a SAX ErrorHandler object
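The optional errorhandler argument can be sketched like this. CollectingErrorHandler and check_file are hypothetical names; the helper writes the bytes to a temporary file so that xml.sax.parse can be called with a filename, records any fatal error, and re-raises it as the default handler would.

```python
import os
import tempfile
import xml.sax


class CollectingErrorHandler(xml.sax.ErrorHandler):
    """Hypothetical error handler that records fatal errors, then re-raises."""

    def __init__(self):
        self.messages = []

    def fatalError(self, exception):
        self.messages.append("line %d: %s" % (exception.getLineNumber(),
                                              exception.getMessage()))
        raise exception


def check_file(xml_bytes):
    """Write xml_bytes to a temp file, parse it, return recorded errors."""
    errors = CollectingErrorHandler()
    handle, path = tempfile.mkstemp(suffix=".xml")
    try:
        with os.fdopen(handle, "wb") as tmp:
            tmp.write(xml_bytes)
        xml.sax.parse(path, xml.sax.ContentHandler(), errors)
    except xml.sax.SAXParseException:
        pass  # already recorded by the error handler
    finally:
        os.remove(path)
    return errors.messages


print(check_file(b"<movie><type>Comedy</type></movie>"))  # well-formed
print(check_file(b"<movie><type>Comedy</movie>"))         # mismatched tag
```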
parseString Method
The parseString method creates an XML parser and parses the XML string:
xml.sax.parseString(xmlstring, contenthandler[, errorhandler])
Parameter Description:
- xmlstring - XML string
- contenthandler - Must be a ContentHandler object
- errorhandler - If this parameter is specified, errorhandler must be a SAX ErrorHandler object
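When the XML is already in memory, parseString avoids a round trip through a file. The sketch below assumes a hypothetical StarsHandler and sample data; it passes a byte string, which both Python 2 and Python 3 accept.

```python
import xml.sax


class StarsHandler(xml.sax.ContentHandler):
    """Hypothetical handler that collects the value of every <stars> element."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.in_stars = False
        self.chunks = []
        self.values = []

    def startElement(self, name, attrs):
        if name == "stars":
            self.in_stars = True
            self.chunks = []

    def characters(self, content):
        if self.in_stars:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "stars":
            self.values.append(int("".join(self.chunks)))
            self.in_stars = False


def star_ratings(xml_bytes):
    """Parse an XML byte string and return all star ratings as integers."""
    handler = StarsHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.values


SAMPLE = (b"<collection>"
          b"<movie><stars>10</stars></movie>"
          b"<movie><stars>8</stars></movie>"
          b"</collection>")
print(star_ratings(SAMPLE))
```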
Python Parses XML Example
Example
import xml.sax


class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title: %s" % title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type: %s" % self.type)
        elif self.CurrentData == "format":
            print("Format: %s" % self.format)
        elif self.CurrentData == "year":
            print("Year: %s" % self.year)
        elif self.CurrentData == "rating":
            print("Rating: %s" % self.rating)
        elif self.CurrentData == "stars":
            print("Stars: %s" % self.stars)
        elif self.CurrentData == "description":
            print("Description: %s" % self.description)
        self.CurrentData = ""

    # Called with the character data inside an element
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content


if __name__ == "__main__":
    # Create an XMLReader
    parser = xml.sax.make_parser()
    # Turn off namespace processing
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # Replace the default ContentHandler with our own
    Handler = MovieHandler()
    parser.setContentHandler(Handler)

    parser.parse("movies.xml")
The execution result of the above code is as follows:
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom
For complete SAX API documentation, please refer to the Python SAX APIs reference.
Using xml.dom to Parse XML
The Document Object Model (DOM) is a standard programming interface recommended by the W3C for processing XML documents.
A DOM parser reads the entire document at once when parsing an XML document, saving all elements of the document in a tree structure in memory. Then you can use different functions provided by DOM to read or modify the content and structure of the document, or write the modified content back to an XML file.
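The read, modify, and write-back cycle described above can be sketched with xml.dom.minidom. The helper update_stars and the inline sample are hypothetical; toxml() returns the serialized markup as a string, which you could then write to a file.

```python
from xml.dom import minidom


def update_stars(xml_bytes, new_value):
    """Set the text of the first <stars> element and return the new markup."""
    dom = minidom.parseString(xml_bytes)
    stars = dom.getElementsByTagName("stars")[0]
    # The element's text lives in a child text node.
    stars.firstChild.data = str(new_value)
    # Serialize the root element (without the XML declaration).
    return dom.documentElement.toxml()


SAMPLE = b'<collection><movie title="Ishtar"><stars>2</stars></movie></collection>'
print(update_stars(SAMPLE, 3))
```

The whole tree is in memory while this runs, which is what makes in-place edits simple and also what makes DOM costly for very large files.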
In Python, xml.dom.minidom is used to parse XML files. The example is as follows:
Example
import xml.dom.minidom

# Open the XML document with the minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element : %s" % collection.getAttribute("shelf"))

# Get all movies in the collection
movies = collection.getElementsByTagName("movie")

# Print the details of each movie
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    # getElementsByTagName returns a list, so take the first match
    type = movie.getElementsByTagName('type')[0]
    print("Type: %s" % type.childNodes[0].data)
    format = movie.getElementsByTagName('format')[0]
    print("Format: %s" % format.childNodes[0].data)
    rating = movie.getElementsByTagName('rating')[0]
    print("Rating: %s" % rating.childNodes[0].data)
    description = movie.getElementsByTagName('description')[0]
    print("Description: %s" % description.childNodes[0].data)