Introduction to Functional Web Testing With Twill & Selenium

Part 1 :: Extra Time :: Parsing with Beautiful Soup

Synopsis

One of the most important tools you'll need when writing tests with twill is a parser to scrape the contents of the pages that you'll be introspecting with your tests. To verify functionality from the end-user's perspective, you'll likely need to assert that specific pieces of content are being displayed. One way to do this is to scan the raw page content for the acceptance criteria with regular expressions. A better way is to break the page into its components and look at the contents of a single cell or div. This is where BeautifulSoup fits the bill.

Beautiful Soup

BeautifulSoup is an HTML/XML parser that turns your page content into an objectified hierarchy. Trust us, you don't want to end up maintaining a nasty collection of regular expressions to get to the content you need. Instead, access content by its location and identity. If you're not maintaining the templates in the application, this is a great time to collaborate with those who are so that your job is even easier.
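To make the contrast concrete, here's a minimal sketch (the html fragment and the "price" id are made up for illustration): the regular expression couples your test to the exact wording on the page, while BeautifulSoup lets you target an element by its identity.

             import re
             from BeautifulSoup import BeautifulSoup

             # A hypothetical page fragment, just for illustration
             html = '<div id="price">Total: $19.99</div>'

             # The regex approach: breaks as soon as the wording or markup shifts
             print re.search(r'Total: \$[\d.]+', html).group()

             # The parser approach: ask for the element by its identity
             soup = BeautifulSoup(html)
             print soup.find('div', {'id': 'price'}).string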

Installing it

To start using BeautifulSoup, you can use easy_install to grab it from PyPI. (Mac users: don't forget to use sudo.)

              kevins-macbook:~ kevin$ sudo easy_install BeautifulSoup
              Searching for BeautifulSoup
              Reading http://pypi.python.org/simple/BeautifulSoup/
              Reading http://www.crummy.com/software/BeautifulSoup/
              Reading http://www.crummy.com/software/BeautifulSoup/download/
              Best match: BeautifulSoup 3.1.0.1
              Downloading http://www.crummy.com/software/BeautifulSoup/download/BeautifulSoup-3.1.0.1.tar.gz
              Processing BeautifulSoup-3.1.0.1.tar.gz
              Running BeautifulSoup-3.1.0.1/setup.py -q bdist_egg --dist-dir /tmp/easy_install-JYLoUB/BeautifulSoup-3.1.0.1/egg-dist-tmp-mbTUsh
              Adding BeautifulSoup 3.1.0.1 to easy-install.pth file
              Installing testall.sh script to /usr/local/bin
              Installing to3.sh script to /usr/local/bin
              
              Installed /Library/Python/2.5/site-packages/BeautifulSoup-3.1.0.1-py2.5.egg
              Processing dependencies for BeautifulSoup
              Finished processing dependencies for BeautifulSoup
              kevins-macbook:~ kevin$
           

This will install it in your Python distribution's site-packages directory.
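If you want to confirm the install worked, a quick check from the shell will do; BeautifulSoup exposes its version as __version__, and the value printed is whatever easy_install fetched:

              kevins-macbook:~ kevin$ python -c "import BeautifulSoup; print BeautifulSoup.__version__"
              3.1.0.1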

Using it

Consider the following HTML:

             <html>
                 <head>
                     <title>My Awesome Page</title>
                 </head>
                 <body>
                     <div>
                         <div>
                             <p>This is paragraph <b>one</b>.</p>
                         </div>
                     </div>
                     <div class="b">
                         <p id="two" align="meh">This is paragraph <b>two</b>.</p>
                     </div>
                 </body>
             </html>

From the BeautifulSoup docs: "A Beautiful Soup constructor takes an HTML (or XML) document in the form of a string (or an open file-like object). It parses the document and creates a corresponding data structure in memory." Let's go ahead and give that a shot. Assuming that we've created a string object named html with the contents of our page:

             from BeautifulSoup import BeautifulSoup
             soup = BeautifulSoup(html)

             # The 'contents' attr is a list of elements within the page
             print soup.contents[0].name
             Out[1]: html

             # Since we're at the document level, we only see the html element. Everything else is contained within:
             doc = soup.contents[0]
             head = doc.contents[0]
             body = doc.contents[1]
           
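Since the constructor also accepts an open file-like object, you can feed it a response directly; and because these tests ultimately run against twill, one way to hand twill's current page over to BeautifulSoup looks like this (a sketch assuming twill's get_browser().get_html(), with a made-up URL):

             from BeautifulSoup import BeautifulSoup
             from twill import commands, get_browser

             # A sketch: navigate with twill, then parse the current page.
             # get_browser().get_html() returns the current page's HTML;
             # the URL here is hypothetical.
             commands.go('http://localhost:8080/')
             soup = BeautifulSoup(get_browser().get_html())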

Now that we have an idea of how this object is structured, let's walk the parse tree!

             print body.contents
             Out[2]: [<div><div><p>This is paragraph <b>one</b>.</p></div></div>,
                      <div class="b"><p id="two" align="meh">This is paragraph <b>two</b>.</p></div>]

             # This will find all child elements, even nested ones!
             print body.findChildren()
             Out[3]: [<div><div><p>This is paragraph <b>one</b>.</p></div></div>,
                      <div><p>This is paragraph <b>one</b>.</p></div>,
                      <p>This is paragraph <b>one</b>.</p>,
                      <b>one</b>,
                      <div class="b"><p id="two" align="meh">This is paragraph <b>two</b>.</p></div>,
                      <p id="two" align="meh">This is paragraph <b>two</b>.</p>,
                      <b>two</b>]

             # Walk the nested tree by element name
             print soup.body.div.div.p.b
             Out[4]: <b>one</b>

             # The 'string' attr on an element will give you the unicode representation
             print soup.body.div.div.p.b.string
             Out[5]: u'one'

Good stuff, but should we be expected to modify our tests every time a trigger-happy production person throws in an extra div? Of course not! That's why the search mechanism is so important. Witness the sheer power of the findAll method:

             soup.findAll('p')                     # Pass in an element name as a string
             Out[6]: [<p>This is paragraph <b>one</b>.</p>,
                      <p id="two" align="meh">This is paragraph <b>two</b>.</p>]

             soup.findAll(['p', 'b'])              # Or, pass in a list of element names to find
             Out[7]: [<p>This is paragraph <b>one</b>.</p>,
                      <b>one</b>,
                      <p id="two" align="meh">This is paragraph <b>two</b>.</p>,
                      <b>two</b>]

             soup.findAll(align="meh")             # Want to search by attribute?
             Out[8]: [<p id="two" align="meh">This is paragraph <b>two</b>.</p>]

             soup.find("div", { "class" : "b" })   # For attribute names that are reserved words
             Out[9]: <div class="b"><p id="two" align="meh">This is paragraph <b>two</b>.</p></div>

             soup.find("p", { "id" : "two" })      # Or just because...
             Out[10]: <p id="two" align="meh">This is paragraph <b>two</b>.</p>

Now you can see how important it is to communicate with the production team: together you can put dedicated ids and classes on elements to make them more accessible to your tests. Also, combining a parser like this with a base Page Class makes it that much more powerful!
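As a sketch of that last idea (the class and helper names here are made up, not part of twill or BeautifulSoup), a base Page class can fetch its URL with twill and expose the parsed soup to every test that uses it:

             from BeautifulSoup import BeautifulSoup
             from twill import commands, get_browser

             class Page(object):
                 """Hypothetical base class: fetch a page with twill, parse it once."""
                 url = None

                 def __init__(self):
                     commands.go(self.url)
                     self.soup = BeautifulSoup(get_browser().get_html())

                 def find(self, name, attrs=None):
                     """Delegate searches straight to the soup."""
                     return self.soup.find(name, attrs or {})

             class HomePage(Page):
                 url = 'http://localhost:8080/'   # made-up URL

             # In a test, assertions then read by identity instead of by regex:
             # page = HomePage()
             # assert page.find('p', {'id': 'two'}) is not None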
