Atropine Documentation

written on 2005-10-20 by Moe Aboulkheir (moe@divmod.com)

please note: using this library is not as complicated as it sounds.  it consists of only 275 lines of python,
which is several orders of magnitude shorter than this documentation.

Ideas

It is better to get no data than to get the wrong data
The key to screen-scraping the right data is to make a painful number of assertions about document structure

Examples

here is a simple example session:

    from atropine import go, check, special
    from atropine.atropine import Atropine
    import re

    atropine = Atropine('''
      <table id="earningsTable">
        <tbody>
          <tr>
            <td class="headerTableCell"> Quarterly Earnings </td>
            <td class="dataTableCell"> GBP 123.45 </td>
          </tr>
        </tbody>
      </table>''', ignorewhitespace=True)

    qearningsregex = re.compile(r'quarterly earnings', re.IGNORECASE)

    atropine = atropine.resolve(
        go.only(tag='table', attrs=dict(id='earningsTable')),
        go.child(0), check.has(tag='tbody'),
        go.child(0), check.has(tag='tr'),
        go.child(0), check.has(tag='td', cls='headerTableCell', onlytext=qearningsregex),
        go.nextsib, check.has(tag='td', cls='dataTableCell'),
        special.collect('earnings-info', alltext=True))

    (currency, amount) = atropine.collection['earnings-info']
    amount = int(float(amount) * 100)
    # store these variables somewhere

Atropine(html, ignorewhitespace=True)

just like BeautifulSoup, html can be a string or a file-like object.  the ignorewhitespace argument
determines whether text nodes consisting only of whitespace characters are ignored.

Atropine.soup

instance variable representing the underlying BeautifulSoup instance.  this will generally
be the result of parsing the html passed to Atropine.

Atropine.current

instance variable that represents the current tag (as a BeautifulSoup.Tag instance).  this will
only be set to something sensible when an Atropine.resolve call is underway.

Atropine.registerchecker(name, function)

this method does pretty much what the signature says: it registers the callable passed as function
as a checker under the name passed as name.  checker functions are described in the Null Resolvers section.

Atropine.getchecker(name)

returns the checker function associated with name

Atropine.istextnode(tag)

static method.  returns a boolean indicating whether tag (a BeautifulSoup.Tag instance)
is a text node

Atropine.onlytext(tag)

returns the contents of tag if it contains exactly one element and that element is a text node.
otherwise it raises an exception

Atropine.assimilate(tag)

this method sets the value of self.current to tag, while asserting that tag is a sane value
(that is, not None or BeautifulSoup.Null).  it is typically called only by directional resolvers.

Atropine.resolve(resolver, [resolver, ...])

resolve takes any number of callables as its arguments and applies them in the order given.
these callables generally locate some node in the document and set it as the current node of
the associated Atropine instance, so they are called resolvers.  there are a bunch of built-in
resolvers: directional resolvers live in the atropine.go module, and null resolvers live in
the atropine.check module.  resolvers that do other things (such as special.collect, used in
the example session) live in atropine.special.

a null resolver is any resolver that makes assertions about the current node but doesn't
change it.  conversely, a directional resolver is one which locates some node and sets it
as the current one.

resolve returns a new Atropine instance that represents the current tag at the end of the
resolve call
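
the resolver model described above can be pictured with a small stand-in.  this is only a sketch
in plain python, not Atropine's source: ToyAtropine, tochild and hastag are hypothetical names,
nodes are plain dicts rather than BeautifulSoup tags, and leaving the original instance untouched
is an assumption drawn from "returns a new Atropine instance".

```python
class ToyAtropine:
    """Hypothetical stand-in for Atropine; nodes are plain dicts."""

    def __init__(self, node):
        self.current = node

    def resolve(self, *resolvers):
        # assumption: work on a fresh instance and hand that back
        result = ToyAtropine(self.current)
        for resolver in resolvers:
            resolver(result)  # each resolver may move or inspect result.current
        return result


def tochild(n):
    """Directional resolver (hypothetical): move to the nth child."""
    def resolver(toy):
        toy.current = toy.current['children'][n]
    return resolver


def hastag(name):
    """Null resolver (hypothetical): assert something, move nowhere."""
    def resolver(toy):
        if toy.current['tag'] != name:
            raise AssertionError(
                'expected <%s>, found <%s>' % (name, toy.current['tag']))
    return resolver


document = {'tag': 'table', 'children': [
    {'tag': 'tbody', 'children': [
        {'tag': 'tr', 'children': []}]}]}

start = ToyAtropine(document)
end = start.resolve(tochild(0), hastag('tbody'), tochild(0), hastag('tr'))
print(end.current['tag'])  # prints: tr
```

interleaving a null resolver after every directional one, as here, is what lets a chain fail
loudly as soon as the document stops looking the way you expected.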

Directional Resolvers
    
go.child(n)

    set the current tag to the nth child of the current tag.  n counts from 0, so go.child(0),
    as used in the example session, selects the first child
    
go.only(tag=None, cls=None, attrs=None)

    assert that the current tag has exactly one child that meets the given criteria,
    and set that child as the current tag.  cls expands to 'class', which is a reserved
    word in python.  tag and cls can be strings or sequences of strings.  attrs is a
    dictionary whose values can be strings, sequences of strings, regular expression
    objects, or functions which accept a string and return a boolean.

    examples:

    go.only(cls=('textbox-container', 'button-container'))

        will assert that the current tag has exactly one child whose 'class' attribute
        has a value of either 'textbox-container' or 'button-container', and will
        set that child as the current tag

    go.only(tag='tr', attrs=dict(id='123'))

        will assert that the current tag has exactly one child with a tag name of 'tr'
        and an attribute 'id' with the value '123', and will set that child as the current tag.

    go.only()

        will assert that the current tag has exactly one child, and will set that child as the current tag.
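
the kinds of value that attrs accepts can be illustrated with a small matching function.
this is a sketch under stated assumptions, not Atropine's implementation: matches is a
hypothetical helper, a missing attribute is assumed never to match, and regular expressions
are assumed to be applied with search rather than match.

```python
import re


def matches(value, criterion):
    """Hypothetical helper: does an attribute value satisfy a criterion?"""
    if value is None:
        return False                        # attribute absent (assumed to fail)
    if isinstance(criterion, str):
        return value == criterion           # plain string: exact comparison
    if hasattr(criterion, 'search'):
        # compiled regular expression (search semantics are an assumption)
        return criterion.search(value) is not None
    if callable(criterion):
        return bool(criterion(value))       # function taking a string
    return value in criterion               # sequence of strings: any may match


print(matches('123', '123'))
print(matches('button-container', ('textbox-container', 'button-container')))
print(matches('row-17', re.compile(r'^row-\d+$')))
print(matches('42', str.isdigit))
```

each of the four calls corresponds to one of the accepted kinds of value: a string, a sequence
of strings, a regular expression object and a function.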
            
go.parent(n)

    set the current tag to the nth parent of the current tag.
            
go.prevsib

    set the current tag to the previous sibling of the current tag.
    this is a predicate resolver: it takes no user-supplied arguments, so it is passed to
    resolve as go.prevsib, without being called (compare go.nextsib in the example session)
        
go.nextsib

    same as go.prevsib, but sets the current tag to the next sibling of the current tag

go.nth(n, tag=None, cls=None, attrs=None)

    set the current tag to the nth child of the current tag which meets the given criteria.  the keyword
    arguments are the same as for go.only
        
Writing Your Own Directional Resolver
        
    import random

    def randomchild(atropine):
        # Atropine.assimilate is identical to assigning to
        # atropine.current, but it asserts that its argument is not
        # BeautifulSoup.Null or None
        atropine.assimilate(random.choice(atropine.current.contents))

    # then use it just as you would any other resolver
    atropine.resolve(randomchild)

Null Resolvers

check.has(**k)

check that the current tag meets all of the given criteria

check.doesnthave(**k)

inverse of check.has()

The arguments accepted by check.has and check.doesnthave are keyword arguments that name "checker"
functions.  a checker is a function that accepts two arguments: an Atropine instance, and the value
of the keyword argument.  it is expected to return a boolean indicating whether the current tag
matches.  all check.has does is call each checker in turn, raising an exception if any one of them
returns False.  check.doesnthave does the same, but raises an exception if any checker returns True.
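
the dispatch described above can be sketched as follows.  this is an illustration only, not
Atropine's source: CheckerHost is a hypothetical stand-in for an Atropine instance, nodes are
plain dicts, and AssertionError is an assumed exception type (the text above only says that
an exception is raised).

```python
class CheckerHost:
    """Hypothetical stand-in holding a current node and a checker registry."""

    def __init__(self, current):
        self.current = current
        self._checkers = {}

    def registerchecker(self, name, function):
        self._checkers[name] = function

    def getchecker(self, name):
        return self._checkers[name]


def has(**criteria):
    """Build a null resolver that fails if any named checker returns False."""
    def resolver(host):
        for name, value in criteria.items():
            if not host.getchecker(name)(host, value):
                raise AssertionError('%s=%r did not match' % (name, value))
    return resolver


def doesnthave(**criteria):
    """Build a null resolver that fails if any named checker returns True."""
    def resolver(host):
        for name, value in criteria.items():
            if host.getchecker(name)(host, value):
                raise AssertionError('%s=%r matched' % (name, value))
    return resolver


host = CheckerHost({'tag': 'td'})
host.registerchecker('tag', lambda h, value: h.current['tag'] == value)

has(tag='td')(host)          # passes silently
doesnthave(tag='tr')(host)   # passes silently
```

neither resolver touches host.current, which is what makes them null resolvers.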

Writing Your Own Checkers

    def ntextnodes(atropine, n):
        # check.equal is a utility function equal(x, y) that returns
        # x == y if y is not a sequence, or x in y if y is a sequence
        textnodes = [t for t in atropine.current.contents
                     if atropine.istextnode(t)]
        return check.equal(len(textnodes), n)

    atropine.registerchecker('ntextnodes', ntextnodes)

    # you can now use this like so:
    atropine.resolve(check.has(tag='td', ntextnodes=4))
    atropine.resolve(check.has(tag='td', ntextnodes=(1, 2, 3, 4)))
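
the behaviour of check.equal described in the comment above can be sketched in a few lines.
this is a stand-in written from that description, not the library's code, and it assumes that
"sequence" means a list or tuple, so that a string is compared as a single value.

```python
def equal(x, y):
    """Stand-in for check.equal, written from its description only."""
    if isinstance(y, (list, tuple)):   # assumption: strings are not sequences here
        return x in y
    return x == y


print(equal(4, 4))
print(equal(4, (1, 2, 3, 4)))
print(equal(5, (1, 2, 3, 4)))
```

this is why the same ntextnodes checker accepts either a single count or a tuple of acceptable counts.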