Atropine Documentation

written on 2005-10-20 by Moe Aboulkheir (moe@divmod.com)

please note: using this library is not as complicated as it sounds. it consists of only 275 lines of python,
which is several orders of magnitude shorter than this documentation.

Ideas

Examples

here is a simple example session:

from atropine import go, check, special
from atropine.atropine import Atropine
import re

atropine = Atropine('''
  <!-- snip -->
  <table id="earningsTable">
    <tbody>
      <tr>
        <td class="headerTableCell">
          Quarterly Earnings
        </td>
        <td class="dataTableCell">
          <span class="unhelpfulClassName">GBP</span>
          <span class="unhelpfulClassName">123.45</span>
        </td>
      </tr>
    </tbody>
  </table>''', ignorewhitespace=True)

qearningsregex = re.compile(r'quarterly earnings', re.IGNORECASE)
atropine = atropine.resolve(go.only(tag='table', attrs=dict(id='earningsTable')),

                            go.child(0), check.has(tag='tbody'),
                            go.child(0), check.has(tag='tr'),
                            go.child(0), check.has(tag='td',
                                                   cls='headerTableCell',
                                                   onlytext=qearningsregex),

                            go.nextsib,  check.has(tag='td', cls='dataTableCell'),

                            special.collect('earnings-info', alltext=True))

(currency, amount) = atropine.collection['earnings-info']
amount = int(round(float(amount) * 100))  # convert to hundredths; round() guards against float error
# store these variables somewhere
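
the float conversion above goes through binary floating point, where a product such as 1.15 * 100 can land just below the intended integer. as a rough sketch (to_minor_units is a hypothetical helper, not part of atropine), the standard decimal module keeps the arithmetic exact:

```python
from decimal import Decimal


def to_minor_units(amount_text):
    """Convert a decimal string such as '123.45' to an integer count of hundredths."""
    # hypothetical helper: Decimal parses the string exactly, so no binary rounding occurs
    return int(Decimal(amount_text.strip()) * 100)


print(to_minor_units("123.45"))    # 12345
print(to_minor_units("1.15"))      # 115
print(int(float("1.15") * 100))    # 114 with IEEE 754 doubles: the float product is just under 115
```

whether this matters depends on the amounts being scraped. for values that are always well formed, rounding the float product is usually enough.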

Reference

Atropine(html, ignorewhitespace=True)
just like BeautifulSoup, html can be a string or a file-like object. the ignorewhitespace argument determines whether text nodes consisting only of whitespace characters are ignored (True) or considered (False).

Atropine.soup
instance variable representing the underlying BeautifulSoup instance. this will generally be the result of parsing the html passed to Atropine.

Atropine.current

instance variable that represents the current tag (as a BeautifulSoup.Tag instance). this will only be set to something sensible when an Atropine.resolve call is underway.

Atropine.registerchecker(name, function)

this method does pretty much what the signature says - it registers the checker function passed as function under the name given by name. the null resolvers section talks about checker functions.

Atropine.getchecker(name)

returns the checker function associated with name

Atropine.istextnode(tag)

static method - returns a boolean indicating whether the BeautifulSoup.Tag instance tag is a text node

Atropine.onlytext(tag)

returns the contents of tag, if it contains only one element, which also happens to be a text node. otherwise it explodes

Atropine.assimilate(tag)

this method sets the value of self.current to tag, while asserting that whatever it is passed is a sane value. it is typically called only by directional resolvers.

Atropine.resolve(resolver, [resolver, ...])

resolve takes any number of callables as its arguments. these callables generally locate some node in the document and set the current node in the associated Atropine instance, so they are called resolvers. there are a bunch of built-in resolvers - directional resolvers live in the atropine.go module, and null resolvers live in the atropine.check module. resolvers that do weird things live in atropine.special.

a null resolver is defined as any resolver that asserts some stuff about the current node, but doesn't change it - conversely, a directional resolver is one which locates some node and sets it as the current one.

resolve returns a new Atropine instance that represents the current tag at the end of the resolve call
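
to make the directional/null distinction concrete, here is a toy model of a resolve chain. it is not atropine's implementation: ToyWalker, child and has_tag are hypothetical stand-ins, the document is a tree of plain dicts rather than BeautifulSoup tags, and raising AssertionError on failure is an assumption.

```python
class ToyWalker:
    """Hypothetical stand-in for an Atropine instance; not the real class."""

    def __init__(self, current):
        self.current = current

    def resolve(self, *resolvers):
        # run the chain on a fresh walker so the original is left untouched
        walker = ToyWalker(self.current)
        for resolver in resolvers:
            resolver(walker)
        return walker


def child(n):
    """Directional resolver: move to the nth child of the current node."""
    def resolver(walker):
        walker.current = walker.current["children"][n]
    return resolver


def has_tag(name):
    """Null resolver: assert something about the current node without moving."""
    def resolver(walker):
        found = walker.current["tag"]
        if found != name:
            raise AssertionError("expected <%s>, found <%s>" % (name, found))
    return resolver


tree = {"tag": "table", "children": [
    {"tag": "tbody", "children": [
        {"tag": "tr", "children": []},
    ]},
]}

start = ToyWalker(tree)
end = start.resolve(child(0), has_tag("tbody"), child(0), has_tag("tr"))
print(end.current["tag"])    # the chain ends on the <tr> node
print(start.current["tag"])  # the starting walker still points at <table>
```

each resolver is just a callable that receives the walker, which is why user-written resolvers can be mixed freely with the built-in ones.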

Directional Resolvers

go.child(n)
set the current tag to the nth child of the current tag

go.only(tag=None, cls=None, attrs=None)
assert that the current tag has exactly one child that meets the given criteria, and set that child as the current tag. cls expands to 'class', which is a reserved word in python. all of the keyword arguments can be strings or sequences of strings, and the values in the attrs dictionary can be strings, sequences of strings, regular expression objects or functions which accept a string and return a boolean.

examples:

go.only(cls=('textbox-container', 'button-container'))
will assert that the current tag only has one child whose 'class' attribute has a value of either 'textbox-container' or 'button-container', and will set that child as the current tag

go.only(tag='tr', attrs=dict(id='123'))
will assert that the current tag only has one child with a tag name of 'tr' and an attribute 'id' with the value '123', and will set the current tag to this tag.

go.only()
will assert the current tag only has one child, and will set the current tag to this tag.

go.parent(n)
set the current tag to the nth parent of the current tag.

go.prevsib
set the current tag to the previous sibling of the current tag. this is a predicate resolver: it takes no user-supplied arguments, so it is passed to resolve as go.prevsib, without parentheses

go.nextsib
same as go.prevsib, but sets the current tag to the next sibling of the current tag

go.nth(n, tag=None, cls=None, attrs=None)
set the current tag to the nth child of the current tag which meets the given criteria. the keyword arguments are the same as go.only()
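
the criteria accepted by go.only and go.nth can be strings, sequences of strings and, for attrs values, regular expression objects or functions. a minimal sketch of how one value might be tested against one criterion follows. matches is a hypothetical helper, and using search (rather than match) for regular expressions is an assumption, not something this documentation specifies:

```python
import re


def matches(value, criterion):
    """Hypothetical helper: does one attribute value satisfy one criterion?"""
    if hasattr(criterion, "search"):      # compiled regular expression object
        return criterion.search(value) is not None
    if callable(criterion):               # function taking a string, returning a boolean
        return bool(criterion(value))
    if isinstance(criterion, str):        # plain string: exact comparison
        return value == criterion
    return value in criterion             # anything else: treated as a sequence of strings


print(matches("123", "123"))
print(matches("button-container", ("textbox-container", "button-container")))
print(matches("row-7", re.compile(r"^row-\d+$")))
print(matches("earningsTable", lambda text: text.endswith("Table")))
```

all four calls report a match. a string that is not in the given sequence, such as matches("c", ("a", "b")), does not.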

Writing Your Own Directional Resolver

import random

def randomchild(atropine):
  # Atropine.assimilate is identical to assigning to
  # atropine.current, but it asserts that its argument is not
  # BeautifulSoup.Null or None

  atropine.assimilate(random.choice(atropine.current.contents))

# then use it just as you would any other resolver
atropine.resolve(randomchild)

Null Resolvers

check.has(**k)
check that the current tag meets all of the given criteria

check.doesnthave(**k)
inverse of check.has()

the arguments accepted by check.has and check.doesnthave are keyword arguments that name "checker" functions. a checker is a function that accepts two arguments: an Atropine instance and the value of the keyword argument. checkers are expected to return a boolean indicating whether the current tag matches, so all check.has does is call each checker in turn, raising an exception if any one of them returns False. check.doesnthave does the same, but explodes if any checker returns True.
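
the loop described above can be sketched in a few lines. this is a toy model under stated assumptions, not the library's code: the CHECKERS registry, the module-level registerchecker, the dict-based current node and the use of AssertionError are all hypothetical.

```python
from types import SimpleNamespace

# toy registry mapping checker names to functions (hypothetical)
CHECKERS = {}


def registerchecker(name, function):
    CHECKERS[name] = function


def has(**criteria):
    """Build a null resolver that fails if any named checker returns False."""
    def resolver(walker):
        for name, value in criteria.items():
            if not CHECKERS[name](walker, value):
                raise AssertionError("checker %r rejected %r" % (name, value))
    return resolver


def doesnthave(**criteria):
    """Build a null resolver that fails if any named checker returns True."""
    def resolver(walker):
        for name, value in criteria.items():
            if CHECKERS[name](walker, value):
                raise AssertionError("checker %r accepted %r" % (name, value))
    return resolver


# each checker takes the walker and the keyword argument's value
registerchecker("tag", lambda walker, value: walker.current["tag"] == value)
registerchecker("id", lambda walker, value: walker.current.get("id") == value)

walker = SimpleNamespace(current={"tag": "td", "id": "total"})
has(tag="td", id="total")(walker)    # passes silently
doesnthave(id="subtotal")(walker)    # passes silently
print("both checks passed")
```

neither resolver touches walker.current, which is what makes them null resolvers.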

Writing Your Own Checkers

def ntextnodes(atropine, n):
  # check.equal is a utility function equal(x, y) that returns
  # x == y if y is not a sequence, or x in y if y is a sequence
  # (a list is built here because a generator has no len())
  return check.equal(len([t for t in atropine.current.contents
                            if atropine.istextnode(t)]), n)

atropine.registerchecker('ntextnodes', ntextnodes)

# you can now use this like so:
atropine.resolve(check.has(tag='td', ntextnodes=4))
atropine.resolve(check.has(tag='td', ntextnodes=(1, 2, 3, 4)))
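
the comment above describes check.equal only in passing. a stand-alone sketch of that behaviour might look like the following. this is not the library's code: which types count as a sequence is an assumption (here lists, tuples and sets), and strings are deliberately compared whole.

```python
def equal(x, y):
    """Return x == y for a single value, or x in y for a sequence of values."""
    # assumption: only these container types are treated as sequences
    if isinstance(y, (list, tuple, set, frozenset)):
        return x in y
    return x == y


print(equal(4, 4))               # single value: plain equality
print(equal(3, (1, 2, 3, 4)))    # sequence: membership test
print(equal("t", "td"))          # strings are compared whole, not by substring
```

this is what lets the same checker accept both ntextnodes=4 and ntextnodes=(1, 2, 3, 4).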

Simple Checkers

all simple checkers accept a string or integer, or a sequence of strings or integers.

indexonparent
assert that the current tag is (or is not) the nth child of its parent tag (starting from 0)

id
assert that the current tag has (or doesn't have) an id attribute with the given value

cls
same as id, but checks the 'class' attribute

tag
assert that the current tag has (or doesn't have) the given tag name

examples:

check.has(id='something')
will match '<anything id="something">'

check.has(id=('something', 'something_else'))
will match '<anything id="something">' and '<anything id="something_else">'

check.doesnthave(id='something')
will match everything that doesn't have an id of 'something'

Not So Simple Checkers

attrs
accepts a dictionary of attributename:attributevalue, and checks that the attributes of the current tag match (or don't match) the given attribute values. note that only the given attributes are checked, e.g. check.has(tag='td', attrs=dict(x='x', y='y')) will match '<td x="x" y="y" somethingunrelated="abc">'. the values in the dictionary can be either strings, sequences of strings, regular expression objects or functions that take a string and return a boolean

allchildren
accepts a function and asserts that it returns True across all children of the current tag

onlytext
assert that the current tag has only one child node, which is a text node, whose contents match (or don't match, if you are using check.doesnthave) the given value. the value can be a string, a regular expression object, or a callable that returns a boolean

examples:

check.has(tag='span', onlytext='abc')
will match '<span>abc</span>'

check.has(tag='span', onlytext=re.compile(r'\d+,\d+,\d+'))
will match '<span>1,2,3</span>' and '<span>3,2,1</span>', etc

check.has(tag='span', onlytext=lambda text: True)
will match a span element that contains any one text node, no matter what characters it is composed of

alltext
probably useless - it checks that all of the text nodes that are children of the current node match (or don't match) the given value

examples:

check.has(tag='span', alltext='HELLO!')
will match '<span><b>HELLO!</b><b>HELLO!</b></span>'

check.has(tag='span', alltext=lambda text: text.startswith('H'))
will match '<span><b>HELLO!</b><b>HOWDY</b></span>', etc

check.doesnthave(alltext=re.compile('cheese'))
will match any element which doesn't have any descendant text nodes that contain 'cheese'

Special Resolvers

special.collect(keyname, alltext=False, onlytext=False)
if alltext is True, all descendant text nodes of the current element will be stored in a list under the key keyname in a dictionary which can be accessed via the collection instance attribute of the Atropine instance returned by the current resolve call. the same goes for onlytext, except it will be asserted that the current node contains only one element (which is a text node) and the value of that node will be stored (as a string).
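
to show the kind of list that alltext=True produces for the earnings example, here is a sketch that uses the standard library's html.parser in place of atropine and BeautifulSoup. TextCollector is a hypothetical class, and stripping surrounding whitespace from each text node is an assumption made here for readability:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical collector: gather every text node, skipping whitespace-only ones."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        # whitespace-only nodes are dropped, mirroring ignorewhitespace=True
        if data.strip():
            self.texts.append(data.strip())


collector = TextCollector()
collector.feed('''
  <td class="dataTableCell">
    <span class="unhelpfulClassName">GBP</span>
    <span class="unhelpfulClassName">123.45</span>
  </td>''')
collector.close()

# two text nodes survive, so they unpack just like the 'earnings-info' collection
currency, amount = collector.texts
print(currency, amount)
```

unlike special.collect, this sketch gathers text from the whole fragment it is fed, so it only matches the example because the fragment is the data cell itself.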