I somehow ended up writing an HTML parser for someone who wants to extract info for a bunch of scripts he's writing. The details of the scripts are unimportant, as is the parser (we're not even sure we're allowed to use them on the game they're written for, so they may never get written), but PHP is the implementation language of choice and the parser needs to do stuff that mere regexes can't handle in any sane way.

Since I couldn't find anything approaching a usable HTML parser in PHP, and the HTML is crufty enough that the XML parser won't work, I decided on Python and SGMLParser. That simplified writing the parser a great deal, but I needed a way to feed the data to PHP. The following code is the result of a couple of hours' hard work followed by an epiphany and some debugging.

I still need to make sure my escape character handling isn't broken. This will require some thought and experimentation that I currently don't have time for. Any input is welcome.

No change on the escape character handling, although I should probably strip one layer of escaping, since it's Python adding it in the first place. Also, I fixed a bug that kept empty lists/dicts from being added to the array. It turns out that, for some silly reason, array() == NULL evaluates to true. So now I explicitly test for arrayness.
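
To make the array() == NULL gotcha concrete, here is a minimal sketch of the three checks side by side. The variable names are just for illustration and aren't part of the parser.

```php
<?php
// An empty array stands in for an empty Python list or dict.
$empty = array();

// Loose comparison converts both sides to booleans, and an empty
// array is falsy, so it compares equal to NULL.
$looseEqual = ($empty == NULL);

// Strict comparison also checks the type, so it tells them apart.
$strictEqual = ($empty === NULL);

// The explicit type test is what the parser relies on.
$isArray = is_array($empty);

var_dump($looseEqual);   // bool(true)
var_dump($strictEqual);  // bool(false)
var_dump($isArray);      // bool(true)
```

This is why "no value seen yet" and "empty list" can't be told apart with == alone.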

<?php
    function findMatchingBracket($text)
    {
        /*********************************************************************
        This function scans the given string for a closing bracket that
        matches the first character.  For any of (), [], {} and <> this will
        be the appropriate closing bracket, otherwise it will be the first
        recurrence of the given character.

        For all bracket types, brackets inside strings (delimited by "" and
        '') are ignored. For brackets that have different opening and closing
        characters it also handles nesting.
        *********************************************************************/
        $bracket = $text[0];
        switch ($bracket) {
            case '(': $endbr = ')'; break;
            case '[': $endbr = ']'; break;
            case '{': $endbr = '}'; break;
            case '<': $endbr = '>'; break;
            default: $endbr = $bracket;
        }
        $index = 1;
        $count = 0;
        $instr = false;
        $escape = false;    // must be initialised before the loop reads it
        while ($index < strlen($text)) {
            $char = $text[$index];
            if ($instr) {
                switch ($char) {
                    case '\\':
                        if (!$escape) {
                            $escape = true;
                        } else {
                            $escape = false;
                        }
                        break;
                    case $instr:
                        if (!$escape) {
                            $instr = false;
                        } else {
                            $escape = false;
                        }
                        break;
                    default:
                        $escape = false;
                }
            } else {
                switch ($char) {
                    case $endbr: $count--; break;
                    case $bracket: $count++; break;
                    case '\'':
                    case '"':
                        $instr = $char;
                        break;
                }
            }
            if ($count < 0) return $index;
            $index++;
        }
        return NULL;
    }

    function parsePythonTypes($text, $level)
    {
        /*********************************************************************
        This function takes in a string representation of a python
        nested-dicts-and-lists data structure and converts it to an equivalent
        php array.  For dicts, the array indexes are the dict keys.  For
        lists, the array indexes are zero-based numeric.
        *********************************************************************/
        switch ($text[0]) {
            case '{': $type = 'dict'; break;
            case '[': $type = 'list'; break;
            default: return 'Error: Unknown data type';
        }
        $text = substr($text, 1, findMatchingBracket($text)-1);
        $current = array();
        $index = 0;
        $key = 0;
        $value = NULL;
        $str = NULL;
        $instr = false;
        $escape = false;
        while ($index < strlen($text)) {
            $char = $text[$index];
            if ($instr) {
                switch ($char) {
                    case '\\':
                        if (!$escape) {
                            $escape = true;
                        } else {
                            $escape = false;
                        }
                        $str .= $char;
                        break;
                    case $instr:
                        if (!$escape) {
                            $instr = false;
                        } else {
                            $escape = false;
                            $str .= $char;
                        }
                        break;
                    default:
                        $str .= $char;
                        $escape = false;
                }
            } else {
                switch ($text[$index]) {
                    case '{':
                    case '[':
                        $substr = substr($text, $index);
                        $substr = substr($substr, 0,
                                findMatchingBracket($substr)+1);
                        $value = parsePythonTypes($substr, $level+1);
                        $index += strlen($substr)-1;
                        break;
                    case '\'':
                    case '"':
                        $instr = $char;
                        $str = '';
                        break;
                    case ':':
                        $key = $str;
                        break;
                    case ',':
                        if (!is_array($value) and $value == NULL) {
                            $value = $str;
                        }
                        $current[$key] = $value;
                        if ($type == 'list') $key++;
                        $value = NULL;
                        $str = NULL;
                        break;
                }
            }
            $index++;
        }
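        // Flush the final element, which has no trailing comma.  Doing it
        // in one place avoids adding a trailing nested list/dict twice.
        if (!is_array($value) and $value == NULL) {
            $value = $str;
        }
        // Strict test: keeps empty arrays and empty strings, skips "nothing".
        if ($value !== NULL) {
            $current[$key] = $value;
        }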
        return $current;
    }
?>
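
On stripping one layer of escaping: PHP's built-in stripcslashes() might be enough for that. The sketch below is only an idea, not something the parser above does. unescapePythonString is a hypothetical helper name, and it assumes the strings use C-style escapes only (\\, \', \n, \xNN). Python's \uNNNN escapes are not something stripcslashes() understands, so those would need separate handling.

```php
<?php
// Hypothetical helper, not part of the parser above: undo one layer of
// C-style escaping as produced by Python's repr() for plain strings.
function unescapePythonString($raw)
{
    // stripcslashes() turns \\ into \, \' into ', \n into a newline,
    // and also decodes \xNN and octal escapes.
    return stripcslashes($raw);
}

// The raw text as the parser would collect it: it\'s a \\ test\n
// (written here as a PHP single-quoted literal).
$raw = 'it\\\'s a \\\\ test\\n';

$clean = unescapePythonString($raw);
echo $clean;
```

If it holds up, the call would go wherever $str is copied into $value or $key, so each string gets unescaped exactly once.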