Sunday, December 18, 2011

Using nltk with Python 3 (11)

And on to chapter 11: Managing Linguistic Data
It would be nice if the book could be updated at some point, as some namespaces are no longer correct. Things still work, though; sometimes it is just a matter of finding the correct name.
>>> timitdict = nltk.corpus.timit.transcription_dict()
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\corpus\reader\timit.py", line 233, in transcription_dict
    m = re.match(r'\s*(\S+)\s+/(.*)/\s*$', line)
  File "...\lib\re.py", line 153, in match
    return _compile(pattern, flags).match(string)
TypeError: can't use a string pattern on a bytes-like object
Almost looks like an old friend now. Open corpus/reader/timit.py and add this before line #233:
try:
    line = line.decode('utf-8','ignore')
except:
    pass
Add the same before line #269:
try:
    line = line.decode('utf-8','ignore')
except:
    pass
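
The root cause is the familiar one: the corpus stream delivers bytes, while the pattern in transcription_dict() is a str. A standalone illustration of the mismatch and of what the decode works around (the sample line is made up, but has the shape the pattern expects):

import re

pattern = r'\s*(\S+)\s+/(.*)/\s*$'   # the str pattern quoted in the traceback above
line = b'word  /w er d/\n'           # bytes, as read from the corpus stream

# re.match(pattern, line) raises:
#   TypeError: can't use a string pattern on a bytes-like object
m = re.match(pattern, line.decode('utf-8', 'ignore'))
print(m.groups())                    # ('word', 'w er d')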

>>> lexicon = toolbox.xml('rotokas.dic')
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\toolbox\toolbox.py", line 72, in raw_fields
    mkr, line_value = mobj.groups()
AttributeError: 'NoneType' object has no attribute 'groups'
Well, this proved to be quite an adventure. In hindsight it is, of course, simple: the regexes no longer match and newlines are not handled correctly, so the tree never gets filled. Here is the replacement body of raw_fields(self) in toolbox/toolbox.py, which starts at line #58. Just replace the whole function; not everything has changed, of course, but replacing it wholesale is easier than patching individual lines.
join_string = '\n'
# discard a UTF-8 BOM in the first line
first_line_pat = re.compile(br'^(?:\xef\xbb\xbf)?(?:\\(\S+)\s*)?(.*)$')
line_pat = re.compile(br'^(?:\\(\S+)\s*)?(.*)$')
# need to get first line outside the loop for correct handling
# of the first marker if it spans multiple lines
file_iter = iter(self._file)
line = next(file_iter)
mobj = re.match(first_line_pat, line)
mkr, line_value = mobj.groups()
mkr = mkr.decode('utf-8','ignore')
line_value = line_value.decode('utf-8','ignore')
value_lines = [line_value,]
self.line_num = 0
for line in file_iter:
    line = line.replace(b'\n',b'')
    self.line_num += 1
    mobj = re.match(line_pat, line)
    try:
        line_mkr, line_value = mobj.groups()
        line_mkr = line_mkr.decode('utf-8','ignore')
        line_value = line_value.decode('utf-8','ignore')
    except AttributeError:
        line_mkr = False
        line_value = line.decode('utf-8','ignore')
    if line_mkr:
        yield (mkr, join_string.join(value_lines))
        mkr = line_mkr
        value_lines = [line_value,]
    else:
        value_lines.append(line_value)
self.line_num += 1
yield (mkr, join_string.join(value_lines))
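
For reference, here is what the bytes-level pattern does with typical Toolbox input; marker and value come back as bytes and are decoded right afterwards (a standalone sketch, the sample lines are only illustrative, \lx being the standard lexeme marker):

import re

line_pat = re.compile(br'^(?:\\(\S+)\s*)?(.*)$')

# a line that starts a field: marker and value are returned as bytes
mkr, value = line_pat.match(b'\\lx kaa').groups()
print(mkr, value)        # b'lx' b'kaa'

# a continuation line carries no marker, so the first group is None
mkr, value = line_pat.match(b'continuation of the previous value').groups()
print(mkr, value)        # None b'continuation of the previous value'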
The same goes for def fields(...): in the same file, beginning at line #119. Just replace the whole function, even though the change is minimal.
if encoding is None and unicode_fields is not None:
    raise ValueError('unicode_fields is set but not encoding.')
unwrap_pat = re.compile(r'\n+')
for mkr, val in self.raw_fields():
    if encoding:
        if unicode_fields is not None and mkr in unicode_fields:
            try:
                val = val.decode('utf8', errors)
            except:
                pass
        else:
            try:
                val = val.decode(encoding, errors)
            except:
                pass
        try:
            mkr = mkr.decode(encoding, errors)
        except:
            pass
    if unwrap:
        val = unwrap_pat.sub(' ', val)
    if strip:
        val = val.rstrip()
    yield (mkr, val)
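
With both functions replaced, the example from the book runs again. A quick check, assuming the rotokas.dic sample data is installed and the toolbox corpus reader is imported as in the book:

>>> from nltk.corpus import toolbox
>>> lexicon = toolbox.xml('rotokas.dic')
>>> len(lexicon)     # number of records parsed; the exact count is not asserted here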
That was that.
