It would be nice if the book could be updated some time in the future, as some namespaces are no longer correct. The examples still work, though; sometimes it is just a matter of finding the correct name.
>>> timitdict = nltk.corpus.timit.transcription_dict()
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\corpus\reader\timit.py", line 233, in transcription_dict
    m = re.match(r'\s*(\S+)\s+/(.*)/\s*$', line)
  File "...\lib\re.py", line 153, in match
    return _compile(pattern, flags).match(string)
TypeError: can't use a string pattern on a bytes-like object

Almost looks like an old friend now. Open corpus/reader/timit.py, add before line #233:

    try:
        line = line.decode('utf-8','ignore')
    except AttributeError:
        # line is already a str, nothing to decode
        pass

Add before line #269:

    try:
        line = line.decode('utf-8','ignore')
    except AttributeError:
        # line is already a str, nothing to decode
        pass
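To see why this guard helps, here is a minimal sketch outside of NLTK. The helper name to_text and the sample line are invented for illustration; only the regex is copied from the traceback above.

```python
import re

# Pattern copied from the traceback in timit.py; note that it is a str pattern.
PATTERN = r'\s*(\S+)\s+/(.*)/\s*$'

def to_text(line):
    """Return line as str, decoding it first if it arrived as bytes."""
    try:
        return line.decode('utf-8', 'ignore')
    except AttributeError:
        # str has no decode() in Python 3, so the line is already text
        return line

# Invented sample line in a "word /phones/" layout, read as bytes.
raw = b'she  /sh iy/\n'

try:
    re.match(PATTERN, raw)
except TypeError as err:
    # A str pattern cannot be applied to bytes
    print('without decoding:', err)

m = re.match(PATTERN, to_text(raw))
print(m.groups())
```

The same helper leaves a str untouched, so the guard is harmless if the reader already hands over decoded text.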
>>> lexicon = toolbox.xml('rotokas.dic')
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\toolbox\toolbox.py", line 72, in raw_fields
    mkr, line_value = mobj.groups()
AttributeError: 'NoneType' object has no attribute 'groups'

Well, this has proven to be quite an adventure. In hindsight, it is, of course, simple. Basically, the regexes do not work any more and newlines are not correctly considered, so the tree does not get filled. This is the body of raw_fields(self) in toolbox/toolbox.py, beginning at line #58. Just replace the whole function. Not everything has changed, of course, but this should prove easier.
    join_string = '\n'
    # discard a BOM in the first line (the UTF-8 BOM as raw bytes,
    # since \u escapes are not valid in bytes patterns)
    first_line_pat = re.compile(b'^(?:\xef\xbb\xbf)?(?:\\\\(\\S+)\\s*)?(.*)$')
    line_pat = re.compile(b'^(?:\\\\(\\S+)\\s*)?(.*)$')
    # need to get first line outside the loop for correct handling
    # of the first marker if it spans multiple lines
    file_iter = iter(self._file)
    line = next(file_iter)
    mobj = re.match(first_line_pat, line)
    mkr, line_value = mobj.groups()
    mkr = mkr.decode('utf-8','ignore')
    line_value = line_value.decode('utf-8','ignore')
    value_lines = [line_value,]
    self.line_num = 0
    for line in file_iter:
        # strip the newline so that (.*)$ can match the whole line
        line = line.replace(b'\n',b'')
        self.line_num += 1
        mobj = re.match(line_pat, line)
        try:
            line_mkr, line_value = mobj.groups()
            line_mkr = line_mkr.decode('utf-8','ignore')
            line_value = line_value.decode('utf-8','ignore')
        except AttributeError:
            # no match, or no marker on this line: treat it as a
            # continuation of the previous field
            line_mkr = False
            line_value = line.decode('utf-8','ignore')
        if line_mkr:
            yield (mkr, join_string.join(value_lines))
            mkr = line_mkr
            value_lines = [line_value,]
        else:
            value_lines.append(line_value)
    self.line_num += 1
    yield (mkr, join_string.join(value_lines))

And def fields(...): in the same file, beginning at line #119. Just replace the whole function, even though the change is minimal.
    if encoding is None and unicode_fields is not None:
        raise ValueError('unicode_fields is set but not encoding.')
    unwrap_pat = re.compile(r'\n+')
    for mkr, val in self.raw_fields():
        if encoding:
            if unicode_fields is not None and mkr in unicode_fields:
                val = val.decode('utf8', errors)
            else:
                try:
                    val = val.decode(encoding, errors)
                except AttributeError:
                    # val is already a str
                    pass
            try:
                mkr = mkr.decode(encoding, errors)
            except AttributeError:
                # mkr is already a str
                pass
        if unwrap:
            val = unwrap_pat.sub(' ', val)
        if strip:
            val = val.rstrip()
        yield (mkr, val)

That was that.
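For anyone who wants to try the idea without patching NLTK, the two steps above can be sketched as a standalone script. This is only an approximation under my own assumptions: parse_fields and the sample lines are invented, and the sketch simply drops any text that comes before the first marker. The regex is the same bytes pattern as in raw_fields.

```python
import re

# Same marker pattern as in raw_fields: an optional "\marker" followed by
# the rest of the line, matched against bytes.
LINE_PAT = re.compile(b'^(?:\\\\(\\S+)\\s*)?(.*)$')
# Same unwrap pattern as in fields(): runs of newlines become one space.
UNWRAP_PAT = re.compile(r'\n+')

def parse_fields(lines):
    """Yield (marker, value) pairs as str from an iterable of bytes lines."""
    mkr, value_lines = None, []
    for line in lines:
        # Strip the line ending so that (.*)$ covers the whole line
        line_mkr, line_value = LINE_PAT.match(line.rstrip(b'\r\n')).groups()
        text = line_value.decode('utf-8', 'ignore')
        if line_mkr is not None:
            # A new marker closes the previous field
            if mkr is not None:
                yield mkr, '\n'.join(value_lines)
            mkr = line_mkr.decode('utf-8', 'ignore')
            value_lines = [text]
        else:
            # No marker: continuation of the current field
            value_lines.append(text)
    if mkr is not None:
        yield mkr, '\n'.join(value_lines)

# Invented sample in Toolbox layout; the last field spans two lines.
sample = [
    b'\\lx kaa\n',
    b'\\ps N\n',
    b'\\ge first line\n',
    b'second line\n',
]

fields = [(m, UNWRAP_PAT.sub(' ', v).rstrip()) for m, v in parse_fields(sample)]
print(fields)
```

Decoding happens once, right after the bytes regex has split marker and value, so everything downstream only ever sees str.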