Friday, December 16, 2011

Using nltk with Python 3 (2)

This time we will update nltk so that the code samples from chapter two work fully.

The first error turns up as soon as the for loop is run:
for fileid in gutenberg.fileids():
    ...

The error comes from corpus/reader/util.py. Change line #570 to
line = stream.readline().decode('utf-8','ignore')
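With that change in place, the loop from the book runs again. A quick way to check (just a sketch along the lines of the chapter-two example, not the exact book code):
from nltk.corpus import gutenberg

for fileid in gutenberg.fileids():
    num_chars = len(gutenberg.raw(fileid))    # character count per file
    num_words = len(gutenberg.words(fileid))  # token count per file
    print(fileid, num_chars, num_words)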

The next error occurs while opening the corpora in other languages.
nltk.corpus.cess_esp.words()
File "...\lib\site-packages\nltk\corpus\reader\util.py", line 621, in read_regexp_block
    line = line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 31: invalid continuation byte
Not really surprising: same file, same error. Change line #621 to
line = line.decode('utf-8','ignore')
While you are at it, change line #610 to
line = line.decode('utf-8','ignore')

Next error, next line:
nltk.corpus.indian.words('hindi.pos')
File "...\lib\site-packages\nltk\corpus\reader\indian.py", line 76, in read_block
    if line.startswith('<'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
The error message is confusing at first sight; the real culprit is that line is still a bytes object, so the str argument '<' is what trips it up. But still, no problems here; add after line #75 in indian.py
try:
    line = line.decode()
except AttributeError:
    pass
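By the way, this try/except construction is the pattern used in almost every fix below: if the object is bytes, decode it; if it is already a str, the .decode() call raises AttributeError and the object is left untouched. A minimal standalone illustration (the helper name is mine, not nltk's):
def as_text(line):
    # bytes -> str; a Python 3 str has no .decode(), so it falls through unchanged
    try:
        return line.decode('utf-8', 'ignore')
    except AttributeError:
        return line

print(as_text(b'caf\xc3\xa9'))    # prints 'café'
print(as_text('already a str'))   # unchanged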

Now we can go on until the Wordlist Corpora section.
unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
File "...\lib\site-packages\nltk\corpus\reader\util.py", line 461, in concat
   raise ValueError("Don't know how to concatenate types: %r" % types)
ValueError: Don't know how to concatenate types: {<class 'bytes'>}
This one is a bit tricky. Still, no worries. Add after line #423 in util.py, so it looks like this:
newdocs = []
try:
    for item in docs:
        newdocs.append(item.decode('utf-8','ignore'))
    docs = newdocs
except AttributeError:
    pass
types = set([d.__class__ for d in docs])
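For reference, unusual_words is the helper defined earlier in the chapter; roughly this (reproduced from memory, so treat it as a sketch):
import nltk

def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    return sorted(text_vocab - english_vocab)
After the change above, the call should go through and return ordinary strings again.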

The stopwords example seems not to work correctly, but as it is the same as in Python 2, I will not touch it for now.
A Pronouncing Dictionary raises the next error:
entries = nltk.corpus.cmudict.entries()
len(entries)
Traceback (most recent call last):
File "...\lib\site-packages\nltk\corpus\reader\cmudict.py", line 92, in read_cmudict_block
   entries.append( (pieces[0].lower(), pieces[2:]) )
IndexError: list index out of range
Even though it does not look like it, it is again the byte-to-string conversion. In cmudict.py, add after line #89
try:
    line = line.decode('utf-8','ignore')
except AttributeError:
    pass
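A quick sanity check after the change (a sketch; the exact numbers depend on the cmudict version):
import nltk

entries = nltk.corpus.cmudict.entries()
print(len(entries))            # well over 100,000 entries
for word, pron in entries[:3]:
    print(word, pron)          # (word, list of phoneme strings) pairs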

The Comparative Wordlists section runs, but not as intended: the dictionary entries come back as bytes objects instead of strings.
Here a different kind of fix is needed. It might even take care of some of the errors encountered earlier; if so, I will merge the fixes at some point in the future.
In tokenize/simple.py, change the last function "line_tokenize" into this:
def line_tokenize(text, blanklines='discard'):
    try:
        text = text.decode('utf-8','ignore')
    except AttributeError:
        pass
    return LineTokenizer(blanklines).tokenize(text)
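With line_tokenize returning real strings, the comparative wordlist example from the book behaves as intended again; a small check (a sketch, output from memory):
from nltk.corpus import swadesh

fr2en = dict(swadesh.entries(['fr', 'en']))   # French -> English word pairs
print(fr2en['chien'])                         # should print 'dog'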

toolbox.entries('rotokas.dic')
Traceback (most recent call last):
 File "...\lib\site-packages\nltk\toolbox\toolbox.py", line 67, in raw_fields
   mobj = re.match(first_line_pat, line)
TypeError: can't use a string pattern on a bytes-like object
Well, this one seems not to work under Python 2.7 either. No fix for now (the line is easy enough to convert, but that only raises further errors here later on).

wn.synsets('motorcar')
File "...\lib\site-packages\nltk\corpus\reader\wordnet.py", line 915, in _load_lemma_pos_offset_map
   if line.startswith(' '):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
Again. I thought we were done with this one. Add in wordnet.py after line #914
try:
    line = line.decode('utf-8','ignore')
except AttributeError:
    pass

File "...\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1095, in _synset_from_pos_and_line
   columns_str, gloss = data_file_line.split('|')
TypeError: Type str doesn't support the buffer API
Regardless of what the error message might imply, it is again a bytes object that has to be converted into a string. So, in the same file, after line #1089, add
try:
    data_file_line = data_file_line.decode('utf-8','ignore')
except AttributeError:
    pass
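The misleading message simply comes from calling split() on a bytes object with a str separator; a quick demonstration with a made-up data-file line:
raw = b'02958343 05 n 03 car 0 auto 0 automobile 0 | a motor vehicle'
# raw.split('|') would raise the TypeError above, because '|' is a str
columns_str, gloss = raw.decode('utf-8', 'ignore').split('|')
print(columns_str)
print(gloss)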

>>> right.path_similarity(minke)
Traceback (most recent call last):
 File "...\lib\site-packages\nltk\corpus\reader\wordnet.py", line 984, in get_version
   match = re.search(r'WordNet (\d+\.\d+) Copyright', line)
TypeError: can't use a string pattern on a bytes-like object
Again a boring error. In wordnet.py, after line #983, add
try:
    line = line.decode('utf-8','ignore')
except AttributeError:
    pass
>>> right.path_similarity(minke)
Traceback (most recent call last):
 File "...\lib\site-packages\nltk\corpus\reader\wordnet.py", line 551, in shortest_path_distance
   if path_distance < 0 or new_distance < path_distance:
TypeError: unorderable types: NoneType() < int()
An expected error: in Python 3, ordering comparisons between types without a meaningful order (here None and int) raise a TypeError. In the same file, simply change the offending line #551 to
if path_distance is None or new_distance < path_distance:
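After this change the similarity example from the book runs through; for instance (a sketch, value from memory):
from nltk.corpus import wordnet as wn

right = wn.synset('right_whale.n.01')
minke = wn.synset('minke_whale.n.01')
print(right.path_similarity(minke))   # should be around 0.25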
Well, that's it for today.
