Friday, December 16, 2011

Using nltk with Python 3 (1)

Let's start by working through chapter 1 of the official nltk book.


The first error occurs in the following line:
from nltk.book import *

File "...\lib\site-packages\nltk\misc\wordfinder.py", line 12, in <module>
    from string import strip
ImportError: cannot import name strip
That is one of the basic errors we will run across multiple times. "strip" is no longer in the string module; it is now a method on string objects (it is available as such in Python 2.7 as well, at least). So we simply remove the offending line #12 and change line #83 in wordfinder.py from
word = strip(word).upper()   # normalize
to
word = word.upper().strip() # normalize
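The difference is easy to check in plain Python, outside nltk (a quick sketch, nothing more):

```python
word = "  walrus \n"

# Python 2 allowed both forms; Python 3 keeps only the method:
#   from string import strip   <- gone in Python 3
#   strip(word)
# The method form works in both versions.
normalized = word.strip().upper()   # "WALRUS"
```

The order of strip() and upper() does not matter here; upper() never adds whitespace.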

That corrected, the import statement now raises the next error while trying to import Sense and Sensibility by Jane Austen 1811.

File "...\lib\site-packages\nltk\data.py", line 1091, in _read
    bytes = self.bytebuffer + new_bytes
TypeError: Can't convert 'bytes' object to str implicitly
That is the second and probably most frequently occurring error. As unicode was turned into str and str turned into bytes, some conversions no longer happen automatically. The stream reader sometimes returns bytes, sometimes strings (this is one of the things that should be normalized later on, but let's just fix the method for now). We might be lured into decoding new_bytes into a string every time it is a bytes object, but that just raises a new error when nltk actually decodes it. So let us make sure that the bytebuffer is always bytes. We just insert before line #1091 in data.py
try:
    self.bytebuffer = self.bytebuffer.encode()
except:
    pass
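The underlying problem is easy to reproduce without nltk (a minimal sketch of the same normalize-then-concatenate idea):

```python
buffer = ""          # what self.bytebuffer sometimes holds: a str
new_bytes = b"data"  # what the stream handed back: bytes

try:
    combined = buffer + new_bytes   # str + bytes raises TypeError in Python 3
except TypeError:
    buffer = buffer.encode()        # normalize the buffer to bytes first
    combined = buffer + new_bytes   # b"" + b"data" -> b"data"
```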

And on to the next error.

File "...\lib\site-packages\nltk\corpus\reader\plaintext.py", line 138, in _read_word_block
    words.extend(self._word_tokenizer.tokenize(stream.readline().decode()))
AttributeError: 'str' object has no attribute 'decode'
No worries here: the method simply gets sometimes a string object and sometimes a bytes object. Let's make sure both cases are handled by removing line #138 in plaintext.py and inserting in its place the following:
readlines = stream.readline()
try:
    readlines = readlines.decode()
except:
    pass
words.extend(self._word_tokenizer.tokenize(readlines))

That is, admittedly, even uglier than the previous fix. Just bear with it for now; I fully intend to come back later. At the very least, the except clause should not catch every possible error. On to the next one. This error message should look somewhat familiar.

File "...\lib\site-packages\nltk\tokenize\regexp.py", line 102, in tokenize
    return self._regexp.findall(text)
TypeError: can't use a string pattern on a bytes-like object
Yes, bytes vs. string again. Just insert the following before the mentioned line #102 in regexp.py:
try:
    text = text.decode()
except AttributeError:
    pass

At least now only the right exception will be caught (and not handled, of course). But next we get an error I did not expect.

File "...\lib\site-packages\nltk\data.py", line 1002, in _char_seek_forward
    newbytes = self.stream.read(est_bytes-len(bytes))
TypeError: 'float' object cannot be interpreted as an integer
I thought this would be caught by 2to3. The reason for this error is the "new" division handling in Python 3: dividing an int by an int now returns a float where appropriate instead of flooring to an int. Simply change line #1002 to
newbytes = self.stream.read(int(est_bytes-len(bytes)))
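The division change is easy to verify in plain Python (the variable names below just mirror the ones in data.py):

```python
est_bytes = 10
consumed = b"abc"

# Python 2: 7 / 2 == 3 (an int). Python 3: 7 / 2 == 3.5 (a float).
ratio = 7 / 2      # 3.5
floored = 7 // 2   # 3 -- floor division keeps the Python 2 behaviour

# file.read() insists on an int argument, hence the explicit cast:
to_read = int(est_bytes - len(consumed))   # 7
```

Using // at the point where the division happens would be the cleaner long-term fix.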

On to the next one.

File "...\lib\site-packages\nltk\data.py", line 1003, in _char_seek_forward
    bytes += newbytes
TypeError: Can't convert 'bytes' object to str implicitly
Again, this one we know already. Just convert explicitly. Change line #1003 in data.py into the following:
try:
    bytes = bytes.encode()
except AttributeError:
    pass
bytes += newbytes

File "...\lib\site-packages\nltk\tokenize\regexp.py", line 103, in tokenize
    text = text.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa1 in position 309: invalid start byte
Now it's starting to get interesting. Here we need to tell the decoder not to be so strict while decoding. Change the line in regexp.py into:
text = text.decode('utf-8','ignore')
We could just as well have used 'latin-1' instead of 'utf-8', but we will come to that in a later post.
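The difference between the strategies can be sketched with the very byte from the traceback, 0xa1:

```python
data = b"caf\xa1"   # 0xa1 is not a valid UTF-8 start byte

# Strict decoding (the default) raises the error from the traceback:
try:
    data.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# 'ignore' silently drops the undecodable byte:
lenient = data.decode("utf-8", "ignore")   # 'caf'

# 'latin-1' maps every byte to a character, so nothing is dropped:
fallback = data.decode("latin-1")          # 'caf¡'
```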

File "...\lib\site-packages\nltk\corpus\reader\xmldocs.py", line 167, in _detect_encoding
    m = re.match(r'\s*<?xml\b.*\bencoding="([^"]+)"', s)
File "...\lib\re.py", line 153, in match
    return _compile(pattern, flags).match(string)
TypeError: can't use a string pattern on a bytes-like object
I am not sure I found the best solution here, but insert before line #167 in xmldocs.py:
try:
    s = s.decode()
except AttributeError:
    pass

File "...\lib\site-packages\nltk\corpus\reader\util.py", line 609, in read_regexp_block
    if re.match(start_re, line): break
  File "...\lib\re.py", line 153, in match
    return _compile(pattern, flags).match(string)
TypeError: can't use a string pattern on a bytes-like object
We are quite close to the end of, at least, this function. Insert before line #609 in util.py
try:
    line = line.decode()
except AttributeError:
    pass
The same error will occur a couple of lines below, as line is read in again. Insert the following after line #619 (line = stream.readline()):
try:
    line = line.decode()
except AttributeError:
    pass
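Since this three-line pattern has now appeared in several files, it may be worth factoring into a helper at some point (my own suggestion, not part of nltk):

```python
def ensure_str(obj):
    """Return obj decoded to str if it is bytes, otherwise unchanged."""
    try:
        return obj.decode()
    except AttributeError:   # str objects have no .decode in Python 3
        return obj
```

Then each fix becomes a one-liner, e.g. line = ensure_str(line).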

Now the books should all load correctly. But we are not done yet. Following the book further leads us to the following line.

text1.concordance("monstrous")

File "...\lib\site-packages\nltk\text.py", line 193, in print_concordance
    left = (' ' * half_width +
TypeError: can't multiply sequence by non-int of type 'float'
Well, this error has occurred already: division sometimes results in a non-integer. Just change line #183 and line #184 in text.py from
half_width = (width - len(word) - 2) / 2
context = width/4 # approx number of words of context
to
half_width = int( (width - len(word) - 2) / 2)
context = int(width/4) # approx number of words of context

File "...\lib\site-packages\nltk\text.py", line 52, in <listcomp>
    tokens = [t for t in tokens if list(filter(t))]
Another one that should have been automatically fixed. Ah, well. Just change line #52 in text.py from
tokens = [t for t in tokens if list(filter(t))]
to
tokens = [t for t in tokens if filter(t)]
The next steps should all work smoothly, until we start the machine translation part.

babelize_shell()

File "...\lib\site-packages\nltk\misc\babelfish.py", line 166, in babelize_shell
    command = eval(input('Babel> '))
  File "<string>", line 1
    how long before the next flight to Alice Springs?
           ^
SyntaxError: invalid syntax 
That is 2to3's doing: Python 2's input() evaluated whatever was typed, so 2to3 faithfully wraps the new input() in eval(). We do not want the user's sentence evaluated as code, so simply change line #166 in babelfish.py to
command = input('Babel> ')
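The failure mode is easy to reproduce in plain Python, no nltk needed:

```python
line = "how long before the next flight to Alice Springs?"

# eval() tries to parse the English sentence as Python source:
try:
    eval(line)
    raised = False
except SyntaxError:
    raised = True   # the SyntaxError from the traceback

# input() alone returns the raw string, which is all the shell needs.
```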
Now, while we are here, make the following changes in the same file. Before line #105 (response = urllib.request.urlopen(...)) insert
try:
    params = params.encode()
except AttributeError:
    pass
and a couple of lines later, after
html = response.read()
insert
try:
    html = html.decode()
except AttributeError:
    pass
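The same encode-before-sending, decode-after-receiving pattern applies to any urllib.request call in Python 3. A generic sketch (the parameter names and the HTML are placeholders, not babelfish's actual API):

```python
from urllib.parse import urlencode

# urlopen's data argument must be bytes in Python 3, so encode the
# url-encoded parameters before the request goes out:
params = urlencode({"trtext": "hello world", "lp": "en_de"})
before = type(params).__name__          # 'str'

try:
    params = params.encode()
except AttributeError:
    pass                                # already bytes, nothing to do

after = type(params).__name__           # 'bytes'

# ...and response.read() hands back bytes, which we decode for parsing:
html = b"<html>...</html>"              # stand-in for response.read()
text = html.decode()
```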


That should take care of chapter one. Note: I did not fix the chatbots, as they are rather tangential here. The errors there should be about the same as the ones in babelfish, but we will come back to them later (much later).
Here is a zip with the changed files.
