The first error occurs in the following line:
from nltk.book import *

File "...\lib\site-packages\nltk\misc\wordfinder.py", line 12, in <module>
    from string import strip
ImportError: cannot import name strip

That is one of the basic errors we will run across multiple times. "strip" is no longer in the string module; instead it is now a method on string objects (it is available as such in Python 2.7 as well, at least). Now, we simply remove the offending line #12 and change line #83 in wordfinder.py from
word = strip(word).upper() # normalize

to
word = word.upper().strip() # normalize
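That change is easy to verify interactively (a minimal sketch, not part of the original fix):

# Python 3 dropped the function versions from the string module;
# strip() and upper() live on str objects instead.
import string
print(hasattr(string, 'strip'))         # False on Python 3, True on Python 2
print("  monstrous  ".upper().strip())  # 'MONSTROUS' -- the method form used in the fix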
That corrected, the import statement now raises the next error while trying to import Sense and Sensibility by Jane Austen 1811.
File "...\lib\site-packages\nltk\data.py", line 1091, in _read bytes = self.bytebuffer + new_bytes TypeError: Can't convert 'bytes' object to str implicitlyThat is the second and probably most often occuring error. As unicode was turned into string and string turned into bytes, some conversions do not work automatically any longer. The stream reader returns sometimes bytes, sometimes strings (this is one of the things that should be normalized later on, but let's just fix the method for now). Now, we might be lured into decoding new_bytes into a string everytime it is a bytes object, but that just raises a new error when nltk actually decodes it. So let us make sure that the bytebuffer is always in a byte format. We just insert before line #1091 in data.py
try:
    self.bytebuffer = self.bytebuffer.encode()
except:
    pass
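The underlying incompatibility is easy to reproduce outside nltk (a minimal sketch; the variable names are made up):

# Python 3 refuses to mix str and bytes implicitly.
bytebuffer = ""                   # a str where bytes are expected, as in data.py
new_bytes = b"Sense and Sensibility"
try:
    bytebuffer + new_bytes        # TypeError (the exact wording varies by version)
except TypeError as e:
    print(e)
bytebuffer = bytebuffer.encode()  # normalize the buffer to bytes first...
print(bytebuffer + new_bytes)     # ...and concatenation works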
And on to the next error.
File "...\lib\site-packages\nltk\corpus\reader\plaintext.py", line 138, in _read_word_block words.extend(self._word_tokenizer.tokenize(stream.readline().decode())) AttributeError: 'str' object has no attribute 'decode'No worries here, it simply someties gets a string object, and sometimes a bytes object. Let's just make sure both cases are handled, by removing line #138 in plaintext.py and inserting in its place the following:
readlines = stream.readline()
try:
    readlines = readlines.decode()
except:
    pass
words.extend(self._word_tokenizer.tokenize(readlines))
That is, admittedly, even uglier than the previous fix. Just bear with it for now; I fully intend to come back later. At the very least, the except clause should not catch every possible error. And on to the next one. This error message should now look somewhat familiar.
File "...\lib\site-packages\nltk\tokenize\regexp.py", line 102, in tokenize return self._regexp.findall(text) TypeError: can't use a string pattern on a bytes-like objectYes, bytes vs. string again. Just insert before the mentioned line #102 in regexp.py the following:
try:
    text = text.decode()
except AttributeError:
    pass
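The mismatch the tokenizer runs into can be reproduced directly with the re module (a minimal sketch; the pattern and text are made up):

import re

pattern = re.compile(r'\w+')            # a str pattern, like nltk's _regexp
data = b'how long before the next flight'
try:
    pattern.findall(data)               # bytes input -> TypeError
except TypeError as e:
    print(e)
print(pattern.findall(data.decode()))   # decode first, then matching works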
At least now only the right exception will be caught (and not handled, of course). But next we get an error I did not expect.
File "...\lib\site-packages\nltk\data.py", line 1002, in _char_seek_forward newbytes = self.stream.read(est_bytes-len(bytes)) TypeError: 'float' object cannot be interpreted as an integerI thought this would be caught by 2to3. The reason for this error is the "new" division handling in Python 3: Dividing an int by an int will now return a float if appropriate instead of rounding to the next int. Simply change line #1002 to
newbytes = self.stream.read(int(est_bytes-len(bytes)))
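The division change is quickly demonstrated on its own (a minimal sketch; the numbers are arbitrary):

# Python 2: int / int rounds down. Python 3: true division yields a float.
print(7 / 2)       # 3.5 on Python 3 (3 on Python 2)
print(7 // 2)      # 3 -- floor division restores the old behaviour
print(int(7 / 2))  # 3 -- the explicit cast used in the fix above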
On to the next one.
File "...\lib\site-packages\nltk\data.py", line 1003, in _char_seek_forward bytes += newbytes TypeError: Can't convert 'bytes' object to str implicitlyAgain, this one we know already. Just convert explicitly. Change line #1003 in data.py into the following:
try:
    bytes = bytes.encode()
except AttributeError:
    pass
bytes += newbytes
File "...\lib\site-packages\nltk\tokenize\regexp.py", line 103, in tokenize text = text.decode() UnicodeDecodeError: 'utf8' codec can't decode byte 0xa1 in position 309: invalid start byteNow its starting to get interesting. Here we need to tell the decoder not be so strict while trying to decode. Change the line in regexp.py into:
text = text.decode('utf-8','ignore')

We could as well have used 'latin-1' instead of 'utf-8', but we will come to that in a later post.
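The available error handlers are worth a quick comparison (a minimal sketch; the stray 0xa1 byte mirrors the one from the traceback):

raw = b'caf\xa1'                       # not valid UTF-8
try:
    raw.decode('utf-8')                # the default handler is 'strict'
except UnicodeDecodeError as e:
    print(e)
print(raw.decode('utf-8', 'ignore'))   # 'caf' -- the offending byte is dropped
print(raw.decode('utf-8', 'replace'))  # 'caf' plus the U+FFFD replacement character
print(raw.decode('latin-1'))           # latin-1 maps every byte, so it never fails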
File "...\lib\site-packages\nltk\corpus\reader\xmldocs.py", line 167, in _detect_encoding m = re.match(r'\s*<?xml\b.*\bencoding="([^"]+)"', s) File "...\lib\re.py", line 153, in match return _compile(pattern, flags).match(string) TypeError: can't use a string pattern on a bytes-like objectI am not sure I found the best solution here, but insert before line #167 in xmldocs.py:
try:
    s = s.decode()
except AttributeError:
    pass
File "...\lib\site-packages\nltk\corpus\reader\util.py", line 609, in read_regexp_block if re.match(start_re, line): break File "...\lib\re.py", line 153, in match return _compile(pattern, flags).match(string) TypeError: can't use a string pattern on a bytes-like objectWe are quite close to the end of, at least, this function. Insert before line #609 in util.py
try:
    line = line.decode()
except AttributeError:
    pass

The same error will occur a couple of lines below, as line is read in again. Insert after line #619 (line = stream.readline()) the following
try:
    line = line.decode()
except AttributeError:
    pass
Now the books should all load correctly. But we are not done yet. Following the book further leads us to the following line.
text1.concordance("monstrous") File "...\lib\site-packages\nltk\text.py", line 193, in print_concordance left = (' ' * half_width + TypeError: can't multiply sequence by non-int of type 'float'Well, this error occured already. Division results a non-integer sometimes, so just change line #183 and line #184 in text.py from
half_width = (width - len(word) - 2) / 2
context = width/4 # approx number of words of context

to
half_width = int((width - len(word) - 2) / 2)
context = int(width/4) # approx number of words of context
File "...\lib\site-packages\nltk\text.py", line 52, in <listcomp> tokens = [t for t in tokens if list(filter(t))]Another one that should have been automatically fixed. Ah, well. Just change line #52 in text.py from
tokens = [t for t in tokens if list(filter(t))]

to
tokens = [t for t in tokens if filter(t)]
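The likely culprit is 2to3's filter fixer: since the builtin filter() returns an iterator in Python 3, the converter wraps such calls in list(...), and it cannot tell that filter here is a plain one-argument predicate passed in as a parameter (a minimal sketch; the predicate is made up):

def keep(t):                 # hypothetical stand-in for the filter parameter
    return t.isalpha()

tokens = ['Call', 'me', 'Ishmael', '.']
filter = keep                # the name shadows the builtin, as in text.py
print([t for t in tokens if filter(t)])     # works: ['Call', 'me', 'Ishmael']
try:
    [t for t in tokens if list(filter(t))]  # list(True) is not iterable
except TypeError as e:
    print(e)

The next steps should all work smoothly, until we start the machine translation part.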
babelize_shell()

File "...\lib\site-packages\nltk\misc\babelfish.py", line 166, in babelize_shell
    command = eval(input('Babel> '))
File "<string>", line 1
    how long before the next flight to Alice Springs?
                                                     ^
SyntaxError: invalid syntax

I don't know how that got into the code (most likely via 2to3, since Python 2's input() evaluated whatever was typed), but simply change line #166 in babelfish.py to
command = input('Babel> ')

Now, while we are here, make the following changes in the same file. Before line #105 (response = urllib.request.urlopen(...) insert
try:
    params = params.encode()
except AttributeError:
    pass

and a couple of lines later, after
html = response.read()

insert
try:
    html = html.decode()
except AttributeError:
    pass
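The same encode-before-sending, decode-after-reading pattern applies to any urllib use in Python 3 (a minimal sketch; the URL and form fields are made up, not babelfish's actual ones):

import urllib.parse
import urllib.request

# urlopen() expects the POST body as bytes, and read() returns bytes.
params = urllib.parse.urlencode({'text': 'hello', 'lang': 'en_de'})
response = urllib.request.urlopen('http://example.com/translate',
                                  data=params.encode())
html = response.read().decode()  # back to str before any string handling
print(html[:80])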
That should take care of chapter one. Note: I did not fix the chatbots, as they are rather tangential here. The errors there should be about the same as the ones in babelfish, but we will come back to them later (much later).
Here is a zip with the changed files.