The first error should be raised around the time the for loop is run:
for fileid in gutenberg.fileids(): ...
The error is caused by corpus/reader/util.py. Change line #570 into
line = stream.readline().decode('utf-8','ignore')
The next error occurs while opening the other language corpora.
nltk.corpus.cess_esp.words()
  File "...\lib\site-packages\nltk\corpus\reader\util.py", line 621, in read_regexp_block
    line = line.decode()
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 31: invalid continuation byte
Not really surprising, same file, same error. Change line #621 into
line = line.decode('utf-8','ignore')
While you are at it, change line #610
line = line.decode('utf-8','ignore')
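To see what the 'ignore' handler actually does with the byte from the traceback, here is a small standalone sketch. The sample text is made up for illustration, and the guess that the corpus file is really Latin-1 encoded is an assumption, not something checked against the corpus itself.

```python
# Made-up sample: "café solo" encoded as Latin-1, where é is the single byte 0xe9.
raw = 'caf\u00e9 solo'.encode('latin-1')      # b'caf\xe9 solo'

# Strict UTF-8 decoding fails on 0xe9 followed by a non-continuation byte.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# 'ignore' silently drops the undecodable byte instead of raising.
ignored = raw.decode('utf-8', 'ignore')

# Decoding with the (assumed) real encoding keeps the character.
latin1 = raw.decode('latin-1')

print(strict_failed, repr(ignored), repr(latin1))
```

So the patch makes the error go away, but at the price of losing accented characters without any warning. If the files do turn out to be Latin-1, decoding with that codec would probably be the better fix.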
Next error, next line:
nltk.corpus.indian.words('hindi.pos')
  File "...\lib\site-packages\nltk\corpus\reader\indian.py", line 76, in read_block
    if line.startswith('<'):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
The error message looks as if it were exactly the other way around, but it is accurate: line is a bytes object, so its startswith method wants a bytes argument and gets the str '<' instead. Still, no problems here, add after line #75 in indian.py
try:
    line = line.decode()
except AttributeError:
    pass
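The same try/except pattern comes up again and again in this post, so here is a minimal sketch of it as a helper. The name to_text is hypothetical (it is not part of NLTK), and the sample line is made up.

```python
def to_text(line, encoding='utf-8'):
    """Return line as str; decode it first if it is a bytes object.

    Hypothetical helper for illustration, not an NLTK function.
    """
    try:
        return line.decode(encoding, 'ignore')
    except AttributeError:
        # str has no decode() in Python 3, so the value is already text.
        return line


raw_line = b'<s> some text'   # made-up sample line

# Calling a bytes method with a str argument is what raises the TypeError.
try:
    raw_line.startswith('<')
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(mixed_ok, to_text(raw_line).startswith('<'))
```

Because str objects pass through unchanged, the helper is safe to call on lines that are already text.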
Now we can go on until the Wordlist Corpora section.
unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))
  File "...\lib\site-packages\nltk\corpus\reader\util.py", line 461, in concat
    raise ValueError("Don't know how to concatenate types: %r" % types)
ValueError: Don't know how to concatenate types: {<class 'bytes'>}
This one is a bit tricky. Still, no worries. Add after line #423 in util.py, so it looks like this:
newdocs = []
try:
    for item in docs:
        newdocs.append(item.decode('utf-8','ignore'))
    docs = newdocs
except AttributeError:
    pass
types = set([d.__class__ for d in docs])
The stopwords example seems not to work correctly, but as it is the same as in Python 2, I will not touch it for now.
A Pronouncing Dictionary raises the next error:
entries = nltk.corpus.cmudict.entries()
len(entries)
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\corpus\reader\cmudict.py", line 92, in read_cmudict_block
    entries.append( (pieces[0].lower(), pieces[2:]) )
IndexError: list index out of range
Even though it does not look like it, it is again the byte-to-string conversion. In cmudict.py, add after line #89
try:
    line = line.decode('utf-8','ignore')
except AttributeError:
    pass
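Why would a bytes line end in an IndexError? One plausible explanation, sketched below, is that the reader detects end-of-file by comparing the line with an empty str. That is an assumption about the reader code, not a quote from it, and read_entries is a hypothetical stand-in with made-up sample data. In Python 3, b'' == '' is False, so such a check never fires and the empty line is split into an empty list.

```python
import io


def read_entries(stream):
    """Hypothetical reader loop, loosely modelled on the traceback above."""
    entries = []
    while len(entries) < 100:
        line = stream.readline()
        if line == '':            # end-of-file check written for str
            return entries
        pieces = line.split()
        entries.append((pieces[0].lower(), pieces[2:]))
    return entries


# With a text stream the end-of-file check works as intended.
text_entries = read_entries(io.StringIO('A 1 AH0\n'))

# With a byte stream, b'' never equals '', so pieces ends up empty.
try:
    read_entries(io.BytesIO(b'A 1 AH0\n'))
    bytes_failed = False
except IndexError:
    bytes_failed = True

print(text_entries, bytes_failed)
```

Decoding the line first turns the end-of-file marker back into '', which is why the patch above would cure an IndexError of this kind.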
The Comparative Wordlists section, while working, is not working as intended. The dictionary entries are in byte format, not string.
Here, some other measure must be taken. It might even fix some of the earlier encountered errors. If so, I will at some point in the future merge them.
In tokenize/simple.py, change the last function "line_tokenize" into this:
def line_tokenize(text, blanklines='discard'):
    try:
        text = text.decode('utf-8','ignore')
    except AttributeError:
        pass
    return LineTokenizer(blanklines).tokenize(text)
toolbox.entries('rotokas.dic')
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\toolbox\toolbox.py", line 67, in raw_fields
    mobj = re.match(first_line_pat, line)
TypeError: can't use a string pattern on a bytes-like object
Well, it seems not to work under Python 2.7 either. No fixes for now (the lines are easy to convert, but doing so raises further errors later on).
wn.synsets('motorcar')
  File "...\lib\site-packages\nltk\corpus\reader\wordnet.py", line 915, in _load_lemma_pos_offset_map
    if line.startswith(' '):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
Again. I thought we were done with this one. Add in wordnet.py after line #914
try:
    line = line.decode('utf-8','ignore')
except AttributeError:
    pass
  File "...\lib\site-packages\nltk\corpus\reader\wordnet.py", line 1095, in _synset_from_pos_and_line
    columns_str, gloss = data_file_line.split('|')
TypeError: Type str doesn't support the buffer API
Regardless of what the error message might imply, again a bytes object has to be converted into a string. So, in the same file, add after line #1089
try:
    data_file_line = data_file_line.decode('utf-8','ignore')
except AttributeError:
    pass
>>> right.path_similarity(minke)
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\corpus\reader\wordnet.py", line 984, in get_version
    match = re.search(r'WordNet (\d+\.\d+) Copyright', line)
TypeError: can't use a string pattern on a bytes-like object
Again a boring error. In wordnet.py, after line #983, add
try:
    line = line.decode('utf-8','ignore')
except AttributeError:
    pass
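The "string pattern on a bytes-like object" message is easy to reproduce outside NLTK. The sketch below uses the pattern from the traceback on a made-up sample line (the real file contents are not quoted here) and shows that decoding first lets the match go through.

```python
import re

pattern = r'WordNet (\d+\.\d+) Copyright'   # str pattern, as in the traceback
sample = b'WordNet 3.0 Copyright'           # made-up bytes line for illustration

# A str pattern cannot be applied to bytes in Python 3.
try:
    re.search(pattern, sample)
    bytes_ok = True
except TypeError:
    bytes_ok = False

# After decoding, the same pattern matches and the group is a str.
match = re.search(pattern, sample.decode('utf-8', 'ignore'))
version = match.group(1)

print(bytes_ok, version)
```

A bytes pattern would also match the raw line, but then the captured group would be bytes as well, which only pushes the conversion further down the line.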
>>> right.path_similarity(minke)
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\corpus\reader\wordnet.py", line 551, in shortest_path_distance
    if path_distance < 0 or new_distance < path_distance:
TypeError: unorderable types: NoneType() < int()
An expected error: Python 3 no longer defines an ordering between unrelated types, and None in particular cannot be compared with a number. In the same file, simply change the offending line #551 to
if path_distance is None or new_distance < path_distance:
Well, that's it for today.
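A footnote to the last fix: the pattern behind it is a running minimum that starts out as None. Here is a minimal sketch of that idea. The function name shortest and the sample numbers are made up, and this is not the NLTK implementation.

```python
def shortest(distances):
    """Return the smallest value in distances, or None if there is none."""
    best = None
    for d in distances:
        # Test for None explicitly; None < d raises TypeError in Python 3.
        if best is None or d < best:
            best = d
    return best


# Ordering comparisons with None are an error in Python 3.
try:
    None < 0
    none_comparable = True
except TypeError:
    none_comparable = False

print(shortest([3, 1, 2]), shortest([]), none_comparable)
```

Under Python 2, None compared as smaller than any number, which is why the original check for a negative distance happened to work there.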