Adventures of a Computer Scientist: Using nltk with Python 3 (5)

And on to some work with chapter five.

>>> nltk.pos_tag(text)
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\tag\__init__.py", line 64, in pos_tag
    tagger = nltk.data.load(_POS_TAGGER)
  File "...\lib\site-packages\nltk\data.py", line 594, in load
    resource_val = pickle.load(_open(resource_url))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)

Now we are finally getting to the root of the string/bytes problem. In data.py, change line #594 to:

try:
    resource_val = pickle.load(_open(resource_url))
except UnicodeDecodeError:
    # Pickle written by Python 2: decode its 8-bit strings as Latin-1.
    resource_val = pickle.load(_open(resource_url), fix_imports=True, encoding='latin-1', errors="ignore")
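To see why the fallback helps, here is a minimal sketch that reproduces the same error outside nltk. The helper name load_legacy_pickle and the hand-written pickle bytes are my own illustration, not part of nltk; the bytes mimic a Python 2 pickle holding an 8-bit string with the byte 0xcb.

```python
import io
import pickle

# Hand-written protocol 0 pickle of the Python 2 str '\xcb' (illustrative data).
PY2_PICKLE = b"S'\\xcb'\n."


def load_legacy_pickle(open_stream):
    """Load a pickle, retrying with Latin-1 if the default ASCII decoding fails.

    open_stream is a zero-argument callable returning a fresh binary stream,
    so the retry starts reading from the beginning again.
    """
    try:
        return pickle.load(open_stream())
    except UnicodeDecodeError:
        # Python 2 str objects are raw bytes; Latin-1 maps every byte to a character.
        return pickle.load(open_stream(), fix_imports=True,
                           encoding='latin-1', errors='ignore')


value = load_legacy_pickle(lambda: io.BytesIO(PY2_PICKLE))
```

The stream is opened a second time for the retry, just as the patched line calls _open(resource_url) again, because the first attempt has already consumed part of it.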

>>> nltk.corpus.sinica_treebank.tagged_words()
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\corpus\reader\sinica_treebank.py", line 60, in _read_block
    sent = IDENTIFIER.sub('', sent)
TypeError: can't use a string pattern on a bytes-like object

In corpus/reader/sinica_treebank.py, add after line #59:

try:
    sent = sent.decode()
except AttributeError:
    # Already a str: nothing to decode.
    pass
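The underlying rule is that a pattern compiled from a str can only be applied to str, never to bytes. A small sketch of that, assuming a made-up pattern in place of the reader's real IDENTIFIER regex and a helper name (strip_identifier) of my own:

```python
import re

# Hypothetical pattern standing in for the reader's IDENTIFIER regex.
IDENTIFIER = re.compile(r'^#\S+\s')


def strip_identifier(sent):
    """Remove the leading identifier, accepting either bytes or str."""
    if isinstance(sent, bytes):
        sent = sent.decode()  # UTF-8 is the default in Python 3
    return IDENTIFIER.sub('', sent)


# Applying the str pattern directly to bytes fails, as in the traceback.
try:
    IDENTIFIER.sub('', b'#1 word')
    raised = False
except TypeError:
    raised = True

cleaned = strip_identifier(b'#1 word')
```

Checking the type with isinstance does the same job as the try/except in the patch, it just makes the intent a bit more explicit.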

>>> nltk.corpus.conll2002.tagged_words()
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python322\lib\site-packages\nltk\corpus\reader\util.py", line 577, in read_blankline_block
    line = stream.readline().decode('utf-8','ignore')
AttributeError: 'str' object has no attribute 'decode'

In corpus/reader/util.py, change line #577 into:

line = stream.readline()
try:
    line = line.decode('utf-8','ignore')
except AttributeError:
    # Already a str (text-mode stream): nothing to decode.
    pass
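This is the mirror image of the previous problem: here the stream already hands back str, and str has no decode method in Python 3. A rough sketch of a line reader that copes with both kinds of stream follows; read_text_line is a name I made up, it is not an nltk function.

```python
import io


def read_text_line(stream, encoding='utf-8'):
    """Read one line and return it as str, whatever the stream's mode."""
    line = stream.readline()
    if isinstance(line, bytes):
        # Binary stream: decode, silently dropping undecodable bytes.
        line = line.decode(encoding, 'ignore')
    return line


# The same text read through a binary stream and through a text stream.
binary_line = read_text_line(io.BytesIO('ma\u00f1ana\n'.encode('utf-8')))
text_line = read_text_line(io.StringIO('ma\u00f1ana\n'))
```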

>>> nltk.tag.brill.demo()
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\tag\brill.py", line 1308, in demo
    print_rules = file(rule_output, 'w')
NameError: global name 'file' is not defined

Well, file no longer exists in Python 3. Change it to open in brill.py, line #1308:

print_rules = open(rule_output, 'w')

While we are at it, change line #1313 as well:

error_file = open(error_output, 'w')
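For completeness, a tiny sketch of the open replacement in isolation. The helper write_report, the temporary path and the sample lines are all invented for the illustration; only open itself comes from the fix above.

```python
import os
import tempfile


def write_report(path, lines):
    """Write each line to path using open(), the Python 3 replacement for file()."""
    # The with block closes the file even if a write fails.
    with open(path, 'w') as handle:
        for line in lines:
            handle.write(line + '\n')


# Hypothetical output location, standing in for rule_output.
rule_output = os.path.join(tempfile.mkdtemp(), 'rules.out')
write_report(rule_output, ['rule one', 'rule two'])
```

The with statement is not required for the fix, but it saves having to remember the matching close call.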

That's it. That was chapter five. We are getting nearer to the core.

2 comments:

Anonymous, May 12, 2012 at 4:14 AM
change line #577,
try:
    line = line.decode('utf-8','ignore')
except:
    pass

Unknown, November 12, 2014 at 9:25 PM
You have just saved me, thanks very much! However, in order to make pos_tag work I had only to do step number 1. Thanks nevertheless!


Saturday, December 17, 2011
