Saturday, December 17, 2011

Using nltk with Python 3 (5)

And on to some work with chapter five.



>>> nltk.pos_tag(text)
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\tag\__init__.py", line 64, in pos_tag
    tagger = nltk.data.load(_POS_TAGGER)
  File "...\lib\site-packages\nltk\data.py", line 594, in load
    resource_val = pickle.load(_open(resource_url))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcb in position 0: ordinal not in range(128)
Now we are finally getting to the root of the string/bytes problem: the tagger was pickled under Python 2, and Python 3 tries to decode its 8-bit strings as ASCII. In data.py, change line #594 to:
try:
    resource_val = pickle.load(_open(resource_url))
except UnicodeDecodeError:
    # Retry, reading the pickle's 8-bit strings as latin-1 instead of ASCII.
    resource_val = pickle.load(_open(resource_url), fix_imports=True, encoding='latin-1', errors="ignore")
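The same fallback can be tried outside of nltk. The following is only a minimal sketch: the pickle bytes are hand-written to mimic a Python 2 `str` containing a non-ASCII byte, and `load_legacy_pickle` is a hypothetical helper name, not part of nltk.

```python
import pickle

# Hand-written protocol 0 pickle of the Python 2 byte string 'caf\xe9'
# (illustrative data only, not taken from nltk).
PY2_PICKLE = b"S'caf\\xe9'\n."


def load_legacy_pickle(data):
    """Unpickle data, retrying with latin-1 if 8-bit strings fail to decode."""
    try:
        # Python 3 decodes Python 2 str objects as ASCII by default.
        return pickle.loads(data)
    except UnicodeDecodeError:
        # latin-1 maps every byte to exactly one code point, so this cannot fail.
        return pickle.loads(data, fix_imports=True, encoding='latin-1')


print(repr(load_legacy_pickle(PY2_PICKLE)))
```

Choosing latin-1 keeps every byte recoverable, although text that was really stored in another encoding will come back as the wrong characters.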
>>> nltk.corpus.sinica_treebank.tagged_words()
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\corpus\reader\sinica_treebank.py", line 60, in _read_block
    sent = IDENTIFIER.sub('', sent)
TypeError: can't use a string pattern on a bytes-like object
In corpus/reader/sinica_treebank.py, add after line #59:
try:
    sent = sent.decode()
except AttributeError:
    # sent is already a str, so there is nothing to decode.
    pass
>>> nltk.corpus.conll2002.tagged_words()
Traceback (most recent call last):
  File "C:\Program Files (x86)\Python322\lib\site-packages\nltk\corpus\reader\util.py", line 577, in read_blankline_block
    line = stream.readline().decode('utf-8','ignore')
AttributeError: 'str' object has no attribute 'decode'
In corpus/reader/util.py, replace line #577 with:
line = stream.readline()
try:
    line = line.decode('utf-8','ignore')
except AttributeError:
    # line is already a str, so there is nothing to decode.
    pass
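Both the sinica_treebank.py and util.py fixes deal with the same situation: a value that may arrive as either bytes or str. As a rough sketch of an alternative to try/except, an explicit type check does the same job. The helper name `ensure_text` is made up for illustration and does not exist in nltk.

```python
def ensure_text(value, encoding='utf-8', errors='ignore'):
    """Return value as str, decoding it only if it is a bytes object."""
    if isinstance(value, bytes):
        return value.decode(encoding, errors)
    # Already text (or something else entirely): hand it back unchanged.
    return value


print(ensure_text(b'word/NN'))
print(ensure_text('word/NN'))
```

The type check makes the intent visible and does not hide unrelated errors the way a bare except can.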
>>> nltk.tag.brill.demo()
Traceback (most recent call last):
  File "...\lib\site-packages\nltk\tag\brill.py", line 1308, in demo
    print_rules = file(rule_output, 'w')
NameError: global name 'file' is not defined
Well, the file builtin no longer exists in Python 3. Change it to open in brill.py, line #1308:
print_rules = open(rule_output, 'w')
While we are at it, change line #1313 as well:
error_file = open(error_output, 'w')
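For reference, here is a small stand-alone sketch of the open replacement for file, written with a with block so that the file is closed automatically. The temporary path and the text written are placeholders; only the variable names are borrowed from brill.py.

```python
import os
import tempfile

# Placeholder output path; brill.py takes rule_output as an argument to demo().
rule_output = os.path.join(tempfile.mkdtemp(), 'rules.out')

# open() is the Python 3 replacement for the removed file() builtin.
with open(rule_output, 'w') as print_rules:
    print_rules.write('example rule\n')

with open(rule_output) as saved:
    contents = saved.read()

print(contents, end='')
```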
That's it. That was chapter five. We are getting nearer to the core.

2 comments:

  1. change line #577,
    try:
        line = line.decode('utf-8','ignore')
    except:
        pass

  2. You have just saved me, thanks very much! However, in order to make pos_tag work I had only to do step number 1. Thanks nevertheless!

