Adventures of a Computer Scientist: Using nltk with Python 3 (3)

Next is chapter three.
While we are at it, we will also fix the included parsers, as I was playing with them and encountered some errors.

First, the parsers. While trying to run them, I ran into a grave error insisting that some parsers do not exist. That was due to an only partially complete wordnet_app.py. I am not sure why; it might be some git bug, or it might be Windows-specific. However, simply copying the missing lines from the 2.x nltk branch fixes it (only one print function has to be modified). The complete file will be in the download link at the end.

My Python environment here caused a new error to occur, relating to tkinter, Python's standard GUI package. Try entering

import tkinter
tkinter._test()

If this results in an error, add your tkinter path to your Python configuration. Simply run

import sys
sys.path.append('C:\\Program Files (x86)\\Python3\\Lib\\tkinter')

Change it to your Python path, of course.
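
Backslashes in Windows paths are easy to get wrong, because a sequence such as \t inside a normal string literal is read as a tab character. A minimal sketch using a raw string instead; the path is only the example from above, so adjust it to your own installation:

```python
import sys

# Raw string: backslashes are kept literally, so "\t" is not turned into a tab.
# The path itself is just the example from this post, not a required location.
tkinter_dir = r'C:\Program Files (x86)\Python3\Lib\tkinter'

# Avoid adding the same entry twice when the snippet is run repeatedly.
if tkinter_dir not in sys.path:
    sys.path.append(tkinter_dir)
```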

Now on to the parsers.
Chartparser nltk.app.chartparser() in app/chartparser_app.py.
Edit the font calls on lines #976 - #982

self._boldfont = tkinter.font.Font(family='helvetica', weight='bold',
        size=self._fontsize)
self._font = tkinter.font.Font(family='helvetica',
        size=self._fontsize)
# See: <http://www.astro.washington.edu/owen/ROTKFolklore.html>
self._sysfont = tkinter.font.Font(font=tkinter.Button()["font"])
root.option_add("*Font", self._sysfont)

The same for the lines #1696 - #1706

self._sysfont = tkinter.font.Font(font=tkinter.Button()["font"])
root.option_add("*Font", self._sysfont)

# What's our font size (default=same as sysfont)
self._size = tkinter.IntVar(root)
self._size.set(self._sysfont.cget('size'))
self._boldfont = tkinter.font.Font(family='helvetica', weight='bold',
        size=self._size.get())
self._font = tkinter.font.Font(family='helvetica',
        size=self._size.get())

Next is the string joining, which should have been handled by 2to3.
Edit line #1115 to look like this:

rhs = ' '.join(rhselts)

Lines #1213:

rhs1 = ' '.join(rhs[:pos])
rhs2 = ' '.join(rhs[pos:])

Line #2075:

sentence = ' '.join(self._tokens)
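
For reference, here is a small sketch of what these join calls do. The sample values are made up for illustration, and I am assuming the old code relied on the Python 2 string module functions, which no longer exist in Python 3:

```python
# Hypothetical right-hand side of a grammar rule, split into elements.
rhselts = ['Det', 'N', 'PP']

# Python 3: join is a method of the separator string.
rhs = ' '.join(rhselts)

# The same idea for the part before and after a dot position in a rule.
pos = 1
rhs1 = ' '.join(rhselts[:pos])
rhs2 = ' '.join(rhselts[pos:])
```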

Last change here. Weird, line #939 is commented out. Simply activate it again:

self._init_fonts(root)

Chunkparser nltk.app.chunkparser() in app/chunkparser_app.py.
Same again, change the tkFont calls and the int conversion in lines #371 - #376

self._size = IntVar(top)
self._size.set(20)
self._font = tkinter.font.Font(family='helvetica',
        size=-self._size.get())
self._smallfont = tkinter.font.Font(family='helvetica',
        size=-int((self._size.get()*14/20)))

To fix a couple of integer errors, now edit chunk/regexp.py, line #132:

for i in range(int(1+len(brackets)/5000)):
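
The reason behind this kind of fix: in Python 3 the / operator always returns a float, and range() only accepts integers. A short sketch with a made-up input string; for non-negative values the floor division operator // gives the same result as wrapping the division in int():

```python
# Hypothetical bracket string, just to have a length to divide.
brackets = '{' * 12000

chunks_float = 1 + len(brackets) / 5000    # true division, gives a float
chunks = 1 + len(brackets) // 5000         # floor division, gives an int

# range() rejects the float version with a TypeError.
try:
    range(chunks_float)
    float_ok = True
except TypeError:
    float_ok = False

indices = list(range(chunks))
```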

Collocations in app/collocations_app.py.
The trend should be obvious now, simply change every occurrence of

tkFont.Font

into

tkinter.font.Font

The next change is somewhat weird. Change line #195:

def next(self):
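
A possible explanation, and this is only my assumption: 2to3 has a fixer that renames every method called next to __next__, because that is the new name in the iterator protocol. If the method is just an ordinary "next page" handler that gets called by name, the rename breaks it. A sketch with a hypothetical class (Pager is not part of nltk):

```python
class Pager:
    """Hypothetical stand-in for an app class with a 'next page' button."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def next(self):
        # Ordinary method, called by name (for example as a button command).
        # It is not part of the iterator protocol, so it must keep this name.
        if self.index < len(self.pages) - 1:
            self.index += 1
        return self.pages[self.index]


pager = Pager(['page 1', 'page 2'])
current = pager.next()
```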

The next error took a while to figure out (Thanks, nltk error handling).
Edit corpus/reader/api.py, change line #309 into:

try:
    file_id, categories = line.decode().split(self._delimiter, 1)
except AttributeError:  # line is already a str, which has no decode()
    file_id, categories = line.split(self._delimiter, 1)
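
An alternative to the try/except is to check the type explicitly. This is only a sketch: to_text is a helper name I made up, and the delimiter and the sample line are invented for the example:

```python
def to_text(line, encoding='utf-8'):
    """Return line as str, whether it arrives as bytes or as str."""
    if isinstance(line, bytes):
        return line.decode(encoding)
    return line


delimiter = ' '  # hypothetical delimiter, standing in for self._delimiter

# Works for a bytes line read from a binary file ...
file_id, categories = to_text(b'ca01 news').split(delimiter, 1)

# ... and for a line that is already a str.
file_id2, categories2 = to_text('ca02 editorial reviews').split(delimiter, 1)
```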

Concordance in app/concordance_app.py.
Change every occurrence of

tkFont.Font

into

tkinter.font.Font

Change line #262 to:

def next(self):

RDParser nltk.app.rdparser() in app/rdparser_app.py.
Add the following import after line #69:

import tkinter

Now the font calls for tkinter have to be changed. Edit the lines #140 - #147 to look like this:

self._boldfont = tkinter.font.Font(family='helvetica', weight='bold',
        size=self._size.get())
self._font = tkinter.font.Font(family='helvetica',
        size=self._size.get())
if self._size.get() < 0:
    big = self._size.get() - 2
else:
    big = self._size.get() + 2
self._bigfont = tkinter.font.Font(family='helvetica', weight='bold',
        size=big)

SRParser nltk.app.srparser() in app/srparser_app.py.
Change every occurrence of

tkFont.Font

into

tkinter.font.Font

Now open up draw/util.py.

Change line #1794:

for x in range(left, right-w, int((right-left-w)/10)):

Change line #1798:

for y in range(top, bot-h, int((bot-top-h)/10)):

Change line #1799:

for x in range(left, right-w, int((right-left-w)/10)):
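
One thing to keep in mind with these loops: if the remaining span is smaller than 10, the computed step becomes 0, and range() raises a ValueError for a step of 0. Whether that can actually happen in the app I have not checked. A defensive sketch, with grid_positions being a hypothetical helper and the numbers made up:

```python
def grid_positions(start, stop, width, steps=10):
    """Positions from start up to stop - width, in roughly `steps` increments."""
    # Floor division keeps the step an int; max() guards against a step of 0.
    step = max(1, (stop - start - width) // steps)
    return list(range(start, stop - width, step))


xs = grid_positions(0, 100, 20)     # span 80, step 8
tight = grid_positions(0, 25, 20)   # span 5, step would be 0 without the guard
```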

Wordnet.
Not sure yet, I had trouble getting the 2.x version to run as well, so we will come back here later.

Finally, on to chapter three.
Quite a lot works instantly now. The sentence tokenizer, however, still fails:

>>> sents = sent_tokenizer.tokenize(text)
File "...\lib\site-packages\nltk\tokenize\punkt.py", line 1150, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: can't use a string pattern on a bytes-like object

In punkt.py, line #1150, convert text to a string object. Add before the offending line:

try:
    text = text.decode('utf-8','ignore')
except AttributeError:  # text is already a str
    pass
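
The error is easy to reproduce outside of nltk: a pattern compiled from a str cannot be applied to bytes. A small sketch with an invented pattern and input, showing the failure and the same decode fix:

```python
import re

# A str pattern, like the ones punkt compiles internally.
pattern = re.compile(r'\w+')
raw = b'Hello world.'

# Applying a str pattern to bytes raises a TypeError.
message = None
try:
    pattern.findall(raw)
except TypeError as err:
    message = str(err)

# After decoding, the same pattern works; 'ignore' drops undecodable bytes.
text = raw.decode('utf-8', 'ignore')
words = pattern.findall(text)
```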

And that should be chapter three.

Adventures of a Computer Scientist

Saturday, December 17, 2011
