Friday, December 16, 2011

Using nltk with Python 3 (0)

Recently, I started doing some NLP with Python, and nltk is one of those packages that only run under Python 2. As Python 3 introduced some nice additional features, dropped some historically accumulated quirks, and really improved Unicode support, I would much prefer using Python 3 while testing nltk and developing my own solutions. A short overview of the most important changes can be found in What's New In Python 3.0 and should prove quite useful in the conversion process.

Fortunately, a Python 3 branch has already been created, mostly using 2to3 plus a couple of additional manual changes, at least according to the project's history. Under Windows, the first step was to build an installable package with distribute, using the command
python setup-distutils.py bdist_wininst --target-version 3.2 --user-access-control force
or, alternatively, to create an MSI installer using
python setup-distutils.py bdist_msi --target-version 3.2
Under *nix systems the installation should be even easier.

Under Python 3, nltk seems to run nicely at first, but while working through the nltk book, quite a lot of the example code raises exceptions. Most of them are due to the changes in string, byte, and Unicode handling in Python 3. Some are due to the pickled data files and will be much harder to fix until the nltk developers provide compatible versions.
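To illustrate the kind of exception meant here, the following is a minimal sketch using only the standard library, not nltk itself. The sample word and the hand-written pickle bytes are made up for illustration; whether a given nltk data file fails in exactly this way is an assumption.

```python
import pickle

# In Python 3, str holds Unicode text and bytes holds raw data;
# the two are never converted into each other implicitly.
raw = "caf\u00e9".encode("utf-8")   # bytes, e.g. read from a file in binary mode
text = raw.decode("utf-8")          # str, after an explicit decode

try:
    "word: " + raw                  # worked with 8-bit strings in Python 2
except TypeError as exc:
    mixing_error = type(exc).__name__

# Hand-written protocol 2 pickle of the Python 2 byte string 'caf\xe9'
# (illustrative data only, not taken from an nltk corpus).
PY2_PICKLE = b"\x80\x02U\x04caf\xe9q\x00."

try:
    pickle.loads(PY2_PICKLE)        # default encoding is ASCII
except UnicodeDecodeError as exc:
    default_error = type(exc).__name__

# Telling the unpickler how to decode Python 2 byte strings avoids the error.
as_text = pickle.loads(PY2_PICKLE, encoding="latin-1")
```

The `encoding` argument only helps when the code controls the `pickle.load` call; choosing the right encoding for a particular data file still requires knowing how it was written.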

While fixing some of the errors, I will document them here in future posts, mostly because they are quick fixes rather than the thorough changes to the underlying structures that would be the proper way. I hope to become familiar enough with the source to implement those later on, should I finally have enough time on my hands.

Note: Python 3.2 has some trouble importing pickled data files. The bug was fixed in Python 3.2.2. Update if needed, as this fixes at least some errors while using the corpora.
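A quick way to check whether the running interpreter already includes that fix is to compare its version against 3.2.2. This is only a sketch; the helper name `has_pickle_fix` is made up for illustration.

```python
import sys

# Release that contains the pickle fix, according to the note above.
FIXED_IN = (3, 2, 2)

def has_pickle_fix(version_info=sys.version_info):
    """Return True if the given interpreter version is 3.2.2 or newer."""
    return tuple(version_info[:3]) >= FIXED_IN

if not has_pickle_fix():
    print("Python %d.%d.%d predates 3.2.2; consider updating." % sys.version_info[:3])
```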

Update: When all is done, there will be a more systematic overview.
