Adventures of a Computer Scientist: December 2011

Sunday, December 18, 2011

Porting nltk to Python 3 - A systematic approach

Update: Look here for Python 3 development. You might also want to visit the user group or the development group.

Having worked through the nltk book, I have finally started working somewhat more systematically. I am currently setting up a number of automatic tests (the official ones seem to be quite outdated, at least according to the source/issue tracker) and working on the identified classes of issues.
I also cleaned up some of the weirder fixes I made last time (a general except clause, a loop inside a try, ...). The next step was looking for all file readers that read bytes and changing them into string (Unicode) readers.
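
To illustrate the bytes-versus-string reader change, here is a minimal sketch. It is not taken from the nltk source; the helper names and the file name stopwords.txt are made up for this example, and UTF-8 is an assumed encoding.

```python
import os
import tempfile


def read_words_bytes(path):
    """Binary mode: every token comes back as a bytes object."""
    with open(path, "rb") as handle:
        return handle.read().split()


def read_words_text(path, encoding="utf-8"):
    """Text mode with an explicit encoding: every token is a str."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read().split()


# Write a tiny hypothetical word list, then read it back both ways.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "stopwords.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("und\nüber\n")
    as_bytes = read_words_bytes(path)
    as_text = read_words_text(path)

print(as_bytes)  # bytes tokens, not directly comparable to str tokens
print(as_text)   # str tokens, usable for membership tests against text
```

Under Python 3 a str never compares equal to a bytes object, so a membership test with a str word against the bytes list silently fails. That is the kind of error the string readers avoid.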

Everything in chapter two should now work, including the stop words examples (fixed now) and the toolbox (fixed earlier in another chapter).

So far, errors are mainly caused by:

String/bytes/Unicode
Division of integers returning a float instead of a floored integer
Differences between iterable objects and lists
Comparison no longer works with non-comparable objects (especially while sorting lists)
tkinter name changes
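
The categories above can be sketched in a few lines of Python 3. This is an illustration only, not code from nltk; all function names and the sort key are assumptions chosen for this example.

```python
def decode_line(raw, encoding="utf-8"):
    """String/bytes/Unicode: bytes must be decoded explicitly to get a str."""
    return raw.decode(encoding)


def midpoint(items):
    """Division: use // where an integer index is needed; / yields a float."""
    return items[len(items) // 2]


def first_key(mapping):
    """Iterables: dict.keys() is a view, so wrap it in list() before indexing."""
    return list(mapping.keys())[0]


def sort_mixed(values):
    """Comparison: mixed types need an explicit key to be sortable."""
    return sorted(values, key=lambda v: (type(v).__name__, str(v)))


# tkinter: a few Python 2 module names and their Python 3 equivalents.
TK_RENAMES = {
    "Tkinter": "tkinter",
    "tkMessageBox": "tkinter.messagebox",
    "tkFileDialog": "tkinter.filedialog",
    "tkFont": "tkinter.font",
}

print(decode_line(b"caf\xc3\xa9"))
print(7 / 2, 7 // 2)
print(midpoint(["a", "b", "c", "d", "e"]))
print(first_key({"the": 3, "cat": 1}))
print(sort_mixed(["b", 2, None, "a", 1]))
```

The sort key groups values by type name first, which is an arbitrary but stable ordering. Whether that matches what the original Python 2 code intended has to be checked case by case.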

That's the gist for now, but I am sure more will come to light while working on it.

Here are the changed files: nltk_rev_1_changes.zip
Here is the complete source: nltk_rev_1_complete.zip
Here is a Windows installer (32-bit): nltk_rev_1_win32_installer.msi

Using nltk with Python 3 (overview)

I am finally done getting at least the code from the official nltk book to work under Python 3. Aside from two things that do not work yet (they will be covered later and might be due to changes in the nltk code base), it runs flawlessly.

I learned quite a lot about the nltk source code, so from now on there will be a more systematic approach, though I probably still won't change the stream readers for now. I found the official book somewhat lacking: I did not want a general Python tutorial, and I prefer a more consistent introductory approach to language processing than a book that seems to be aimed at students offers. So I finally visited the nearest library. Now I will test my changes to the source on the examples and exercises found in the following books (in no particular order):

McNeil, J. (2010): Python 2.6 Text Processing. The easiest way to learn how to manipulate text with Python.
Perkins, J. (2010): Python Text Processing with NLTK 2.0 Cookbook.

Should I find the time, I will accompany the simple code changes with some snippets I am working on. I am thinking mainly about implementing parallel NLP tasks to finally apply my basic knowledge of Python multiprocessing and/or MapReduce/Hadoop.

Here are the changed files: (see here)
Here is the complete nltk source with all changes: (see here)
Here is a Windows installer for nltk under Python 3 (x86): (see here)
Here is a complete and short list, consisting of all changes made to the nltk source: (TBD)

All parts from this post series:

Using nltk with Python 3 (11)

And on to chapter 11: Managing Linguistic Data
It would be nice if the book could be updated some time in the future, as some namespaces are no longer correct. The code still works, though; sometimes it is just a matter of finding the correct name.
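
One way to hunt for a name that has moved is to walk a package and check each submodule for the attribute. The helper below is a hypothetical sketch, not part of nltk. It is demonstrated on the standard library's json package so that it runs without nltk installed.

```python
import importlib
import pkgutil


def find_name(package_name, attribute):
    """Return the dotted names of all modules in a package that expose `attribute`."""
    package = importlib.import_module(package_name)
    hits = []
    if hasattr(package, attribute):
        hits.append(package_name)
    prefix = package_name + "."
    for info in pkgutil.walk_packages(package.__path__, prefix=prefix):
        # Skip script entry points; importing them may have side effects.
        if info.name.endswith("__main__"):
            continue
        try:
            module = importlib.import_module(info.name)
        except Exception:
            # Modules that fail to import are ignored in this sketch.
            continue
        if hasattr(module, attribute):
            hits.append(info.name)
    return hits


print(find_name("json", "JSONDecoder"))
```

Note that this imports every submodule it visits, which can be slow for a large package. Modules that merely re-import the name are listed alongside the one that defines it.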

Using nltk with Python 3 (7)

Chapter seven: Text extraction.

Using nltk with Python 3 (6)

And chapter six it is.

The Bayes classifier raises an error, but it does so in Python 2.7 as well. Maybe the book is wrong here or uses a dated version? I will skip it for now, as the classifiers will be looked at more closely later on, when the official nltk book is done. There are far more interesting books concerning nltk out there.

Using nltk with Python 3 (5)

And on to some work with chapter five.

Using nltk with Python 3 (4)

Chapter four should already work.

Using nltk with Python 3 (3)

Next is chapter three.
While at it, we will fix the included parsers, as I was playing with them and encountered some errors.

Using nltk with Python 3 (2)

This time we will update nltk to fully work with the code samples from chapter two.

Using nltk with Python 3 (1)

Let's start by working through chapter 1 of the official nltk book.

Using nltk with Python 3 (0)

Recently, I started doing some NLP using Python, and nltk is one of those packages that only run under Python 2. As Python 3 introduced some nice additional features, dropped some historically accumulated quirks, and really improved Unicode support, I would much prefer using Python 3 while testing nltk and developing my own solutions. A short overview of the most important changes can be found under What's New In Python 3.0 and should prove quite useful in the conversion process.

Fortunately, a Python 3 branch has already been created, using mostly 2to3 and a couple of additional manual changes, at least according to the project's history. Under Windows, the first step was to create an installable package with distribute, using the command

python setup-distutils.py bdist_wininst --target-version 3.2 --user-access-control force 

or, alternatively, to create an MSI installer using

python setup-distutils.py bdist_msi --target-version 3.2

Under *nix systems the installation should be even easier.

Nltk under Python 3 seems to run nicely at first, but while working through the nltk book, quite a lot of the example code raises exceptions. Most of them are due to the changes in string, bytes, and Unicode handling in Python 3. Some are due to the pickled data files and will be much harder to fix until the nltk developers provide compatible versions.

While fixing some of the errors, I will document them here in future posts, mostly because they are quick fixes, not thorough changes to the underlying structures, as would be the proper way. I hope to become familiar enough with the source to implement those later on, should I finally have enough time on my hands.

Note: Python 3.2 has some trouble importing pickled data files. The bug was fixed in Python 3.2.2. Update if needed, as this fixes at least some errors while using the corpora.
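
Separately from that interpreter bug, data pickled under Python 2 can fail to load under Python 3 when it contains non-ASCII byte strings. The sketch below only illustrates this general issue and is not necessarily the problem fixed in 3.2.2. The pickle bytes are hand-written for this example, and latin-1 is an assumed encoding.

```python
import pickle

# Hand-written protocol 0 pickle of the Python 2 byte string 'caf\xe9'
# (an assumption for illustration, not a real nltk data file).
PY2_PICKLE = b"S'caf\\xe9'\n."


def load_py2_pickle(data, encoding="latin-1"):
    """Load a Python 2 pickle, decoding its 8-bit strings with `encoding`."""
    return pickle.loads(data, encoding=encoding)


try:
    pickle.loads(PY2_PICKLE)  # the default encoding is ASCII
except UnicodeDecodeError as error:
    print("default load failed:", error)

print(load_py2_pickle(PY2_PICKLE))                    # decoded to str
print(load_py2_pickle(PY2_PICKLE, encoding="bytes"))  # kept as bytes
```

Passing encoding="bytes" keeps the original byte strings untouched, which is the safer choice when the encoding of the pickled data is unknown.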

Update: When all is done, there will be a more systematic overview.
