Paraglare is Doubleplusgood

Every now and again I look for a Python parser framework. I tend to want something highly declarative but also as easy to use and debug as PetitParser (which inclines me toward PEG grammar systems). I’d really like multi language targeting so people can reuse my grammars. (But then I want the actions to be specifiable as well.)

Usually I’m disappointed. I read about a bazillion systems and nothing happens.

I did this again the other day because I wanted to write my reasoner tests in Jewel and 1) I didn’t want to find, much less get working, my Smalltalk Jewel parser and 2) hooking that up to Python tests seemed even more awful.

So, I did my usual cursory look and stumbled on Paraglare. Feature list looked promising…esp. the idea of good error recovery so I installed it…

…and it just worked! What a treat! Here’s my grammar:

ONT: AX*;

CE: CE '&' CE  {left, 1}
| '' CE  {left, 1}
| name;

AX: CAX | RAX;
CAX: CE '=>' CE end;
RAX: name '==>' name end;

terminals
name: /[A-Za-z][A-Za-z0-9]*/;
end: /\./;

And here are my actions:

actions = {
"CAX": [lambda _, nodes: SubClassOf(nodes[0], nodes[2])],
"RAX": [lambda _, nodes: SubPropertyOf(Property(nodes[0]), Property(nodes[2]))],
"CE": [lambda _, nodes: And(nodes[0], nodes[2]),
lambda _, nodes: Some(Property(nodes[1]), nodes[3]),
lambda _, nodes: Class(nodes[0])]}

Daaaaamn that was easy! It’s not a complex grammar but still. It’s parsing my tests no problem!

Good stuff.

Nuitka Markdown Performance

Mistune has a…not terrific…benchmark script that tests a bunch of different parsers. There’s a bit of discussion in Hsiaoming Yang’s (the author of Mistune) blog post, though that’s more a reflective review of Python accessible Markdown parsers. It keeps everything in one process and just parses the same document over and over. But hey! It’s there and why not use it to try for a broader view of how well Nuitka does on optimising string heavy applications.

Some of the parsers didn’t work after pip installing so I dropped them. First, let’s look at the results from the blog post:

Parsing the Markdown Syntax document 1000 times...
Mistune: 12.7255s
Mistune (with Cython): 9.74075s
Misaka: 0.550502s
Markdown: 46.4342s
Markdown2: 78.2267s

Ok, I won’t have the Cython version but fine! First we have regular Python:

$python bench.py Parsing the Markdown Syntax document 1000 times... mistune: 14.963354 misaka: 0.43727499999999964 markdown: 48.857309 markdown2: 311.745707 hoep: 0.5286040000000298$ ./bench.bin
Parsing the Markdown Syntax document 1000 times...
mistune: 12.439452999999999
misaka: 0.39941600000000044
markdown: 37.808417999999996
markdown2: 631.708383
hoep: 0.4311249999999518

Ok wackiness. First, I’d guess my machine is roughly similar to HY’s, just looking at the times and his reporting that he ran on a MacBook Air (mine’s pretty old).

However, markdown2 is way out of whack, being much much slower on my machine. I’d guess it’s a different version?

The second wackiness is that compiling with Nuitka doubles the time to complete the benchmark. I did read that this can happen (in a bug report thread). It seems to be an issue with dynamic libraries.

So Mistune with Cython takes 76% of the time as without, while Nuitka compiled Mistune takes 83% of the time. That’s pretty good esp. as I didn’t have to touch a line of code with Nuitka. While Cython claims to compile normal Python code (getting better speedups as you add typing info), just looking at the tutorial suggests that even given that, Cython wants you to do a lot of stuff to get it to work, e.g., .pyx files and setup.py shenanigans. Maybe for another day, esp. as Mistune doesn’t seem to have Cython support anymore.

One clear message is always test and compare. The markdown2 slowdown is a bit of surprise. I didn’t do a PyInstaller version of the bench.py script to see how sizes compare (bench.bin is about 3.6 megs). This seems like a great test bed for a good N-Version testing framework.

I have to wonder how hard it would be to have Nuitka either compile Cython or compile to Cython, esp as it gets its type inference going.

First Nuitka Experiment

I generally use PyInstaller to package up my Python command line apps. Its a bit finicky but gives me a single file executable in the end. Most of my small pool of users aren’t going to want to install and run a (complex with lots of dependencies) Python script and often don’t have the 3.x version of Python I’m using.

PyInstaller basically makes a self contained virtual environment so includes a copy of the interpreter and installed site packages (as well, I guess, as the appropriate standard library). For my exam setting tool, this results in a self contained file of around 8.9 megs. Not huge at all, but not trivial for e.g., email. It also doesn’t compress well, only going from 8.9 megs to 8.8.

It also doesn’t cross compile. I would think that a fat “binary” wouldn’t add a lot of weight. Anyway, PyInstaller doesn’t do it.

Today I decided to see if I could use Nuitka as a PyInstaller replacement. Incidentally, I could try some performance testing, which has not been available on the Nuitka website for quite a while.

PyInstaller and Nuitka are very different beasts. PyInstaller gathers up the existing scripts and interpeter and hides them in a zipped directory. When you run it, it “unfolds” all that into a temporary site and then runs things ‘normally’.

Nuitka is a Python to C compiler. It transforms your Python code into C code the compiles it into a binary. Right now, Nuitka links to libpython which is a good chunk of the interpreter. But still!

First, Nuitka worked great out of the box with this command:

python -m nuitka --follow-imports setexams.py

It even grabbed my LaTeX template and shoved it in. Maybe it’s exploited the work I did to force PyInstaller to do that…I don’t know. It seemed magically.

Here’s the sizes:

-rwxr-xr-x@ 1 bparsia  staff  8895688  2 Nov 14:37 setexams
-rwxr-xr-x  1 bparsia  staff  2058336 26 Jan 08:33 setexams.bin
-rw-r--r--  1 bparsia  staff   684759 26 Jan 09:23 setexams.bin.zip

From almost 9 megs to just over 2. Nice! And what’s better, is that it compresses really well, down to 684k.

That’s a considerable win for me.

Now Nuitka is primarily designed as an optimising compiler, so maybe we’ll see a speed up? I have a Markdown processor, a lot of string hacking for the parser, and interpolation for the generators. I took a simple example and repeated questions until I had an input of 4.5 megs.

$time ./setexams -g --pdf=None performance_test.exam real 0m10.570s user 0m9.613s sys 0m0.253s$ time ./setexams.bin -g --pdf=None performance_test.exam
real    0m8.629s
user    0m7.918s
sys     0m0.218s

There’s a small gain which I wouldn’t be surprised is start up time. Oh well, it’d have been cool if Nuitka produced a big speed gain right off the bat, but I’m not terribly bothered. I expect string handling hasn’t been seriously optimised.

However, this does suggest that I should put together some tests cases for Mistune and the like. Speeding up Markdown processing is likely an attention getting win for Nuitka.

The size benefits are a sufficient win for me.

Ideas Not Working Out (or WILL THEY?!?!?)

Well this is a bit discouraging. I was experimenting with implementing an ELK style EL classification oriented reasoner primarily using SQLite. I love SQLite and it is super cool. I want to do more with it. If you look at the algorithms in the Incredible ELK paper you see that they involve a fair number of joins and indexes, etc. They even discuss join order and reference some attempts to implement on a database. I figured using SQLite in memory databases could be a nice win.

Nope nope nope nope. Well, not yet. The problem is that it’s way to slow to interact with the database. Consider:

43393 loops done in 1.2992407480875652 minutes
3224564 function calls in 77.975 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)
183895 70.709 0.000 75.125 0.000 {method 'execute' of 'apsw.Cursor' objects}
...
28951 0.079 0.000 34.495 0.001 calf.py:443(in_subs)

I’m testing for whether a subsumption in my todo list is in the closure. I’m using a query:

SELECT 1 FROM subs
WHERE subclass = ? and superclass = ?

Ok that should probably be an EXISTS, but still. It’s called a fair bit (28,951 times!).

My subsumptions here are just binary tuples of integers, so I can use a Python set to track ’em easily enough. And here’s the results:

43393 loops done in 0.6202206969261169 minutes
2819244 function calls in 37.231 seconds

Ordered by: internal time

ncalls tottime percall cumtime percall filename:lineno(function)
154944 32.796 0.000 35.274 0.000 {method 'execute' of 'apsw.Cursor' objects}
...
28951 0.025 0.000 0.053 0.000 calf.py:443(in_subs)

Well that leaves a mark.

It’s an in memory database and I’ve dorked with the transactions and no joy.

Now it’s not hopeless maybe. Consider these:

22592   0.320 SELECT ?, superclass FROM concIncs  WHERE concIncs.subclass = ?
22591   0.302 SELECT type, arg1, arg2 FROM IdxConcept WHERE id=?
22592   0.180 INSERT INTO subs VALUES (?, ?)
14442   0.141 SELECT 1 FROM inits WHERE id = ?

Same order of magnitude of queries. Sometimes returning more data. The INSERTs into subs are pretty fast. Is it just the conjunction?

Oh crap. I didn’t have a primary key on subs. It’s compound and I forgot…D’OH! Let’s fix that and…

43393 loops done in 0.14656875133514405 minutes
3224564 function calls in 8.811 seconds

Ordered by: internal time

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
183895    6.106    0.000    7.675    0.000 {method 'execute' of 'apsw.Cursor' objects}
...
28951     0.033    0.000    0.462    0.000 calf.py:443(in_subs)

Well, that’s much better! Game on! The links query is still a bit slow:

22592  4.977 SELECT links.subclass as E, negExists.existential as F
FROM  (links JOIN hier as h1 ON links.role = h1.subrole),
(hier as h2 JOIN negExists ON h2.superrole = negExists.role)
WHERE links.filler = ? AND negExists.filler = ?

No obvious issues. I’m unclear whether an index will help…but stilL! Back in the game!

Two Months of Nearly Daily Blogging to Go

I just might make it.

It’s been touch and go with classes and illness and feeling overwhelmed. I’ve not had the energy to write enough analytical stuff. My backlog there is growing.

I did fix a superfun bug in my exam authoring tool. It is a classic “dynamic typing plus truthiness” bug that gets static typing folks very excited.

I have a Python object that represents true-false questions. It inherits from an abstract question object. So far, fine. It is parsable from and printable to a simple format eg:

Q: This sentence is true! + F Marks: 1 By: BJP

(Yes the sentence is true if you assume it’s true and false if you assume it’s false.)

This is fine until I used the “shuffle” option which reorganises the exam and question options. What I was getting out was:

Q: This sentence is true! + T Marks: 1 By: BJP

This is bad! And has nothing to do with shuffling. The culprit looked like this:

options = '+ T' if self.key else '+ F'

This would have worked if the key was a Boolean. But it was a string: “True” or “False”. Any non empty string is truthy so the else never gets taken.

D’oh!

Either types on variables or no truthiness would have flushed this out. It was damned hard to spot esp when I wrote tests that constructed the object directly and set the key to a Boolean. (Unit testing didn’t work! It needed an integration test at least.)

Oy!

Two months to go…

Amit Patel’s multi part discussion of A* (esp in games) is a very nice read. Even if you just read the first page, you’ll get a clear picture of various pathfinding algorithms and their trade offs. So later bits aren’t quite as nice to follow (e.g., the trade offs between terms in the heuristic could be fruitfully visually indicated), but overall, it’s great.

It’s part of a series teaching math and computer science via games. Not necessarily wildly original in concept (cf AI: A Modern Approach), but here execution is everything. Check out the discussion of pathfinding for a tower defence game.

PythonTeX

I went down the rabbit hole trying to code up some simple class data analysis tools. I have a custom script for exam results which uses Jinja2 as a template language and generates a Markdown doc. It’s alright, but I figured there must be something ready made that wasn’t a “note book” and could generate documents and reveal.js slides.

Nope. Not so I could use.

I had knitpy in a tab for forever so I though I’d give a spin. It seems both moribund and broken for all my struggling could demonstrate.

I poked at Pweave. I don’t know if I just lost the will to live or I really couldn’t get it working. Either way I was spending more time looking at the tool than pulling in and massaging the data, so I gave up.

Along the way I came across a paper on PythonTeX which has some interesting arguments:

PythonTeX makes possible reproducible LaTeX documents in which Python, Ruby, and Julia code is embedded. This can mitigate the potential for copy-and-paste errors, simplify the creation of figures, and allow significant portions of documents to be generated automatically. Built-in utilities for dependency tracking reduce the need for makefiles or similar systems. Tight LaTeX integration allows code to adapt to its context in a document. User-defined sessions that run in parallel provide high performance. Synchronization of errors and warnings with document line numbers mean that writing and coding remain efficient. Finally, the depythontex utility ensures that PythonTeX may be used even when plain LaTeX documents are needed.

In particular, they argue that tight integration with the host language is an advantage as you can more easily pass data back and forth:

• As a LaTeX package, PythonTeX allows users to write valid LaTeX documents in which the full power of LaTeX is immediately accessible, rather than hybrid documents that contain LaTeX markup. Unlike in IPython and Pweave, it is simple and straightforward to pass information such as page dimensions from LaTeX into Python. It is even possible to create LaTeX macros that mix LaTeX with other languages.

I think it may be possible to abstract some of that out. I don’t see a strong need for super tight integration to get most of this. But who knows? It’s worth exploring.

Oh Mistune

I am about ready to give up on you.

I gained control of your renderers and they work great! Yay!

Lexers, grammars, all the parsing is a maze of twisty passages where I bark my shins constantly and make no progress.

I tried to enable math support which is a standard contributions and…it’s still not working. I’m going to try this other thing now. As far as I can tell, no one uses the “mistune-contribs” version.

I think this signals the end for me. I’ll have to migrate to something else more managable.

Python Class and Instance Methods Share a Namespace

Which sucks.

Coming from Smalltalk, I expect a lot of polymorphism. Class/instance polymorphism seems pretty obvious…if I send a message to a class it goes to the class for handling. If I send a message to an instance it goes to the instance for handling. the methods can be different for each.

This is really useful for cases where much of the time you have a “fire and forget” method (e.g., serialising an object) for which you want to hide the effort of dorking with an instance in the normal case but sometimes you want to reuse the object. This just happened to me!

class BlackboardDelimitedSerialiser(ExamVisitor):
def serialise_to(self, exam, path):
with path.open('r') as f:
writer = csv.writer(f, delimiter='\t', dialect='excel', quoting=csv.QUOTE_NONE, escapechar='"')
for q_row in self.visit(exam):
writer.write(q_row)

To use this as a one off I have to first instantiate the object. So something like:

bbserialiser = BlackboardDelimitedSerialiser()
bbserialiser.serialise_to(exam, path)

Since in my command line tool, I just do this serialisation once, I hoped to add a convenience class method:

class BlackboardDelimitedSerialiser(ExamVisitor):
@classmethod
def serialise_to(cls, exam, path):
serialiser = cls()
serialiser.serialise_to(exam, path)

def serialise_to(self, exam, path):
with path.open('r') as f:
writer = csv.writer(f, delimiter='\t', dialect='excel', quoting=csv.QUOTE_NONE, escapechar='"')
for q_row in self.visit(exam):
writer.write(q_row)

So my call site would look like:

BlackboardDelimitedSerialiser.serialise_to(exam, path)

But all methods share a namespace so the instance method (which is lexically last) wins.

Boo! I don’t want the names to be different! I started hacking on clever, horrible tricks when I realised that I was being completely daft for this case. If I just add instantiating parens, the call site looks like:

BlackboardDelimitedSerialiser().serialise_to(exam, path)

So my use case was bad. There’s no need for a class method so better not to have the extra method. This won’t be true for other cases (where I need some extra logic in the class method), but let’s not borrow trouble. After all, it’s not even clear that I’ll ever need to reuse the object, so having the class method and an uglier name for the instance method would be fine.

Terminal Funkinesses

Two terminal graphicsy thingies crossed my radar:

1. Brow.sh, a terminal front end for the Web. It uses a headless browser backend in a separate process and renders everything to the terminal even graphics!
2. On a much smaller scale, there’s termgraph.py, a library for rendering bar graphs to the terminal.

I’m pretty excited about the second because, as I’ve whined before, getting a simple graph out of Python is way more difficult than it should be.