Stanford AI Report (2019)

It looks pretty well thought out:

  • Between 1998 and 2018, the volume of peer-reviewed AI papers has grown by more than 300%, accounting for 3% of all peer-reviewed journal publications and 9% of published conference papers.
  • China now publishes as many AI journals and conference papers per year as Europe, having passed the USA in 2006. The Field-Weighted Citation Impact of USA publications is still about 50% higher than China’s.  …
  • At the graduate level, AI has rapidly become the most popular specialization among computer science PhD students in North America, with over twice as many students as the second most popular specialization (security/information assurance). In 2018, over 21% of graduating Computer Science PhDs specialized in Artificial Intelligence/Machine Learning.
  • Diversifying AI faculty along gender lines has not shown great progress, with women comprising less than 20% of the new faculty hires in 2018.  Similarly, the share of female AI PhD recipients has remained virtually constant at 20% since 2010 in the US.
  • There is a significant increase in AI-related legislation in congressional records, committee reports, and legislative transcripts around the world.

I like the scope (everything!) and the summary numbers seem plausible. I’d want to poke at things like:

  • 58% of large companies surveyed report adopting AI in at least one function or business unit in 2019, up from 47% in 2018.

because it’s clear from other work that, esp for hypey tech, companies report a lot of meaningless stuff. I mean, what does it even mean to “adopt AI in at least one function or business unit”?

I was at a supermarket and they had a “spills monitoring robot”:

This seems to count as “adopting AI in at least one function or business unit” though, personally, I’d not want to count it as a meaningful adoption. It’s clearly a gimmick.

Alas, the report merely summarizes the earlier survey, so you still have to drill down to another report to figure out what’s going on.

Still, used with appropriate caution, this seems useful.

SQL Murder Mystery

This was kinda fun:

There’s been a Murder in SQL City! The SQL Murder Mystery is designed to be both a self-directed lesson to learn SQL concepts and commands and a fun game for experienced SQL users to solve an intriguing crime.

I wouldn’t say it was all that intriguing, but it was fun enough and perhaps not wholly unrealistic. I’d have preferred some sort of staged interface (a la a Jupyter notebook…I kept a scratchpad) so I could keep track of what I did in a nice way. But it’s otherwise pretty slick and pleasant.

It uses components from the very interesting Select Star SQL online text. I’ve not evaluated that text fully, but at least the running example isn’t bank accounts!
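The staged interface I was wishing for could be as small as the following sketch: a wrapper that runs each query and remembers what I asked and what came back. It assumes Python's stdlib sqlite3 and any SQLite database; the `Scratchpad` class and its method names are hypothetical, and nothing here is part of the game itself.

```python
import sqlite3


class Scratchpad:
    """Run queries against a SQLite connection and remember each step."""

    def __init__(self, connection):
        self.connection = connection
        self.steps = []  # (note, sql, rows), in the order they were run

    def run(self, note, sql, params=()):
        """Execute one query, record it with a note, and return the rows."""
        rows = self.connection.execute(sql, params).fetchall()
        self.steps.append((note, sql, rows))
        return rows

    def transcript(self):
        """Render the investigation so far as numbered plain-text steps."""
        lines = []
        for number, (note, sql, rows) in enumerate(self.steps, start=1):
            lines.append(f"{number}. {note}")
            lines.append(f"   {sql}")
            lines.append(f"   -> {len(rows)} row(s)")
        return "\n".join(lines)
```

Each call to `run` takes a note about what you were trying to find out, so `transcript()` reads back as a record of the investigation rather than a pile of queries.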

Spreadsheet Errors Deep and Shallow

Spreadsheets are almost certainly the most common form of end user programming. And they are both cool (enabling amazing systems to be built by all sorts of people) and dreadful (those systems are even more buggy than we fear). There’s lots of cool research on them…even inside Microsoft!

But I’ll tell you right now that the thing that fucking kills me with Excel for Mac is that scrolling is just the worst. It’s slow, buggy, and just plain hard to use. I switched to Numbers for dealing with a spreadsheet of less than 10 columns and less than 200 rows just because the scrolling was KILLING ME LOUDLY…

…of course, Numbers doesn’t let you delete specific cells…only whole rows why for the love of every sort of sanity on this good green earth WHY?!?!?

It just boggles me that these simple things are so broken. For generations of versions.

A Bit of Tab Clean Up

My phone has a lot of tabs. These don’t seem to hurt performance the way they do on the desktop but who knows? Gosh some of these have lingered forever:

  • Negativity scene. I find it mildly hilarious and very War on Xmasy, but it’s time to close this tab!
  • RealLab. This seems like a good idea, but I’m unclear if there’s anything for me to do about it. The projects seem…odd.
  • Dwight Davison tiles. Zoe and I would really like some cool tile work for our atrium and bathroom esp the floors. But floor tiles have to be tough and all the pretty tiles aren’t tough. So we have to go with wall tiles for the awesome. These are pretty awesome (look at the bees!) but very expensive.
  • more-itertools. These are pretty neat. I think I needed to do some grouping? I’d love to see something like this rolled into the standard lib.
  • An introduction to auction theory. Seems like a pretty good overview but I had it up for a student who’s long gone! I don’t think it’s uniquely good such that I must hoard it in a tab forever.
  • Law of Demeter paper. I must have been considering it for the software engineering class? I’m pretty sure this wouldn’t be a helpful read for them and I don’t discuss Demeter anyway! (Maybe I should?)
  • TinderBox. For years I’ve chatted about “Bijan’s Fantasy Information Management Tool (BFIMT)”. This probably isn’t it but seems worth a peek.

I killed a bunch of others and some I will sadly keep. Well over 100 to go!
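On the more-itertools grouping itch: I no longer remember exactly what I needed, so this is only a stdlib sketch of the two kinds of grouping I tend to want, fixed-size chunks and bucketing by a key. The helper names `chunks` and `group_by` are mine (hypothetical), not the library's API; more-itertools ships its own, more polished, takes on this sort of thing.

```python
from collections import defaultdict
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items (hypothetical helper)."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def group_by(iterable, key):
    """Bucket items by key(item), keeping first-seen key order (hypothetical helper)."""
    groups = defaultdict(list)
    for item in iterable:
        groups[key(item)].append(item)
    return dict(groups)
```

Unlike `itertools.groupby`, this `group_by` doesn't need the input sorted by key first, at the cost of holding everything in memory.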

Some Query Links

I have a buncha tabs about SQL and query optimizations. So a bit of clean up:

Lots more tabs to be cleaned.

All About A*

Amit Patel’s multi part discussion of A* (esp in games) is a very nice read. Even if you just read the first page, you’ll get a clear picture of various pathfinding algorithms and their trade offs. Some later bits aren’t quite as nice to follow (e.g., the trade offs between terms in the heuristic could be fruitfully visually indicated), but overall, it’s great.

It’s part of a series teaching math and computer science via games. Not necessarily wildly original in concept (cf AI: A Modern Approach), but here execution is everything. Check out the discussion of pathfinding for a tower defence game.
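For a concrete feel, here is a minimal sketch of A* on a small grid. It is my own toy version, not code from the series, and it assumes 4-connected movement, unit step costs, walls marked with '#', and a Manhattan-distance heuristic; the function name `a_star` and the grid encoding are illustrative choices.

```python
import heapq


def a_star(grid, start, goal):
    """Shortest 4-connected path on a grid of strings; '#' cells are walls.

    Returns the path as a list of (row, col) cells, or None if unreachable.
    """
    def heuristic(cell):
        # Manhattan distance never overestimates with unit-cost 4-way moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]  # entries are (g + h, g, cell)
    came_from = {start: None}
    cost_so_far = {start: 0}

    while frontier:
        _, cost, current = heapq.heappop(frontier)
        if current == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        if cost > cost_so_far[current]:
            continue  # stale queue entry; a cheaper route was already found
        row, col = current
        for nxt in ((row + 1, col), (row - 1, col), (row, col + 1), (row, col - 1)):
            r, c = nxt
            if not (0 <= r < len(grid) and 0 <= c < len(grid[r])):
                continue  # off the map
            if grid[r][c] == "#":
                continue  # wall
            new_cost = cost + 1
            if new_cost < cost_so_far.get(nxt, float("inf")):
                cost_so_far[nxt] = new_cost
                came_from[nxt] = current
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None
```

Swap the heuristic for one that always returns 0 and this becomes Dijkstra’s algorithm, which is a handy way to see what the heuristic term is buying you.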

Some Good Reads

I really am trying to clean out my tabs and writing something in depth on each isn’t always cutting it. So here’s some quick hits (the link text generally isn’t the title):

  • Burying NoSQL for consistency failures. Essentially the argument is giving up consistency for availability (cf CAP theorem) is a bad move due to increased application complexity AND that many “NewSQL” systems aren’t consistent for a subtle implementation reason.
  • A beautiful performance study of grep tools by the author of ripgrep. Clear, fairly comprehensive, appropriately modest, it seems totally publishable to me. I learned a lot reading it and enjoyed doing so.
  • “Systems programming” != “low level programming” has a nice history of the term and concept. It’d be good to get an analysis of how the phrase “programming in the large” got in.
  • You should read all of Dan Luu, but you could do worse than starting with his “Hardware is Unforgiving”.

Four down, hundreds to go.

PythonTeX

I went down the rabbit hole trying to code up some simple class data analysis tools. I have a custom script for exam results which uses Jinja2 as a template language and generates a Markdown doc. It’s alright, but I figured there must be something ready made that wasn’t a “note book” and could generate documents and reveal.js slides.

Nope. Not so I could use.

I had knitpy in a tab for forever so I thought I’d give it a spin. It seems both moribund and broken for all my struggling could demonstrate.

I poked at Pweave. I don’t know if I just lost the will to live or I really couldn’t get it working. Either way I was spending more time looking at the tool than pulling in and massaging the data, so I gave up.

Along the way I came across a paper on PythonTeX which has some interesting arguments:

PythonTeX makes possible reproducible LaTeX documents in which Python, Ruby, and Julia code is embedded. This can mitigate the potential for copy-and-paste errors, simplify the creation of figures, and allow significant portions of documents to be generated automatically. Built-in utilities for dependency tracking reduce the need for makefiles or similar systems. Tight LaTeX integration allows code to adapt to its context in a document. User-defined sessions that run in parallel provide high performance. Synchronization of errors and warnings with document line numbers mean that writing and coding remain efficient. Finally, the depythontex utility ensures that PythonTeX may be used even when plain LaTeX documents are needed.

In particular, they argue that tight integration with the host language is an advantage as you can more easily pass data back and forth:

  • As a LaTeX package, PythonTeX allows users to write valid LaTeX documents in which the full power of LaTeX is immediately accessible, rather than hybrid documents that contain LaTeX markup. Unlike in IPython and Pweave, it is simple and straightforward to pass information such as page dimensions from LaTeX into Python. It is even possible to create LaTeX macros that mix LaTeX with other languages.

I think it may be possible to abstract some of that out. I don’t see a strong need for super tight integration to get most of this. But who knows? It’s worth exploring.
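For context, the exam script that started all this does roughly the following. This is a sketch only: my real script uses Jinja2, whereas here the stdlib `string.Template` stands in for the template language, and the function name, field names, and report layout are all hypothetical.

```python
from statistics import mean, median
from string import Template

# Stand-in for a Jinja2 template; the placeholder names are hypothetical.
REPORT = Template("""\
# $title

- Students: $count
- Mean: $mean
- Median: $median

| Student | Mark |
|---|---|
$rows
""")


def render_report(title, marks):
    """Render a Markdown summary from a {student: mark} mapping."""
    rows = "\n".join(
        f"| {name} | {mark} |" for name, mark in sorted(marks.items())
    )
    values = list(marks.values())
    return REPORT.substitute(
        title=title,
        count=len(values),
        mean=f"{mean(values):.1f}",
        median=f"{median(values):.1f}",
        rows=rows,
    )
```

The data massaging stays in plain Python and the template only does layout, which is about as loose as the integration can get.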

A Survey of Online Coding Tutorials

The paper, “A Pedagogical Analysis of Online Coding Tutorials”, provides an analytical review of a sample of online coding tutorials. One of my project students did something similar (they should have published!). The analytical framework is useful but not surprising: they have a set of types (interactive tutorials, MOOCs, web references, etc.) and “nine groups of 24 [analytical] dimensions” including content, organisation, and context. It all seems sensible, though I’m a bit leery. It seems almost too sensible. There’s no empirical work on actual effects (completion, satisfaction, and learning). It’s super tempting to think we can extrapolate from this beautiful and tempting set of features to these effects. Consider their key conclusion:

Our results suggest that most online coding tutorials are still immature and do not yet achieve many key principles in learning sciences. Future research and commercial development needs to better emphasize personalized support and precise, contextualized feedback and explore ways of explaining to learners why and when to use particular coding concepts. Based on our sampled tutorials, we recommend that teachers be very selective in their use of materials, focusing on the more evidence-based tutorials, particularly the educational games. All educational games in the list provide hierarchical structure, immediate feedback, and opportunities that learners actively write code and use subsequent knowledge for coding throughout the tutorial.

But they’re games. What kind of learners are responding to them? Are students responding to them? No coding game that I know of has bubbled up in the popular consciousness or trade press the way say Khan Academy or MOOCs in general have. That doesn’t mean they aren’t educationally superior, but it needs some explanation.

Overall, however, it seems like a solid, worthwhile paper and a good and necessary starting point. Someone needs to do this sort of work and we need more of it.

It’s also the sort of work that needs a dynamic, ongoing database that’s kept up to date with periodic snapshot papers. One off papers get stale quickly! But it’s eminently replicable, so…have at it!

Eno: Another Data Markup Language

Eno seems interesting. I don’t know that it’s interesting enough to exist over existing formats, but it’s in the YAML/Config format/TOML family and sort of a cousin to Markdown/Setext/Wikiformat (as all these data formats are).

It doesn’t seem to have a spec per se and the documentation is silent about primitive datatypes, which is annoying. Looking at the Python library’s documentation suggests that it uses an on demand coercion approach…basically, you coerce an element when you retrieve it from the parsed structure:

>>> from enopy import loaders, parse

>>> doc = parse("""
...   publish: yes
...   location: 36.987094, -25.091719
...   contact: contact@faulty
... """)

>>> doc.boolean('publish')
True
  
>>> doc.lat_lng('location')
{ 'lat':  36.987094, 'lng': -25.091719 }
  
>>> doc.email('contact')
ValidationError: 'contact' must contain a valid email address, for instance 'jane.doe@eno-lang.org'.

I can’t say I’m a big fan of this API. It seems really dreadful to have the field name as a string inside a coercing function call on the document which magically gets the right “loaders”? I mean, you have to specify the type of everything including built in structured types? Yeek.

Ok there does seem to be a generic interface:

>>> document = enopy.parse(input)

>>> document.element('title')
<class Field name="title" value="Artfest 2018">
>>> document.element('tags')
<class List name="tags" items=2>
>>> document.element('content')
<class Section name="content" items=14>

This feels like a step back from e.g. JSON. And this documentation hurts me!

Still, it’s kinda interesting and one could always layer a different API on top.
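Here is a rough sketch of what such a layer might look like: declare the types once in a schema, then read fields as plain attributes that get coerced on access. It deliberately doesn’t touch enopy at all. It assumes you have already pulled the raw string values out into a dict, and `TypedDoc`, `boolean`, and `lat_lng` are all hypothetical names with my own guesses at the coercion rules.

```python
def boolean(text):
    """Coerce 'yes'/'no' style text to a bool (my guess at sensible rules)."""
    value = text.strip().lower()
    if value in ("yes", "true"):
        return True
    if value in ("no", "false"):
        return False
    raise ValueError(f"expected yes/no, got {text!r}")


def lat_lng(text):
    """Coerce 'lat, lng' text to a dict of two floats."""
    parts = text.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected 'lat, lng', got {text!r}")
    lat, lng = (float(part) for part in parts)
    return {"lat": lat, "lng": lng}


class TypedDoc:
    """Hypothetical typed view over a flat mapping of raw string fields."""

    def __init__(self, raw, schema):
        self._raw = raw        # e.g. {"publish": "yes"}, however you parsed it
        self._schema = schema  # field name -> coercion function

    def __getattr__(self, name):
        # Only reached for names not found normally, i.e. document fields.
        if name.startswith("_") or name not in self._raw:
            raise AttributeError(name)
        coerce = self._schema.get(name, str)  # undeclared fields stay strings
        try:
            return coerce(self._raw[name])
        except ValueError as err:
            raise ValueError(f"field {name!r}: {err}") from err
```

With a schema like `{"publish": boolean, "location": lat_lng}`, reading `doc.publish` or `doc.location` does the coercion without naming the type at every call site, and validation errors still surface only when a field is actually read.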