Beyond JSON

JSON pretty clearly won and won big. This is perhaps inevitable given the dominance of Javascript. And that’s ok! JSON is a pretty decent sort of sexpr, and having both lists and dicts makes it pretty useful for quick externalisation of all sorts of data. The friction, in a typical scripting-derived language, of manipulating JSON in memory is super low. Slinging dicts and lists is something any Python (or Javascript, or…) programmer is going to find congenial.

But it has some infelicities, and I don’t just mean the lack of query and schema languages (which is sorta being addressed). JSON is rather annoying to hand-author and doesn’t seem great for documents and document formats. Or even for hacking existing documents like HTML…if only because there’s no standard reflection of HTML structure into JSON.

There are some moves to improve this situation.

JSON5 tackles the writability. Probably the biggest move is not having to quote (certain) keys in objects. That helps both reading and writing! For reading, there’s a clear visual difference between key strings and “value” strings. For writing, less quoting!!

The other big one is multi-line strings (with the ‘\’ as the continuation character). Having to have a continuation character sucks, but it’s much better than the status quo ante.

Comments are also a good idea! The rest seem minor, but these definitely make a difference.
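To make this concrete, here’s a minimal sketch in Python. I’m assuming the third-party json5 package, and the snippet itself is invented purely for illustration:

# Parse a small JSON5 document; assumes the third-party `json5` package
# (pip install json5). The raw string keeps the backslash+newline literal,
# so the JSON5 parser (not Python) handles the line continuation.
import json5

text = r"""
{
  // comments!
  title: "unquoted keys help both reading and writing",
  blurb: "a long value can be split \
across source lines with a trailing backslash",
}
"""

doc = json5.loads(text)
print(doc["title"])
print(doc["blurb"])  # one logical line; the continuation is removed

Unquoted keys, a comment, a trailing comma, and a continued string: all illegal in plain JSON, all fine in JSON5.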

Mark Notation is aimed at bringing XML-like structuring and extensibility to JSON. It does this by adding a key syntactic (and semantic!) extension, the Mark object, which is a name (think tag/element name), a set of properties (think attributes, but with complex contents), and a list of content objects (think child content). It builds on JSON5, so it has those authoring felicities.

Semantically, Mark objects get mapped into pretty simple Javascript objects. I don’t fully understand this claim:

contents: an ordered list of content objects, which are like child nodes of elements in HTML/XML. Mark utilizes a novel feature of JS that JS object can be array-like. It can store both named properties and indexed properties.

I don’t see why this matters, as you have a special Mark object which has an explicit contents variable. Ah, maybe:

properties: can be accessed through markObj.prop or markObj[‘prop’] when prop is not a proper JS identifier. You can also use JS for … in loop to iterate through the properties. Unlike normal JS array, Mark object has been specially constructed so that Mark contents are not enumerable, thus do not appear in for … in loop.
contents: can be accessed through markObj[index]. You can also use JS for … of loop to iterate through the content items.

So you don’t have to do a field access but can just use special loops. I don’t see that this would be painful in, say, Python even with field accessing. I might default to making Python Mark objects iterable over the contents (on the theory that that’s more “normal”).
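Here’s roughly what I have in mind, as a quick sketch (my own hypothetical Python rendering, not the official Mark API):

# A hypothetical Python-side Mark object: properties read like attributes,
# contents are reachable by index, and iteration defaults to the contents.
class Mark:
    def __init__(self, name, properties=None, contents=None):
        self.name = name                            # think tag/element name
        self.properties = dict(properties or {})    # think attributes
        self.contents = list(contents or [])        # think child content

    def __getattr__(self, key):
        # Fall back to the properties dict for mark.prop style access.
        try:
            return self.properties[key]
        except KeyError:
            raise AttributeError(key) from None

    def __getitem__(self, index):
        return self.contents[index]                 # mark[0], mark[1], ...

    def __iter__(self):
        return iter(self.contents)                  # iterate the contents by default

    def __len__(self):
        return len(self.contents)

para = Mark("p", {"class": "lead"}, ["Hello, ", Mark("em", {}, ["world"]), "!"])
print(para.name, getattr(para, "class"))  # "class" clashes with the Python keyword
for child in para:                        # loops over contents, not properties
    print(child)

Nothing here needs the array-like-object trick; plain attribute access plus default iteration over contents seems congenial enough.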

It would be interesting to compare APIs to see whether this really brings JSON-like ease of programmer use.

And, of course, there’s YAML, which you can think of as JSON++++. (JSON is a subset these days.) It’s designed from the ground up for writability and for capturing complex structures. And that it does. The price is considerably more complexity. Like a ton more. (You can define entity-like things (actually, more like IDs) on the fly! Inline!) It has excellent embedded complex multiline strings (essentially “here-strings”).
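For instance (a small sketch, assuming PyYAML; the document itself is made up to show off anchors and block scalars):

# Anchors/aliases act like inline ID definitions and references; the "|" block
# scalar is essentially an embedded here-string. Assumes PyYAML (pip install pyyaml).
import yaml

text = """
defaults: &base        # define an anchor, roughly an inline ID
  retries: 3
  timeout: 30

service:
  <<: *base            # pull the anchored mapping back in by reference
  timeout: 60          # and override one key

notes: |               # block scalar: a here-string with newlines preserved
  Multi-line text survives
  exactly as written.
"""

doc = yaml.safe_load(text)
print(doc["service"])  # retries inherited from the anchor, timeout overridden to 60
print(doc["notes"])

That expressiveness is lovely to write, but it’s also exactly where the extra complexity creeps in.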

I have to say that it might be easier to grow Mark Notation a bit toward YAML than the reverse. Here-like-strings plus id references go a long way.

 


Some Results On Student Evaluations

Erik posted about student evaluations last week and I took the opportunity to bring up the famous Air Force Academy study (the paper; an accessible blog post about it).

The setup is extraordinary:

Prior to the start of the freshman academic year, students take course placement exams in mathematics, chemistry, and select foreign languages. Scores on these exams are used to place students into the appropriate starting core courses (i.e., remedial math, Calculus I, Calculus II, etc.). Conditional on course placement, the USAFA registrar employs a stratified random assignment algorithm to place students into sections within each course/semester. The algorithm first assigns all female students evenly throughout all offered sections, then places male-recruited athletes, and then assigns all remaining students. Within each group (i.e., female, male athlete, and all remaining males), assignments are random with respect to academic ability and professor. Thus, students throughout their 4 years of study have no ability to choose their professors in required core courses. Faculty members teaching the same course use an identical syllabus and give the same exams during a common testing period. These institutional characteristics assure that there is no self-selection of students into (or out of) courses or toward certain professors.

They focused on math classes because the grades are highly normalised:

The integrity of our results depends on the percentage of points earned in core courses being a consistent measure of relative achievement across students. The manner in which student scores are determined at USAFA, particularly in the Math Department, allows us to rule out potential mechanisms for our results. Math professors grade only a small proportion of their own students’ exams, vastly reducing the ability of “easy” or “hard” grading professors to affect their students’ scores. All math exams are jointly graded by all professors teaching the course during that semester in “grading parties,” where Professor A grades question 1 and Professor B grades question 2 for all students taking the course. These aspects of grading allow us to rule out the possibility that professors have varying grading standards for equal student performance. Hence, our results are likely driven by the manner in which the course is taught by each professor.

In some core courses at USAFA, 5–10 percent of the overall course grade is earned by professor/section-specific quizzes and/or class participation. However, for the period of our study, the introductory calculus course at USAFA did not allow for any professor-specific assignments or quizzes. Thus, potential “bleeding heart” professors had no discretion to boost grades or to keep their students from failing their courses. For this reason, we present results in this study for the introductory calculus course and follow-on courses that require introductory calculus as a prerequisite.

This is really amazing! It’s pretty close to a highly controlled experiment!

They found a pretty strong effect from instructor quality:

The USAFA’s comprehensive core curriculum provides a unique opportunity to test how introductory course professors affect follow-on course achievement free from selection bias. The estimate of Var(λj1) is shown in row 2, column 2 of table 4 and indicates that introductory course professors significantly affect follow-on course achievement. The variance in follow-on course value-added is estimated to be 0.0025 (SD = 0.050). The magnitude of this effect is roughly equivalent to that estimated in the contemporaneous course and indicates that a one standard-deviation change in introductory professor quality results in a 0.05-standard-deviation change in follow-on course achievement.

The striking bit (for me) is their examination of student evaluation and grades:

Next, we examine the relationship between student evaluations of professors and student academic achievement as in Weinberg, Hashimoto, and Fleisher (2009). This analysis gives us a unique opportunity to compare the relationship between value-added models (currently used to measure primary and secondary teacher quality) and student evaluations (currently used to measure postsecondary teacher quality).

…In column 1, results for contemporaneous value-added are positive and statistically significant at the .05 level for scores on all six student evaluation questions. In contrast, results in column 2 for follow-on course value-added show that all six coefficients are negative, with three significant at the .05 level and three significant at the .10 level. Since proposals for teacher merit pay are often based on contemporaneous teacher value-added, we examine rank orders between our professor value-added estimates and student evaluation scores. We compute rank orders of career average student evaluation data for the question, “The instructor’s effectiveness in facilitating my learning in the course was,” by professor… As an illustration, the calculus professor in our sample who ranks dead last in deep learning ranks sixth and seventh best in student evaluations and contemporaneous value-added, respectively.

Our findings show that introductory calculus professors significantly affect student achievement in both the contemporaneous course being taught and the follow-on related curriculum. However, these methodologies yield very different conclusions regarding which professors are measured as high quality, depending on the outcome of interest used. We find that less experienced and less qualified professors produce students who perform significantly better in the contemporaneous course being taught, whereas more experienced and highly qualified professors produce students who perform better in the follow-on related curriculum. Owing to the complexities of the education production function, where both students and faculty engage in optimizing behavior, we can only speculate as to the mechanism by which these effects may operate. Similar to elementary and secondary school teachers, who often have advance knowledge of assessment content in high-stakes testing systems, all professors teaching a given course at USAFA have an advance copy of the exam before it is given. Hence, educators in both settings must choose how much time to allocate to tasks that have great value for raising current scores but may have little value for lasting knowledge.

And the key bit:

Regardless of how these effects may operate, our results show that student evaluations reward professors who increase achievement in the contemporaneous course being taught, not those who increase deep learning. Using our various measures of teacher quality to rank-order teachers leads to profoundly different results. Since many U.S. colleges and universities use student evaluations as a measurement of teaching quality for academic promotion and tenure decisions, this finding draws into question the value and accuracy of this practice.

Now, in the comment thread, people pointed out that this result, while strong, was narrow (restricted to fairly low-level math and “adjacent” courses). These facts are related, of course. By focusing on the data with the fewest confounders, they narrowed the scope but increased the strength. However, there have been subsequent studies. For example, “Evaluating students’ evaluations of professors”:

The empirical analysis is based on data for one enrollment cohort of undergraduate students at Bocconi University, an Italian private institution of tertiary education offering degree programs in economics, management, public policy and law. We select the cohort of the 1998/1999 freshmen because it is the only one available where students were randomly allocated to teaching classes for each of their compulsory courses.
The students entering Bocconi in the 1998/1999 academic year were offered 7 different degree programs but only three of them attracted enough students to require the splitting of lectures into more than one class: Management, Economics and Law&Management. Students in these programs were required to take a fixed sequence of compulsory courses that span over the first two years, a good part of their third year and, in a few cases, also their last year.

The exam questions were also the same for all students (within degree program), regardless of their classes. Specifically, one of the teachers in each course (normally a senior faculty member) acted as a coordinator, making sure that all classes progressed similarly during the term and addressing problems that might have arisen. The coordinator also prepared the exam paper, which was administered to all classes. Grading was delegated to the individual teachers, each of them marking the papers of the students in his/her own class. The coordinator would check that the distributions were similar across classes but grades were not curved, neither across nor within classes.

They also looked at evaluation/grade correlations:

In this section we investigate the relationship between our measures of teaching effectiveness and the evaluations teachers receive from their students. We concentrate on two core items from the evaluation questionnaires, namely overall teaching quality and the overall clarity of the lectures.

The key bit:

Our benchmark class effects are negatively associated with all the items that we consider, suggesting that teachers who are more effective in promoting future performance receive worse evaluations from their students. This relationship is statistically significant for all items (but logistics), and is of sizable magnitude. For example, a one-standard deviation increase in teacher effectiveness reduces the students’ evaluations of overall teaching quality by about 50% of a standard deviation. Such an effect could move a teacher who would otherwise receive a median evaluation down to the 31st percentile of the distribution. Effects of slightly smaller magnitude can be computed for lecturing clarity.

Finally, I did peek at a 2017 meta-analysis which overrides earlier meta-analyses. The abstract:

Student evaluation of teaching (SET) ratings are used to evaluate faculty’s teaching effectiveness based on a widespread belief that students learn more from highly rated professors. The key evidence cited in support of this belief are meta-analyses of multisection studies showing small-to-moderate correlations between SET ratings and student achievement (e.g., Cohen, 1980, 1981; Feldman, 1989). We re-analyzed previously published meta-analyses of the multisection studies and found that their findings were an artifact of small sample sized studies and publication bias. Whereas the small sample sized studies showed large and moderate correlation, the large sample sized studies showed no or only minimal correlation between SET ratings and learning. Our up-to-date meta-analysis of all multisection studies revealed no significant correlations between the SET ratings and learning. These findings suggest that institutions focused on student learning and career success may want to abandon SET ratings as a measure of faculty’s teaching effectiveness.

I mean, it’s brutal:

In combination, our new up-to-date meta-analyses based on nearly 100 multisection studies, as well as our re-analyses of the previous meta-analyses make it clear that the previous reports of “moderate” and “substantial” SET/learning correlations were artifacts of small size study effects. The best evidence − the meta-analyses of SET/learning correlations when prior learning/ability are taken into account − indicates that the SET/learning correlation is zero. Contrary to a multitude of reviews, reports, as well as self-help books aimed at new professors (a few of them quoted above), the simple scatterplots as well as more sophisticated meta-analyses methods indicate that students do not learn more from professors who receive higher SET ratings.

And student evaluations are a substantial part of the Teaching Excellence Framework.

So that’s really bad!

When we throw in gender (and other) biases, it seems clear we have a huge problem.

Worse is Better and Back Again

Richard Gabriel, 1991

I and just about every designer of Common Lisp and CLOS has had extreme exposure to the MIT/Stanford style of design. The essence of this style can be captured by the phrase the right thing. To such a designer it is important to get all of the following characteristics right:

  • Simplicity — the design must be simple, both in implementation and interface. It is more important for the interface to be simple than the implementation.
  • Correctness — the design must be correct in all observable aspects. Incorrectness is simply not allowed.
  • Consistency — the design must not be inconsistent. A design is allowed to be slightly less simple and less complete to avoid inconsistency. Consistency is as important as correctness.
  • Completeness — the design must cover as many important situations as is practical. All reasonably expected cases must be covered. Simplicity is not allowed to overly reduce completeness.

I believe most people would agree that these are good characteristics. I will call the use of this philosophy of design the MIT approach. Common Lisp (with CLOS) and Scheme represent the MIT approach to design and implementation.

The worse-is-better philosophy is only slightly different:

  • Simplicity — the design must be simple, both in implementation and interface. It is more important for the implementation to be simple than the interface. Simplicity is the most important consideration in a design.
  • Correctness — the design must be correct in all observable aspects. It is slightly better to be simple than correct.
  • Consistency — the design must not be overly inconsistent. Consistency can be sacrificed for simplicity in some cases, but it is better to drop those parts of the design that deal with less common circumstances than to introduce either implementational complexity or inconsistency.
  • Completeness — the design must cover as many important situations as is practical. All reasonably expected cases should be covered. Completeness can be sacrificed in favor of any other quality. In fact, completeness must be sacrificed whenever implementation simplicity is jeopardized. Consistency can be sacrificed to achieve completeness if simplicity is retained; especially worthless is consistency of interface.

Early Unix and C are examples of the use of this school of design, and I will call the use of this design strategy the New Jersey approach. I have intentionally caricatured the worse-is-better philosophy to convince you that it is obviously a bad philosophy and that the New Jersey approach is a bad approach.

However, I believe that worse-is-better, even in its strawman form, has better survival characteristics than the-right-thing, and that the New Jersey approach when used for software is a better approach than the MIT approach.

Olin Shivers, 1998

* Preamble: 100% and 80% solutions
----------------------------------
There’s a problem with tool design in the free software and academic
community. The tool designers are usually people who are building tools for
some larger goal. For example, let’s take the case of someone who wants to do
web hacking in Scheme. His Scheme system doesn’t have a sockets interface, so
he sits down and hacks one up for his particular Scheme implementation. Now,
socket API’s are not what this programmer is interested in; he wants to get on
with things and hack the exciting stuff — his real interest is Web services.
So he does a quick 80% job, which is adequate to get him up and running, and
then he’s on to his original goal.

Unfortunately, his quickly-built socket interface isn’t general. It just
covers the bits this particular hacker needed for his applications. So the
next guy that comes along and needs a socket interface can’t use this one.
Not only does it lack coverage, but the deep structure wasn’t thought out well
enough to allow for quality extension. So *he* does his *own* 80%
implementation. Five hackers later, five different, incompatible, ungeneral
implementations had been built. No one can use each other’s code.

The alternate way systems like this end up going over a cliff is that the
initial 80% system gets patched over and over again by subsequent hackers, and
what results is 80% bandaids and 20% structured code. When systems evolve
organically, it’s unsurprising and unavoidable that what one ends up with is a
horrible design — consider the DOS -> Win95 path.

As an alternative to five hackers doing five 80% solutions of the same
problem, we would be better off if each programmer picked a different task,
and really thought it through — a 100% solution. Then each time a programmer
solved a problem, no one else would have to redo the effort. Of course, it’s
true that 100% solutions are significantly harder to design and build than 80%
solutions. But they have one tremendous labor-savings advantage: you don’t
have to constantly reinvent the wheel. The up-front investment buys you
forward progress; you aren’t trapped endlessly reinventing the same awkward
wheel.

But here’s what I’d really like: instead of tweaking regexps, you go do your
own 100% design or two. Because I’d like to use them. If everyone does just
one, then that’s all anyone has to do.

Kevlin Henney, 2017:

A common problem in component frameworks, class libraries, foundation services, and other infrastructure code is that many are designed to be general purpose without reference to concrete applications. This leads to a dizzying array of options and possibilities that are often unused or misused — or just not useful.

Generally, developers work on specific systems; specifically, the quest for unbounded generality rarely serves them well (if at all). The best route to generality is through understanding known, specific examples, focusing on their essence to find an essential common solution. Simplicity through experience rather than generality through guesswork.

Speculative generality accumulates baggage that becomes difficult or impossible to shift, thereby adding to the accidental complexity those in development must face in future.

Although many architects value generality, it should not be unconditional. People do not on the whole pay for — or need — generality: they tend to have a specific situation, and it is a solution to that specific situation that has value.

We can find generality and flexibility in trying to deliver specific solutions, but if we weigh anchor and forget the specifics too soon, we end up adrift in a sea of nebulous possibilities, a world of tricky configuration options, overloaded and overburdened parameter lists, long-winded interfaces, and not-quite-right abstractions. In pursuit of arbitrary flexibility, you can often lose valuable properties — whether intended or accidental — of alternative, simpler designs.

Ok, the last one is a bit more…specific…than the first two. But it’s fun to read it in juxtaposition with them. One way to try to bridge the difference between Henney and Shivers is to note that Shivers is saying that we need more 100% designs and Henney is saying that we need a lot of specific experience to get to a good 100% design. But then the difference becomes stronger…Shivers doesn’t want people to hack up a bunch of 80% solutions, while Henney, roughly, thinks we have to have them before we have a hope of a right 100% one.

My heart is with Shivers, but my head is with Henney.

I think I have some readings and an exam question for next year’s class.

Ah, Grading

Lost track of posting and lots of other things due to the whelm being over but not done. Some of the grading is going OK. Exams should be sortable. I hope to be back on posting track tomorrow.

Blackboard Learn 9.x Fail Encore

Last year, our installation of Blackboard could upload grades from a spreadsheet. So you could grade offline! Which is good, because we want to grade offline, esp. programs. But boo! You couldn’t upload feedback, so even though the feedback was sitting in a column ready to go, we had to cut and paste it in. BOO!

But then, in spring, a service pack made it possible to upload (and download!) feedback. WOO! This is good! I can grade offline! I can use my tools! I can analyze stuff!

Except I now figure out that if I have multiple-question tests, I can’t upload feedback OR MARKS for individual questions in the test. Which, for something like, oh, I don’t know, a FINAL EXAM, is a big deal.

It’s also going to suck for giving feedback. Lots of cutting and pasting in my future.

Software as a service folks of the world, there is a MINIMUM REQUIREMENT on you: Make sure your users can export and import your data. Easily. Very easily. Make it easy, ok? Use freaking XML if you have to. Just make it easy. From day 1. Until day always. For proper bonus points, make sure that simple things can be done simply. But if not that, just make sure we can do it.

Users of SaaS, demand this. DEMAND IT. If they can’t do it, you should worry.

A Cautionary Tale

It’s hard being a PhD student.

Having been one for quite a long time, I can speak quite passionately about it. Being a passionate person entails that I probably will at the drop of a hat.

Of course, lots of the difficulties with being a PhD student are simply a matter of life. I take a special interest because it was a defining condition of so much of my life, and mentoring PhD students is and will be such a condition for the rest of my life. So when I see a massive failure by a PhD student, I’m inclined to overreflect on it.

Kindred Winecoff posted quite a silly critique of Paul Krugman which was picked up by Henry Farrell. Now, Daniel Drezner has a similar, somewhat more nuanced view expressed with rather less vitriol and hyperbole. They share the same basic flaw: a hugely uncharitable misreading of Krugman as saying that the public bears absolutely no responsibility for the massively disastrous Bush and Bush-era policies since it had no influence on them. (I’m risking similar problems by not doing a very close exegesis of any of the articles. Furthermore, my generally pro-Krugman bent generates similar risks as Winecoff’s anti-Krugman bent.)

(The big error in this reading, AFAICT, is to miss the dialectic at several levels. The line Krugman is pushing back against is the one which justifies austerity measures with a massive negative effect on the poor and powerless along with irresponsible giveaways to the rich and powerful. While there are piles of crap justifications, the key one here is that the public is irresponsible and the elites are relatively helpless in the face of massive public irresponsibility. (Think Santelli.) Whatever responsibility the public bears, I trust that it’s pretty obvious that this line is total nonsense, and that’s Krugman’s core point. And, frankly, it’s the interesting point.)

Winecoff is now in a trap of their own making (yes, like Jane Austen, I use the third person plural as a neutral third person singular). They gave a junky critique based on a junky reading and littered it with junky hyperbole, e.g.,

If Greenspan’s “with notably rare exceptions” deserves internet infamy, and it does, then surely Krugman’s less notable exceptions should too.

(Even if the junky reading were correct, these are not remotely comparable. If the junky reading were correct, Krugman would be wrong (this is what Drezner tries, rather crappily afaict, to show). Greenspan is engaged in a kind of amazing and disgusting chutzpah in the service of some rather dangerous hackery.)

When appropriately (and gently!) chastised by Farrell, Winecoff fails to do the sensible thing that many commentators urged him to do: take a moment, reflect, and back down. Instead, Winecoff doubles and trebles down on the silliness. The silliness is at every level, including a classic “I’m leaving the thread now” followed almost immediately by several more comments.

All this is relatively minor in the grand scheme of things: in the midst of an event like this, it’s really hard to turn oneself around. But given the systematic failures exhibited, I wonder if Winecoff is going to learn from it. If I were his supervisor (US: advisor), I would print all these out and go through them carefully. I’d probably focus more on the dialectic issues (e.g., problems with burden of proof, charity, self-awareness, tactics, strategy, etc.). For example, it’s very unclear what Winecoff hopes to get out of the exchange. I’m afraid that bashing Krugman is core, which is really a worthless goal, esp. in this context. An easy win would have been to say, “Ok, let’s put my reading of Krugman aside (I’m not ready to give up on it, but maybe that’s because I really can’t stand him; I have to let that rest for a while) and focus on the more interesting question of how to apportion responsibility for policy.”

This only wins if making the point is more important than making the bash. Which is why it’s a good move regardless of your goal if you are in hostile territory. It sidelines bashback for a while in favor of counterpoint. Given enough point and counterpoint, you might find your own goal moving from bashing to pointmaking. (This is not to say that bashing is worthless. Sometimes it’s very worthwhile indeed. But it needs to work, at the very least.)

As I said, Winecoff isn’t irrecoverable. I had a similar (more heated) exchange with a random PhD student on the web and they turned out just fine and we’re reasonable colleagues (I’m still a bit wary of them, though). Of course, I had a similar (even more heated) exchange which did not resolve favorably. If you find yourself in this circumstance, get as much reality checking as you can. Reflect. Talk to other (possibly critical) people. Don’t necessarily seek out supportive people, but people who will tell you when you’re off the rails. If you determine you have gone off the rails, apologize and retract and learn from the experience. In particular, learn something about your own strengths, weaknesses, and reactions.

Update: You don’t have to be a student to have a major-level fail, as the Synthese scandal shows. The solution to such fails is the same.

However, the action Frances recommends (apologize first) works best in good-faith circumstances. If there’s bad faith or bad blood, admitting fault early can really, really screw you. Asking for time to think about it, or putting up similar disclaimers, can be useful. It really is the case that we fallible people sometimes can’t see the obvious. If you aren’t seeing it, then ask for some time to see it. “Hey folks, I’m seeing a lot of heat from people I generally respect but I’m not getting it. Can we hold things for a bit while I figure out for sure what’s going on?” is a reasonable move.

soul needs savin but i’m too damned shallow

(Thanks to Zoe for introducing me to Red (Elegy) and thus giving me this title.)

For my sins, I’m reading up on Virtual Learning Environments. I’m using Blackboard/Vista/WebCT (and they can’t get the name clear…doesn’t say much for the software) because 1) it’s sorta encouraged at the university level and 2) I’m eLearning Champion for the School of Computer Science. Yes, that’s an actual title. The University came up with it.

There’s a lot of that sort of thing in eLearning, I fear.

I also used Moodle rather shallowly and loathed it. I probably would loathe it less than Blackboard if I switched back, if only for the fact that it’s 1) open source, 2) free, and 3) downloadable. We pay tons of money for Blackboard and if I’m going to have a crappy experience I’d at least like the pleasure of being cheap.

Looking around, what I see is a lot of reinvention, and not just school-specific stuff like grades (look! discussion boards! mail! a blog! a calendar!). Do we really need all this stuff built into a monolithic (even a modular monolithic) system?

The things that don’t seem to be taken into consideration are portability and longevity. These are closely related. One of the things I set as a goal for the school is to support academic portfolios for our students. For example, if a student wants a letter of recommendation from me, it would be really handy to see what they’ve done. Similarly, I remember finding notes from long ago, and old textbooks, etc. None of these systems facilitate learning beyond the rather specific class they are set up for. That rather sucks. Who would turn to this stuff if they were interested in picking up, oh, Calculus later in life? The MIT OpenCourseWare stuff and iTunes U get the lectures out, so that’s nice, I suppose. Maybe there is stuff out there that transcends the individual instance of a course.

Maybe something built on Google Wave (for portability).

Dunno if I’ll do anything about it beyond writing this post. See the title!