Music as meditation

The benefits of meditation for well-being are widely studied. It seems natural to me that music should have many of the same benefits, both when listening to it and when making it, if done with the kind of single-minded absorption that characterizes meditation. Whereas in meditation the focus is on the self, in music the point of focus is a set of musical ideas, or the succession in time of individual musical moments.

Meditation opens the door to meta-thinking: it teaches one to clear one’s mind, and in doing so makes it easier to realize that one is constantly thinking. It’s easy to just live life from one thought to the next without exercising much control over that flow, never (or not often enough) stopping to ask why, or doing any other kind of meta-thinking.

Music probably opens up its own set of ideas, those motivated by the underlying structures (elements from music theory that are perceived aurally and enjoyed or just noticed). Like a smell that evokes a memory, music primes parts of our brain to be stimulated more easily later, and thus used in other contexts.

The random seller

A short story set in the future about a seller of randomness. Randomness is highly valued because it’s used for any useful work, and its production is heavily regulated.

People get together to try to produce randomness by throwing dice as it’s the best free way available.

“I need to go buy something random”.

Drafts

I’ve got plenty of drafts that I never finish, so I thought I’d just write something and post it right away for a change.

I guess my draft problem is an editing problem. I never sit down to edit — it’s not something that comes naturally to me. I like to write down quick ideas, I guess because writing something down gets me a quick fix. Polishing — my writing, a piece of code, my shoes — has never been my forte. It feels like actual work.

Just after you write something you get to feel sort of OK about all the mistakes that are in it. “It’s just a quick draft, of course it’s clunky”. Excuses like this grow thinner once you’ve actually made an effort at something.

Two feature requests for Google Keep

I wrote this a few months ago, but never got round to editing and publishing. Now I found myself on a long flight with some time to spare, so here goes.

A rant

Google Keep is frustrating — on Chrome sometimes I go to the Keep tab I keep pinned, and it turns out that I’m in a view that doesn’t let me take any single action to get to a “create note” flow. Usually it happens when I’m scrolled somewhere into my stack of notes, or when I have a note I’m already working on. This feels, frankly, very cumbersome to me. With the disclaimer that I don’t know much about UI/UX, and great people work at Google in these fields: I think this is probably not the right UI for this tool. When I get to Keep I’m usually trying to get what I want quickly, and that is one of two actions:

  • Find (to read or edit) a note I’ve written.
  • Write a new note.

Usually I’m in a rush: I’m in the middle of something and I need a note I’ve written in the past to finish the task; or I have just thought of something or found something and I want to write it down lest I forget (a number, a to-do, a random thought — I’m a very forgetful person). I want Keep to be the tool that I can use for organizing my life. It shouldn’t get in the way.

Unfortunately, Keep fails at both flows listed above in at least some ways. First, what I’ve already mentioned — in several contexts, there’s no simple way to get to one of the actions I want to take. I’d rather not deal with extra actions like scrolling and searching for the ‘Take a note’ text area, or pressing escape to exit ‘Search’ when I’ve left Keep on the Search screen, or clicking on different UI elements to do the same. I think ‘Search’ and ‘Write’ should always be available, in every view, in a consistent way. Different always-present buttons (say a ‘+’ to compose, like Gmail does?) seem superior to me.

On top of that, the mobile and web versions are inconsistent; whereas in web ‘Take a note’ eventually shows up at the top of the UI, on mobile it’s at the bottom. They should work the same way in this respect. Perhaps mobile is closer to the right UI here, as at least ‘Search’ and ‘Take a note’ are in clearly distinct places in the UI.

Details all the way down

I say all this not to shame the Keep designers and engineers, who I’m sure are brilliant. Designing UIs and writing software is hard.

Let’s assume for a minute that my gripes are representative of a significant fraction of the user base; if it’s not mine that are representative, there are others like them. The designers and engineers pretty likely have a good feeling for what’s wrong with their app, and in the back of their mind they dream of the time when they will be able to just go ahead and fix it. But the moment keeps slipping — doing anything is hard, harder than you think, and any UX or architectural change probably takes them not just 10x but perhaps 100x of the effort they feel it should take. I know this is how I feel at my own job, at least.

Most software fixes are easy conceptually, but hard in practice. There’s details everywhere that need to be taken care of; details all the way down. Change something into something else. Fix the callers. Fix the tests. Perhaps you need a refactor — that’s fine, our tools nowadays can help there. But they can’t help with writing most of the actual new logic. Or with shipping a larger scale change (something that changes the server architecture) safely. Some day they may — if that ever happens, programmers will be able to focus on different things. And software may be noticeably better.

The wild part

Now for the wild part: I’m not an expert in tooling, but ML-enabled advances in software development tools could make some or most of the steps involved in shipping software changes automatable, or at least assisted. I’m sure researchers are working on some aspects of the hypothetical toolchain that will get us there, and thinking about the others. I don’t have any particular insight into the problem, but I wanted to think a bit about how some of the steps in this process could work, and what it would mean. What it would feel like.

Some wishful thinking: what would happen if ML could tackle parts of what we currently call programming? First of all, programming would probably be redefined, as usually happens. It may end up being redefined many times as ML iterates and encroaches on this field of human thought and activity, the same as elsewhere. Eventually programming may cease to be about C++ or Java, and become more and more about reporting the right bugs (in the right format) and sending the right feature requests to some fancy reinforcement learning coder-agent. It will then do the “mindless” part: write the new method, fix the tests, submit the change and shepherd it through the release process all the way to production. Perhaps even monitor how well it works once it ships. It won’t do everything in the previous list right away, of course, but even if it only helped here and there it would add value; these may turn out to be iterative improvements that happen as we progress in this field. I’m not sure about anybody else, but I’m sort of looking forward to all of this, honestly.

What could a next-generation code-writing assistant look like? One idea might be to augment test-driven development: you write the function signature, then you write tests for it, and of course assert what valid input looks like. Sound familiar? Expected outputs in unit tests look like a kind of labelling. A generative model (similar to GPT-2) could presumably be trained on a huge amount of code, and potentially learn the utterances that are most likely to yield the expected output. A programmer could then look at failed solutions and give feedback on high-level issues to be fixed, or mutations to be tried. For example: indicate that more code involving a certain variable should be written (one that the programmer can see needs to be transformed in some way). Or, perhaps, add some intermediate logic that the programmer knows should happen eventually: do something for every element in an array, or define a variable with some descriptive name, as a hint that leads the model along the right path.
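To make the idea above a bit more concrete, here is a toy sketch in Python. Everything in it is hypothetical: `LabelledExample`, `passes_all`, `pick_candidate`, and the hand-written `candidates` list, which merely stands in for samples drawn from a generative model. It only shows the selection loop, where unit tests act as labels and the first candidate that reproduces every expected output wins.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class LabelledExample:
    """One unit test seen as a label: call arguments plus expected output."""
    args: Tuple[Any, ...]
    expected: Any


def passes_all(candidate: Callable[..., Any],
               examples: Sequence[LabelledExample]) -> bool:
    """Return True only if the candidate matches every expected output."""
    for example in examples:
        try:
            if candidate(*example.args) != example.expected:
                return False
        except Exception:
            # A candidate that crashes counts as a failed solution.
            return False
    return True


def pick_candidate(candidates: Sequence[Callable[..., Any]],
                   examples: Sequence[LabelledExample]) -> Optional[Callable[..., Any]]:
    """Return the first candidate that passes all tests, or None."""
    for candidate in candidates:
        if passes_all(candidate, examples):
            return candidate
    return None


# The programmer writes tests for a hypothetical `total(xs)` function...
examples = [
    LabelledExample(([],), 0),
    LabelledExample(([1, 2, 3],), 6),
    LabelledExample(([5],), 5),
]

# ...and a model would propose implementations. These are hand-written stand-ins.
candidates = [
    lambda xs: len(xs),  # wrong: counts the elements instead of adding them
    lambda xs: xs[0],    # wrong: crashes on the empty list
    lambda xs: sum(xs),  # right
]

chosen = pick_candidate(candidates, examples)
print("found a passing candidate:", chosen is not None)
```

The feedback loop described above (hints, mutations, descriptive variable names) would sit around `pick_candidate`: when it returns `None`, the programmer would inspect the failures and steer the next batch of candidates.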

Anyway, I’ve added a to-do to my Google Keep list to investigate what the researchers are up to in ML-based code-generation/change assistance. As usual I’m writing mostly naively, and these ideas are very likely very old. But I find writing to be a good way to realize what I’m interested in — what I clearly don’t know, but would like to know.

Flying

I’m really enjoying “Hands-On Machine Learning with Scikit-Learn and Tensorflow” by Aurélien Géron. It doesn’t sound like a page turner immediately, I know, but I’ve been having great fun just reading it cover to cover on this long flight. I needed a book that gave me a high level overview of the whole field of Machine Learning, and this is it. It was recommended in the Machine Learning podcast I listened to a few months ago.

The first part of the book covers basics and Scikit-Learn — no deep learning in it until page 230. I had heard in several places that it was not a good idea to skip to deep learning even if you think you’re going to end up using deep learning for your models (I think Andrew Ng also mentions this many times), and I can see why; there’s many interesting “shallow” algorithms, and this book covers interesting theory while discussing them. Scikit-Learn also provides a lot of useful goodies that are likely to be used even if you’re mostly using Tensorflow: utility functions, and of course simpler ways of getting shallow models working. I particularly liked the way in which Scikit-Learn lets you set up “pipelines” of transformations and trainers. Finally, Scikit-Learn has great support for decision trees — and it turns out that decision trees are state of the art for many problems, in practice, and have the advantage of yielding explainable (“white box”) models, so there’s that too. I read that Tensorflow supports the Scikit-Learn API, but at this point I’m not sure what that means and I can’t check as I have no internet connection currently on this flight I’m on. I hope you are able to train the whole range of shallow models through it, straight in Tensorflow, as it’d be awkward/annoying to have to set up different systems to train shallow and deep models.

Anyway, I’m now officially in the Tensorflow part of the book and I’m also happy about that. At work I just got to the stage in which I am ready to actually go out and train a model for my first ML-related project, so reading more about Tensorflow in preparation for that has been an exciting way of spending the long flight. Some of it I had already used, but reviewing is how I learn. I’m using the first edition of the book, not the fancy new one that’s about to come out and covers Tensorflow 2, but I think I made the right call by getting the “old” one (from 2016) instead of waiting for the new edition that I knew was coming out. Sure, some parts are likely outdated (it mentions Tensorflow 1.4 as experimental), but the book is working well for me as it is. The background I’m getting should come in handy for my project; this information wouldn’t have been as useful if it had come to me in six months. I’ll probably use Tensorflow 1.7 for my first TF project anyway, so there’s that too. Having said that, I like the book enough that I may get the second edition depending on the reviews it gets (and exactly what changes in it). Reading the updated version would be yet another way of reviewing.

9900 hours to go

Whoa, there go 15 days without posting. It’s funny, how many times have I run into random abandoned blogs where the last few posts begin like this post? I wonder if I’ll quit anytime soon. I don’t feel like quitting really, I’ve just been busy with other things. Let’s wait and see :)

I wrote several things that I thought of posting, then didn’t because I felt they needed work. It’s funny — I don’t really have any regular readers, I think, so it doesn’t make an immediate difference whether I post what I write straight away (and perhaps edit it live) or postpone publishing until I’ve “polished” it (which realistically may never happen).

I feel like I mostly write for archive.org; what I write could end up persisting there, and might be read many years from now. This train of thought led me to write one of the pieces I mentioned. I will probably clean it up a bit and post it after finishing this entry (but then again, maybe not).

Going back to the 15 day long hiatus: I’ve been busy with several things related to Machine Learning — and, to put it succinctly, that makes me happy.

I have this fuzzy long-term plan to learn it well, and a more tactical approach that consists of just keeping an eye out for reasonable opportunities to apply what I’ve learnt hands-on. Now I’ve finally found a project at work that could benefit from ML, so I’ve put aside some of my time to experiment with it. Everything is taking long, as is usually the case with programming (for me, anyway), but I really enjoy the process so for once I don’t really mind. It’s such a nice change; I don’t really feel this way about my day job, usually. I don’t think I’ve felt this way about something technical since I was in university.

I’ve also read papers that I found stimulating:

I’ve probably spent 35h over the last three weeks doing ML, and if I remember correctly I’ve enjoyed all of it. Overall I reckon I’ve probably put 100 hours into learning ML in all formats (Coursera and podcasts included) since I started.

Going by the now-contested 10,000 hours rule of thumb, I have 9900 hours to go.

Strange Finger

PKD generated this, IIRC as a failed prompt (I gave it a prompt and it seemed to ignore it). I liked it enough to post it.

I might do this from time to time, and use the tag ghostwritten for such texts. This one has been edited lightly; I tried not to go too far and remove all the weirdness in it which I think is partly what makes it special/funny/whatever.

A strange hand. The hand was held in a neat, smooth position.

A strange finger.

I pulled up the lid. The hand was gone. I removed it. I opened the door.

The cellar was dark. A soft autumn morning fell on all sides.

I stopped at the door.

I heard the door open. I pushed the door into place. The cellar was silent. The door closed behind it.

I could hear a faint groan. I put the hand in the door lock.

The door opened in a sudden burst of wind. A soft wind blew in from the sky, blowing gently over the gardens.

“How’s it going?” I asked.

“Pretty good.”

GPT-2 and Philip K. Dick

Since my earlier take on GPT-2 there’s been a steady trickle of articles about experimenting with it — well, I should say GPT-2 small, as GPT-2 hasn’t been released in full as we know. I wonder how much better the current batch of amateur experiments based on GPT-2 would be if they were based on the full model. Anyway.

The past weekend I read these two related articles: gwern’s experiment with generating poetry with GPT-2 and slatestarcodex’s coverage of the same. Inspired by them I decided to go ahead and try my hand at some re-training of my own. I chose corpora that were interesting to me, albeit small — so the risk of overfitting is big:

  1. All of Philip K. Dick’s novels in Project Gutenberg (14), concatenated. I’ll call the resulting model “PKD”.
  2. All of Jane Austen’s from the same source (7), concatenated. I’ll call this “JA”, and probably cover it in a different post.

I count them both among my favourite writers, although clearly they are very different. I trained models on both corpora until cross-entropy was <1. It really didn’t take long — only about 600 iterations as counted by GPT-2’s train.py, around 12 hours of training on CPUs (GPT-2 doesn’t fit in my puny GPU), which seems awfully quick. I’m pretty sure this all means the model is overfitted, but of course I had to give it a try anyway. These are some lightly cherry picked results I got, but otherwise representative examples (say, 75th percentile quality wise, instead of the 50th percentile that you’d expect from a perfectly uniform sampling?).
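A side note on that stopping criterion: cross-entropy here is the average negative log of the probability the model assigned to the token that actually came next. The sketch below only illustrates that definition with made-up probabilities; `cross_entropy` and `perplexity` are hypothetical helper names, and the assumption that train.py reports the loss in nats (natural log) is mine.

```python
import math
from typing import Sequence


def cross_entropy(probs_of_true_token: Sequence[float]) -> float:
    """Mean negative log-probability (in nats) given to the true next tokens."""
    return -sum(math.log(p) for p in probs_of_true_token) / len(probs_of_true_token)


def perplexity(loss_in_nats: float) -> float:
    """Effective number of equally likely choices implied by a loss value."""
    return math.exp(loss_in_nats)


# Made-up probabilities a model might assign to the true next token.
uncertain = [0.05, 0.10, 0.02, 0.20]
confident = [0.60, 0.45, 0.70, 0.35]

for name, probs in [("uncertain", uncertain), ("confident", confident)]:
    loss = cross_entropy(probs)
    print(f"{name}: loss={loss:.2f} nats, perplexity={perplexity(loss):.1f}")
```

Under that assumption, a loss below 1 means a perplexity below e (about 2.7): on average the model is about as unsure as if it were picking between fewer than three equally likely tokens. On a corpus this small, that level of confidence is one more reason to suspect overfitting.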

Unconditional Samples

Some unprompted results first:

Reinhart touched his wristwatch. “Keep moving. Wait until you see something.” He pushed through the tracking screen, into the main section of the ship.

“I see what he’s doing now. He started in a new city called Sherikov’s, about two hundred miles off. We’re sure he’s been silent enough to him to know something was coming. He started up the ramp at the same time.”

“Thank God.” Reinhart snapped the screen off. “He’ll come after you, Commissioner. After all, he’s just trying to get his control.”

The characters mentioned all come from novels in the corpus, and most of the verbs are used somewhere in it — but I didn’t spot anything memorized here, and names and parts of speech are being recombined in ways that sometimes make sense. I didn’t write down at which stage in training this sample was generated, unfortunately, but the model seemed to focus a lot on one or two novels at an earlier stage — likely because it was training on those first? I have written down a to-do to understand how the training samples are being generated.

Nothing but a ruined building. A group of guards standing around him. A City beyond all Terrans—and his possessions. A place far off, a City set in the sky with a certain beauty.

I was impressed by “A place far off, a City set in the sky with a certain beauty.” I checked the corpus and this wasn’t memorized. This is one of the phrases that made me think “I could use that” — something a writing assistant could produce for me, as per the idea in my earlier post.

“No. All she remembers is that he kept a goat.”

“Oh?”

“He kept a goat?”

“I remember he kept a goat. I was an officer in the goat’s fleet. He was a little by the way. When we were younger he kept a goat.”

PKD (the person) had a thing with animals (and paranoia, and mutants, and parts of the female anatomy…). I haven’t read the novel in the corpus where the goat reference comes from, but I feel this managed somehow to capture some of his weirdness with animals.

Major nodded. “That’s what we’re built to carry. The passengers are the very last to go, the passengers, the very last to leave Terra. All this is part of aplan.”

“What are you thinking about?”

“The ship is guided by a living creature called by radioactive breath. But the pilot, none of us knows what it is.”

Nice creepy/suspenseful last phrase here.

“So what’s the matter?” Kramer said.

“The brain is not conscious. It’s only a habit.”

“And what does it matter?” Gross said.

“It’s the same old story, old and very old. But how can we break from the surface? Too old.”

Ominous. As usual, I checked and none of the “ideas” here come straight from the corpus. So PKD seems to be working quite well so far.

Conditional Samples

I couldn’t resist prompting PKD (the model) with one of my favourite lines from PKD (the writer) that are not in this corpus. This comes from “A Scanner Darkly”, which is not yet in the public domain.

Model prompt >>> Stooping down, Bruce picked one of the stubbled blue plants, then placed it in his right shoe, slipping it down out of sight. A present for my friends, he thought, and looked forward inside his mind, where no one could see, to Thanksgiving.

He watched the whole thing from behind the wheel, the engine running through the fog, and the little girl slowly walking toward the car. He looked out the window.

Bruce sat there, silently, for a moment. He remembered his father’s birthday party. He saw his father in the car. He felt sad. He felt sick. He felt miserable. He knew it was just a bad day. He felt horrible. He didn’t even know that. He knew the pain and the guilt that kept him from coming out. He needed something to live for.

He looked back down at the sun. It was cold outside. It was hard to see when his feet were shaking from the wind. He couldn’t even remember the last time he had felt this way. He couldn’t even remember the day he left his family and moved to Ohio. He couldn’t remember how he felt. He couldn’t even remember the day he moved to Ohio.

“Hey, Bruce,” he said to himself. He looked at his father. He looked at his mother, who was holding the baby. He looked at his mother, who was crying. He looked at his father, who had been in the car for more than an hour, trying to calm his tears. He looked at his father, who was still weeping. He looked at his mother, who was crying. He looked at his mother, who was holding the baby.

“Hey, Bruce, I’m sorry,” his mother said to Bruce. “Sorry that I didn’t do my best to help you.”

I think this result only sort of works. The rhythm of the phrases is quite good, even if it of course goes into a loop there at the end. It remembers the name. But it seems to forget about some of the other things that could have been relevant (and, in the novel, are): the plants, and his friends.

It begins to show what I believe is one of the issues with this model: it doesn’t seem to react to “external” prompt elements (defined as names, or things, that aren’t in the “retraining” corpus). For the sake of brevity I’ll skip them here, but I have many examples of ignored prompts with outputs that look exactly like an unconditional sample. It’s like PKD has forgotten about the world outside the new corpus.

I have the impression this is because it’s overfitted, so it tends to take a shortcut from any prompt back into one of its deep grooves, so to speak. But I still don’t understand GPT-2 or many other sophisticated models (I’m more at the stage of implementing linear/logistic regression, solving MNIST in Tensorflow, and reading introductory books), so I’m unsure if it’s that or some deeper issue; I don’t know how the process of retraining on a new corpus actually works yet (another to-do there). I think the knowledge of grammar and the language that GPT-2 brings to the table is still there, but other things are just gone. Hopefully I will get to a point in my ML studies where I can understand this problem better, perhaps in a few months?

Comparing GPT-2 and PKD

To test this hypothesis, anyway, I tried a few prompts in both standard GPT-2 small (which usually does quite well at picking up prompt cues) and PKD. In the following section, PKD goes first and GPT-2 “vanilla” goes second:

Borges

Model prompt >>> My favourite short story of Borges is

that he would do something; you read, you took this book, this universe, and turned it around around.”

“Wait,” my mother said. “What did he say?”

“Nothing.” He watched her struggling, trying to remember. “He said he’d do something.”

“Wait? What? What? What did he say?”

“Nothing.” He watched her silently. “What was he talking about?”

Not much here.

Model prompt >>> My favourite short story of Borges is “The Time-Lost Manuscript,” in which a ship’s captain (Robert Duvall) becomes the villain of a short story, and the book’s author is a character who never really existed. That’s kind of the big story here. I think Borges was a very good, very funny man (and also one of mine). He had a very well received reputation for his wit, his writing. And he would make great characters out of his characters. But sometimes, though, you just don’t believe in the character.

[Crowd applause]

The crowd bit cracked me up. Not sure where it came from — perhaps GPT-2 dreamt this up as a sort of talk show interview? But in any case the effect is funny. Also, “the book’s author is a character who never really existed” actually does sound like Borges or Calvino. So… a hit for GPT-2, overall.

Fan fiction

Model prompt >>> Darth Vader and Luke walk into a bar.

“The vidsender—it doesn’t work.”

“How did you happen to have an A-class leady come along?”

The same old story. The leady after a heavy-set Erickson put a couple of systems back in the center of the street. The story doesn’t fit.

“They say you didn’t build it up, didn’t they?”

“That’s true,” Bob said. “You didn’t build it up up yourselves. We’re going to have to do it alone. And if you do the whole thing again they’ll be surprised.”

Once again, PKD just ignores the prompt as far as I can tell.

Model prompt >>> Darth Vader and Luke walk into a bar.

They’re talking about the latest episode of the hit Star Wars series, “Luke Cage,” which stars Luke Cage and Ben Affleck.

“I thought, ‘There’s no way they’re going to go to jail, and that’s the only way they’re going to get out,’” a source told THR. “But it’s not like the show actually airs anymore. This was a time when Luke Cage was just something that was a part of the story and he was a part of the story. That’s not how it’s going to be.”

Here GPT-2 got the Vader right (it produces a mention of Star Wars), but the Luke wrong. But it’s something; it was at least partially “hearing” me.

Model prompt >>> Darth Vader looked at Harry Potter wistfully.

“Do you think he—”

“The Old Man. I know he won’t. I know it. I’ve seen his every vlog.”

“How many of you have been watching?”

“I can count on one count.” He reached into his coat. “Letters, all right.”

The manila envelope was filling up. The Russians pulled him out of the room, toward the whiteboard. The message plate was on the wall, up the side of the wall.

PKD just packed up and went back to familiar territory.

Model prompt >>> Darth Vader looked at Harry Potter wistfully.

“Darth, I can’t take this from you, you know you’re a very handsome man,” he said. “I need to see some respect for this boy, and the respect that his family has for you.”

Harry Potter looked at Voldemort and said, “I’m just going to put my wand on him so I can take him to Dumbledore and get his name out of his mind, but I’m not going to tell him what to do.”

Harry Potter said, “The truth is, you’re not going to tell me what to do, but you’re going to ask him questions. He’s going to ask questions, he’s going to ask questions about you.”

GPT-2 nailed it. GPT-2 really jumps at the opportunity to write fanfic.

Conclusion

PKD is always all PKD, and little or no GPT-2 — it shows no general knowledge of the world. It’s good for generating unconditional mashups of the corpus I’ve used, sometimes even interesting ones, but cannot react to a wide variety of prompts.

It seems I’ve succeeded at overwriting GPT-2 with PKD, at least in some sense. What I wanted to see from PKD’s output in the previous section was Star Wars fan fiction or a Star Wars-Harry Potter erotic crossover (both of which clearly exist somewhere on the internet), written by PKD (ideally the human)… but I just got the ramblings of PKD (the model) instead. PKD is locked inside a room (a Chinese Room? Perhaps a Russian or Japanese one, in the case of PKD) producing rehashes of his old work instead of hearing me reaching out to him from the future.

PKD could produce a pastiche in which Darth Vader is a bit paranoid and fears his ex-wife, Padme, is collaborating with the Russians. But it doesn’t.

Not yet, anyway.

Update: 2019-03-26

By now I’m pretty sure that there was something wrong with my first model’s training. After updating to the latest version in nshepperd’s github, I started a new training batch and I noticed that 1) it’s significantly slower to train now, and 2) the samples in early iterations read a lot more like GPT-2, whereas before they looked like direct derivations from the “retraining” corpus from the get go.

These two things make me believe that there was a bug in my training process, or a disconnect between what I was passing to train.py and what it actually used (argument processing was one of the things that changed between versions). Could it be that I was training the model from scratch on the new corpus? I need to run some more experiments to see if this would even work with GPT-2; it seems like the training I did was not nearly enough to be able to produce relatively convincing text on its own. In any case, this may all invalidate my complaints about lack of generality above :)

I quickly retrained PKD to only a loss of ~2 and gave it another whirl. I got this:

Model prompt >>> Harry Potter looked at the rocket and said: “This is a failure. Something has gone wrong. What are we going to do?”

Beside him, the others were moving back.

“I’m afraid not,” Gryffindor Potter said. “I doubt it. We might as well go to the other side. After all, this is Terra and Mars.”

“Well,” said the Professor, “we’ll just have to figure out what we’re going to do.”

They were moving away from the bunker, and he could see the huge projectiles moving away from the bunker. The Professor moved toward them.

“There it is,” he murmured. “I suppose we’ll find a way back. But—”

The Professor looked around. “It’s all right, Professor. The whole place is wiped out. The whole town is decimated, destroyed and leveled, with the ruins of a village, the ruins of a town. The only way we can get back is by the use of the Terrans.”

This is a decent enough cross-over. PKD2 latches on to the Harry Potter reference and takes names from it, but thematically this is otherwise close to PKD. So even though the fragment is not particularly interesting to read, I think it’s an improvement in that sense.

RNNs

I read “The Unreasonable Effectiveness of Recurrent Neural Networks” by Andrej Karpathy this weekend. I saw it recommended/linked several times so I added it to my ML to-do/to-read list, and I’ve just gotten to it. Some highlights I copy/pasted into Keep follow:

As you might expect, the sequence regime of operation is much more powerful compared to fixed networks that are doomed from the get-go by a fixed number of computational steps, and hence also much more appealing for those of us who aspire to build more intelligent systems.

Then:

Moreover, as we’ll see in a bit, RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector. This can in programming terms be interpreted as running a fixed program with certain inputs and some internal variables. Viewed this way, RNNs essentially describe programs. In fact, it is known that RNNs are Turing-Complete in the sense that they can to simulate arbitrary programs (with proper weights). If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.

Interesting. Karpathy does add a caveat about not reading too much into this, and I can see how this “universal program approximation” property of RNNs also has other, more indirect ties to “Turing completeness”, in the sense that people sometimes get hung up on Turing completeness when in many cases it just isn’t very relevant: it’s a pretty low bar for a programming language or platform in the day-to-day, and it doesn’t mean much in practice. Still, the fact that RNNs trained character-by-character are able to pick up greater and greater levels of structure seems very promising. I found the visualizations of per-neuron activity very illuminating: Karpathy finds a neuron that “learns” to be “on” when inside a quotation, and another that gets activated as the text gets closer to where a newline would usually appear. This is all structure that a programmer would likely think about and code by hand if they had to write a text generator, and the network is just learning it independently from data.
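The state update described in the quote above can be sketched in a few lines of plain Python. I am assuming the usual vanilla RNN form, new_h = tanh(W_hh·h + W_xh·x); the names `TinyRNN`, `run` and `one_hot`, the sizes, and the random (untrained) weights are all made up for illustration. The point is only that the same fixed function is applied at every step, and that the state carries information from one input to the next.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def one_hot(index: int, size: int) -> Vector:
    """Encode a character index as a vector with a single 1.0."""
    return [1.0 if i == index else 0.0 for i in range(size)]


class TinyRNN:
    """Vanilla RNN cell with random weights (a real one would learn them)."""

    def __init__(self, input_size: int, hidden_size: int, seed: int = 0) -> None:
        rng = random.Random(seed)

        def random_matrix(rows: int, cols: int) -> Matrix:
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.w_xh = random_matrix(hidden_size, input_size)   # input -> state
        self.w_hh = random_matrix(hidden_size, hidden_size)  # state -> state
        self.h: Vector = [0.0] * hidden_size                  # initial state

    def step(self, x: Vector) -> Vector:
        """Combine the input with the current state to get the new state."""
        from_input = matvec(self.w_xh, x)
        from_state = matvec(self.w_hh, self.h)
        self.h = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
        return self.h


def run(text: str, alphabet: str = "abcd") -> Vector:
    """Feed a string character by character and return the final state."""
    rnn = TinyRNN(input_size=len(alphabet), hidden_size=3)
    for ch in text:
        rnn.step(one_hot(alphabet.index(ch), len(alphabet)))
    return rnn.h


print(run("ab"))
print(run("ba"))
```

Feeding the same characters in a different order ends in a different state, which is the sense in which the network is running a small "program" over the sequence rather than computing a fixed function of a single input.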

The article is from 2015, but some people seem to think it’s a bit dated by now — not in its basic approach necessarily, but rather because convolutions have taken over from RNNs/LSTM in many domains. gwern left this comment in Hacker News (I swear I’m not stalking him, he just keeps popping up in the stuff I read):

“If this were written today, Karpathy would have to call it “The Unreasonable Effectiveness of Convolutions”. Since 2015, convolutions, causal or dilated convolutions, and especially convolutions with attention like the Transformer, have made remarkable inroads onto RNN territory and are now SOTA for most (all?) sequence-related tasks. Apparently RNNs just don’t make very good use of that recurrency & hidden memory, and the non-locality & easy optimization of convolutions allow for much better performance through faster training & bigger models. Who knew?”

My current plan is to experiment a bit with RNNs/LSTM and then move on to convolutions.