Lunduke
The Unlikely Story of UTF-8: The Text Encoding of the Web
Plan 9, Placemats, New Jersey Diners, and last minute ideas
June 22, 2023

If you are reading this on a computer -- of any kind -- odds are good that the words on the screen are all encoded using something called "UTF-8".

UTF-8 (or "Unicode Transformation Format - 8 bit") is, put simply, a format for encoding and storing text -- one which allows for far more text characters than the older "ASCII" encoding (which could only show a total of 95 printable characters).
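To make that a little more concrete, here is a minimal Python sketch (the helper name utf8_hex is made up for illustration) that uses Python's built-in encoder to show how many bytes a few characters take up in UTF-8.  Plain ASCII characters stay at one byte each, while everything else spills over into two, three, or four bytes.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of a string as space-separated hex."""
    return ch.encode("utf-8").hex(" ")


# One ASCII letter, then three characters ASCII has no room for.
for ch in ["A", "é", "€", "😀"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {utf8_hex(ch)}")

# Plain ASCII text is byte-for-byte identical when encoded as UTF-8.
assert "hello".encode("utf-8") == "hello".encode("ascii")
```

That last line is the important bit: any file that was valid ASCII is already valid UTF-8, with no conversion needed.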

And UTF-8 is, quite simply, everywhere.

Nearly every major computer operating system heavily uses UTF-8 for handling text... likewise it is the standard for websites, with close to 100% of all webpages explicitly using UTF-8 for the text on the page.

The source for Wikipedia.  Like most of the web, using UTF-8.

An argument could be made that UTF-8 is one of the most successful and widely adopted standards in all of computer history.

But this almost wasn't the case.

In fact, UTF-8 was created -- at the very last possible moment -- and it was first implemented in a computer system that most people don't even know existed.

X/Open's search for better text encoding

In the early 1990s, text encoding was... an issue.

While solutions for extended character sets (beyond simple ASCII characters) existed, they were less than ideal.  To put it mildly.  The most popular solution, known as UTF-1 (the original "UTF" defined in the ISO 10646 standard), suffered from serious performance issues... and often caused significant problems with software which used plain "ASCII" text (including UNIX file system paths).

Having a character encoding on UNIX systems that could cause problems with UNIX file systems?  Not good.

Obviously a new type of text encoding was needed.

So, in 1992, X/Open (originally known as the "Open Group for UNIX Systems", a consortium of UNIX vendors, including: Sun, HP, AT&T, IBM, and several others) set about the task of selecting a proper text encoding standard to be used across all of the UNIX world.

The proposal that gained the most traction was known as FSS/UTF (aka "File System Safe Universal Character Set Transformation Format").  Rolls off the tongue, right?

This proposal was both faster than the old text encoding standard... and, as the name suggests, it was "File System Safe".  Which was a big win.
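What does "File System Safe" look like in practice?  The sketch below is only an illustration, and it uses today's UTF-8 as implemented by Python (not the original X/Open draft, which isn't reproduced here).  The idea it checks: every byte of a non-ASCII character has its high bit set, so bytes like "/" (0x2F) or NUL (0x00) can only ever show up when they really mean "/" or NUL.  The function names and the example path are hypothetical.

```python
def high_bit_only(text: str) -> bool:
    """True if every byte of the UTF-8 encoding is 0x80 or above."""
    return all(b >= 0x80 for b in text.encode("utf-8"))


def all_non_ascii_safe() -> bool:
    """Check every non-ASCII code point (surrogates can't be encoded, so skip them)."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue
        if not high_bit_only(chr(cp)):
            return False
    return True


# A made-up path with non-ASCII names in it.
path = "/home/ken/café/naïve.txt"
raw = path.encode("utf-8")

# Splitting the raw bytes on "/" still finds exactly the real separators.
parts = [p.decode("utf-8") for p in raw.split(b"/")]
print(parts)
print("all non-ASCII code points safe:", all_non_ascii_safe())
```

Old byte-oriented UNIX code that only knows about "/" and NUL can keep working on such a path without understanding the encoding at all.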

Enter: The Plan 9 Nerds

Which brings us to September 2nd, 1992.  Sometime in the early evening.

The X/Open group was meeting, in Austin, Texas, to formally decide on the text encoding standard.

Looking to get some feedback on the proposal, some members of X/Open made a call to two legendary programmers -- Ken Thompson and Rob Pike -- who were working on the Plan 9 Operating System project at Bell Labs in New Jersey.


A little background...

Ken Thompson was one of the creators of MULTICS, UNIX, the B programming language (the predecessor to C), among many other accomplishments.

Rob Pike, also a UNIX programmer, was the co-creator of Blit, writer of multiple UNIX and programming books, and the creator of the first UNIX windowing system.

To call these two "absolute legends" in the world of computing would be, perhaps, a bit of an understatement.  At the time, the two were working together on a research operating system, at Bell Labs, called Plan 9.  An attempt to fix some of the perceived shortcomings of UNIX... by the creators of UNIX, itself.


What happened next... after Ken Thompson and Rob Pike received that phone call?  Luckily, we have a detailed accounting... written by Rob Pike, himself.

"We had used the original UTF from ISO 10646 to make Plan 9 support 16-bit characters, but we hated it.  We were close to shipping the system when, late one afternoon, I received a call from some folks, I think at IBM - I remember them being in Austin - who were in an X/Open committee meeting.  They wanted Ken and me to vet their FSS/UTF design."

Asking two legendary engineers for their input?  You can probably guess what happened next...

"Ken and I suddenly realized there was an opportunity to use our experience to design a really good standard and get the X/Open guys to push it out.  We suggested this and the deal was, if we could do it fast, OK."

That's right.  Ken and Rob had some ideas.  And the X/Open folks agreed to listen to those ideas... if they could get them something fast.

And, by fast, they really meant "immediately... like... right now."  Because the X/Open team were, quite literally, all gathered in Austin to decide on this... right then.

"Yeah.  I could eat."

Ken and Rob did what any good programmers would do when placed on an almost impossibly tight deadline -- and needing to come up with an amazing idea that could change the course of computing for decades to come... they went out to grab some grub.

"So we went to dinner, Ken figured out the bit-packing, and when we came back to the lab after dinner we called the X/Open guys and explained our scheme.  We mailed them an outline of our spec, and they replied saying that it was better than theirs (I don't believe I ever actually saw their proposal; I know I don't remember it) and how fast could we implement it?  I think this was a Wednesday night and we promised a complete running system by Monday, which I think was when their big vote was."

Remember.  This was 1992.

Which means, while laptops and such certainly existed, most people (even legendary programmers) did not have any sort of mobile, portable computers.  Certainly not the kind you could take out to a restaurant.

So what, pray tell, did they write their new text encoding design on?

A placemat from a New Jersey diner.

This is not the placemat that UTF-8 was designed on.

Seriously.

"UTF-8 was designed, in front of my eyes, on a placemat in a New Jersey diner."

The boys, Ken and Rob, now had just a few days to get all of this done -- before the big vote on the new text encoding standard.  And they sure as heck didn't waste any time.

They got back from dinner, placemat in hand, and got to work.

"So that night Ken wrote packing and unpacking code and I started tearing into the C and graphics libraries.  The next day all the code was done and we started converting the text files on the system itself.  By Friday some time Plan 9 was running, and only running, what would be called UTF-8.  We called X/Open and the rest, as they say, is slightly rewritten history."

They converted an entire operating system over to a brand new -- just designed on a placemat -- text encoding format... in less than two days.
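For the curious, here is roughly what "packing and unpacking" means.  To be clear: this is not Ken's code.  It is a small Python sketch of the modern UTF-8 bit layout, which is capped at four bytes (the 1992 design allowed longer sequences), and the function names pack and unpack are just borrowed from Rob's wording.  The decoder is deliberately simple and does not reject every malformed input (overlong encodings, for example).

```python
def pack(cp: int) -> bytes:
    """Pack one code point into UTF-8 bytes (modern 1 to 4 byte form)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def unpack(data: bytes) -> int:
    """Unpack exactly one UTF-8 sequence back into a code point."""
    first = data[0]
    if first < 0x80:
        length, cp = 1, first
    elif first >> 5 == 0b110:
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, cp = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in data[1:]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)   # each continuation byte carries 6 bits
    return cp


print(pack(0x20AC).hex(" "))          # the euro sign, U+20AC
print(hex(unpack(pack(0x1F600))))     # round trip for U+1F600
```

The lead byte tells you how long the sequence is, and every following byte starts with the bits 10, which is why a decoder dropped into the middle of a stream can always find the next character boundary.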

Here's the rough timeline:

  • Wednesday (Sep 2) evening: Dinner at a New Jersey diner.  Ken sketches out the idea on a placemat.
  • Wednesday night: Coding begins.
  • Thursday: Coding complete.
  • Friday: Entire Plan 9 operating system is now using "UTF-8".
  • Monday (Sep 7): X/Open group votes to use the Ken/Rob encoding design.

On Tuesday, September 8th, 1992 (at 3:22am), mere hours after the official vote to accept their text encoding design, Ken Thompson sent out the following email regarding Plan 9 now using UTF-8:

"The code has been tested to some degree and should be pretty good shape.  We have converted Plan 9 to use this encoding and are about to issue a distribution to an initial set of university users."

That's right.

Ken and Rob got a call asking for feedback on a Wednesday.  By the next Tuesday (at 3am) they were ready to ship a version of their Plan 9 OS with all the changes, and their designs had been voted on by the largest UNIX companies in the world.

Like I said.

A recent picture of the two legends, themselves.

These guys are legends.

What about that placemat?

Considering the vast impact of UTF-8 on the world of computing... whatever happened to that original "design document" (aka "the placemat")?  It would certainly be of historic significance.

"I very clearly remember Ken writing on the placemat and wished we had kept it!"

Let this be a lesson to all of the programmers out there:

Keep all of your doodles, notes, and sketches you make for your projects... you never know when one of those projects will become critical to the entire world... making your quick sketch worthy of being in a museum.

Especially if it's on a placemat.  From a diner.  In New Jersey.


Copyright © 2023 by Bryan Lunduke.  All rights reserved.  The contents of this article are licensed under the terms of The Lunduke Content Usage License.
