Shit, blew through a milestone

You know you are an old time computer geek when you notice special numbers.

2^8 is something we geeks all know: it is the number of distinct values a byte can hold. 2^8 is 256, so the maximum value of a single byte of computer memory is 255.

2^10 is 1024. I am guessing that most kids today see that number and scratch their heads, wondering why the old fogeys didn’t just call 1000 bytes a kilobyte. But we true geeks embrace these.

The next number is 2^16, which is 65,536, the number of distinct values representable in two bytes (256 * 256 for those counting); the maximum value itself is 65,535. I am amazed at how often I stumble across this number in the real world.
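For anyone who wants to check the arithmetic, here is a quick Python sketch. The function name byte_range is just something I made up for illustration.

```python
def byte_range(n_bytes):
    """Return (number of distinct values, largest unsigned value) for n_bytes bytes."""
    bits = 8 * n_bytes          # 8 bits to a byte
    count = 2 ** bits           # how many different bit patterns exist
    return count, count - 1     # values run from 0 up to count - 1


for n in (1, 2):
    count, largest = byte_range(n)
    print(f"{n} byte(s): {count:,} values, max {largest:,}")

print(f"2^10 = {2 ** 10}")
```

One byte gives 256 values topping out at 255, and two bytes give 65,536 values topping out at 65,535.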

Then there is the first CPU I learned to program, the MOS Technology 6502. Designed to be a cheaper alternative to the Motorola 6800, it really drove the home computer revolution in the late ’70s and early ’80s.

Two posts ago, I hit post 256. It makes me smile to mark off these milestones. I suspect I will make it to 1024, but probably not 65,536.

(For those curious as to why there are 8 bits to a byte and 2 bytes to a word, I encourage a Google search into early computer architectures. Read some details on the early systems, then try to relate them to your current laptop or tablet, and your eyes will be well and truly opened. Also read “Turing’s Cathedral” for a good narrative on one of the first electronic digital computers.)

Fixing ebooks with errors – A personal challenge

I have wholly embraced the eBook revolution. As a long-time traveler and SciFi aficionado, I have assembled a large collection of books that I continue to read and mark off my to-do list.

Being a fan of science fiction, I have been forced to acquire some of my books by extra-legal means. Since many of the classic tomes of the golden era of SciFi are out of print, and have no official ebook release to buy, I turn to the internet.

With few exceptions, these books are scanned and OCR’d from print, and then stuffed into a file to read. Lots of early Heinlein, and the work of many obscure authors, exists only this way.

The problem: OCR still sucks. Even the best algorithms barf a lot on text, and thus there are spots of garbage in many of these books.

I sometimes make it a personal mitzvah to clean up a book.

A classic example was the “To the Stars” trilogy by Harry Harrison (his real name, not a nom de plume). It was a rather poor scan, converted to an RTF file. It was painful to fix, but totally worth it, because it made the book completely readable.

However, if your book is in ePub or PDF format, you have fewer options.

Sigil, a pretty awesome open source ePub editor

The program I go to is Sigil. Provided there is no DRM, you can open and inspect the book, and fix small things. If you are savvy, you can also dive into the CSS stylesheet and alter fonts, indents, and other text properties (but be warned, some readers ignore much of the CSS codes and classes – I’m looking at you Sony Reader).

Sigil allows you to look at the text as it renders, in a split screen with the code below the rendered text, or as pure code. You can fix a lot of errors and glitches by searching for them and editing the code, then saving back to the original file.
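If you want to hunt for a known OCR glitch before you even open Sigil, a DRM-free ePub is just a ZIP archive with (X)HTML files inside, so a few lines of Python can tell you where to look. This is only a rough sketch under that assumption. The function name find_in_epub, the file name in the usage note, and the "tbe" pattern are my own examples, not anything Sigil provides.

```python
import re
import zipfile


def find_in_epub(epub_path, pattern):
    """Return a list of (member name, line number, line text) for matching lines."""
    regex = re.compile(pattern)
    hits = []
    with zipfile.ZipFile(epub_path) as book:
        for name in book.namelist():
            # Only look at the text content, not images, fonts, or stylesheets.
            if not name.lower().endswith((".xhtml", ".html", ".htm")):
                continue
            text = book.read(name).decode("utf-8", errors="replace")
            for lineno, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append((name, lineno, line.strip()))
    return hits
```

Calling find_in_epub("mybook.epub", r"\btbe\b") would list every line where "the" was misread as "tbe", which you can then go fix in Sigil's code view. It only reads the file, so the book is never modified.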

A future series of posts will go into depth on how to better structure the ebook.

Another good program, and one that is widely used, is Calibre. A library manager and file manipulation program, it is open source and extensible. It makes it easy to convert from one format to another (Kindle to ePub, or LRF to ePub, and many other options).

A nice touch is that in Calibre you can properly set up the ISBN and the cover images, and pull data on the book from public databases. I used Calibre to convert a collection of Doc Savage stories from the LRF format (the original Sony Reader format) to ePub, and to add good cover pictures.

In fact, most of the ebook files I look at in Sigil have signs of being converted/cleaned by Calibre, even some commercial books.

Doing this work, you find some things like:

  • Files which came from Microsoft Word – littered with the “class=MsoNormal” attribute. Ugh. I don’t usually curse too much about Microsoft Office, but the HTML it outputs, once converted into an ebook, is a crime against humanity.
  • Most ebooks, even some commercial, professionally edited and assembled ones, have horrible structure. No proper links to the chapters, and no proper tables of contents. Commercial books are much more likely to get this right, but it is a disaster in the community-sourced works. I am working up a process to fix that.
  • There are some truly shitty OCR engines out there. Even high-priced, high-performance engines have trouble, and the second tier is atrocious. Someone once grumbled on Slashdot about why there weren’t any good free, open source OCR engines. The answer is that it is friggin hard, and it often becomes a lifetime’s work to tune and improve the algorithm, so the good ones are in no hurry to be given away.
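As an example of the kind of cleanup the Word cruft calls for, here is a minimal Python sketch that strips the MsoNormal class from a chunk of markup. It is a crude regular expression, not a real HTML parser, and the names MSO_CLASS and strip_msonormal are hypothetical ones of my own. It only handles a class attribute whose entire value is MsoNormal and leaves anything else alone.

```python
import re

# Matches class="MsoNormal", class='MsoNormal', or unquoted class=MsoNormal,
# together with the whitespace in front of the attribute.
MSO_CLASS = re.compile(
    r"""\s+class\s*=\s*(?:"MsoNormal"|'MsoNormal'|MsoNormal\b)""",
    re.IGNORECASE,
)


def strip_msonormal(markup):
    """Remove class=MsoNormal attributes, leaving the tags and text in place."""
    return MSO_CLASS.sub("", markup)


print(strip_msonormal('<p class="MsoNormal">It was a dark and stormy night.</p>'))
```

The same find and replace can be done with a regex search inside Sigil. The point is only that the attribute can be dropped without touching the tag or the text inside it.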

I rarely make it a mission to fix an ebook, but when I do, I want to leave behind something that is a better experience to read.

(For the record, if there is a place to buy a book, I will always buy it, but much of what I read is esoteric or out of print, so I am forced into alternatives.)

A kudos to Courtyard

As a road warrior, I spend a lot of time on the road, sleeping in hotel rooms. I have learned to deal with loud ice machines, obnoxious families with kids tearing up and down the hallways at all hours, lumpy beds, and loud air conditioners. Doesn’t matter if it is a $90 La Quinta, or a $300 Hilton room, they all have warts.

Usually the one thing that grinds my gears is the bed in the room. It is either stiff as a board or completely worn out. Regardless, that leads to a poor night of sleep and a lot of discomfort (which getting older makes much worse) due to bad posture in bed.

However, I have to congratulate the Marriott Courtyard in Campbell, CA. The bed is perfect in my room. Supportive, comfortable, and I have had two good nights’ sleep in a row, a real rarity!

I will stop complaining about the slow-as-molasses-in-January elevators in Courtyards if you can make sure that all the beds are this perfect.