Research on SQLite UUIDs

**UUIDs**

http://sqlite.1065341.n5.nabble.com/How-to-store-128-bit-values-td25224.html

On 7/11/07, Steve Krulewitz <[hidden email]> wrote:

In the application I am working on (Songbird), we have a simple two-table schema representing the tracks in your music collection and the properties on those tracks. The keys used are all UUIDs (128-bit numbers), which we are currently storing in hex string form in a text column, so they literally appear in the database as “ee89b823-e709-40c5-bed7-dcb0b2b791a8”. We do lots of lookups and joins on these keys.

I was wondering if there is much to be gained by storing these 128-bit values in binary rather than as strings.

Probably, because in a database, space is the same as speed in many cases. Smaller databases will generally be faster because you can cover more items in the same number of disk reads.

  • Use a pair of integer columns

Probably a bad idea. SQLite stores integers using a variable-length encoding, so small values take only a byte or two while large values take the full width. That works well for data where the high-order bits are mostly zero (lots of small positive integers). UUID halves would be expected to have a uniform distribution, so you’ll probably lose on both encoding/decoding overhead and space.

  • Continue to use a text column and shove binary data into it and hope for the best
  • Use a blob column, but I wonder how efficient indexes and joins are on this column type?

The blob column should be strictly more performant than a same-size text column, because SQLite compares blob data as raw bytes, while text values may go through collation or encoding conversion (though generally don’t). Since in this case the blob column can store the data more compactly, it is likely to win across the board.

  • Create another table in the schema that maps integers to UUIDs, and use these integers in the rest of the schema rather than the UUIDs. This would add an additional join every time we want to do a query based on UUID.

It really depends on how often this is happening. SQLite tables are organized by rowid. An index on a table maps the indexed column to rowid, so querying through an index involves traversing two (generally) uncorrelated btrees. If you have many tables with indices on these UUID values, it might make sense to consolidate into a single index and use rowids everywhere else.

You might even be able to get away without that index. If you hash the UUID and take the low-order 64 bits of the result, you could use that as the rowid directly. Even the low-order 32 bits might suffice. [Don’t just take the low-order bits of the UUID itself: since you have no idea whether they were generated appropriately, you might run into an unacceptable level of collisions.]

-scott
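A minimal sketch of that hash-derived rowid idea, in Python with the standard hashlib, sqlite3, and uuid modules; the table and column names are invented for illustration, and collision handling on insert is left out:

import hashlib
import sqlite3
import uuid

def uuid_to_rowid(u):
    # Hash the UUID rather than trusting its own low-order bits to be
    # well distributed, then take 64 bits as a signed SQLite rowid.
    digest = hashlib.sha256(u.bytes).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE track(tid INTEGER PRIMARY KEY, uuid BLOB, title TEXT)")

u = uuid.uuid4()
con.execute("INSERT INTO track(tid, uuid, title) VALUES (?, ?, ?)",
            (uuid_to_rowid(u), u.bytes, "Some Song"))

# Lookups by UUID recompute the hash and go straight to the rowid btree;
# the extra uuid comparison guards against the (rare) hash collision.
row = con.execute("SELECT title FROM track WHERE tid = ? AND uuid = ?",
                  (uuid_to_rowid(u), u.bytes)).fetchone()
print(row)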


“Steve Krulewitz” <[hidden email]> wrote:

Hey all –

In the application I am working on (Songbird), we have a simple two-table schema representing the tracks in your music collection and the properties on those tracks. The keys used are all UUIDs (128-bit numbers), which we are currently storing in hex string form in a text column, so they literally appear in the database as “ee89b823-e709-40c5-bed7-dcb0b2b791a8”. We do lots of lookups and joins on these keys.

I was wondering if there is much to be gained by storing these 128-bit values in binary rather than as strings.

Storing the UUIDs as BLOBs rather than as hex-encoded strings will require 21 bytes per key instead of 43, so your indices will have about twice their current fanout. That means you can expect to roughly double your lookup performance on a cold cache. (If the database, or large parts of it, is already in your OS cache, the difference will be much smaller and will likely not be noticeable.)

The downside of using BLOBs is that you cannot see them easily during debugging when you do a “SELECT * FROM…“. You have to use constructs like, “SELECT hex(uuid), … FROM” which is more typing. You can fix that with a VIEW, I suppose.
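As a sketch of both points (the raw 16-byte BLOB key and the hex() debugging view), here is a small example in Python with the standard sqlite3 and uuid modules; the table, view, and column names are made up for illustration:

import sqlite3
import uuid

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE track(uuid BLOB PRIMARY KEY, title TEXT)")

# Store the raw 16 bytes of the UUID instead of its 36-character text form.
u = uuid.uuid4()
con.execute("INSERT INTO track(uuid, title) VALUES (?, ?)", (u.bytes, "Some Song"))

# A view that re-hexes the key makes ad-hoc SELECTs readable again.
con.execute("CREATE VIEW track_debug AS SELECT hex(uuid) AS uuid_hex, title FROM track")
print(con.execute("SELECT * FROM track_debug").fetchone())

# Round-trip the blob back to a uuid.UUID when the application needs one.
blob, title = con.execute("SELECT uuid, title FROM track").fetchone()
print(title, uuid.UUID(bytes=blob))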

The thing to consider is why you are using 128-bit UUIDs in the first place. Presumably you are doing this so that you can sync and/or merge independently created databases without having to worry about key collisions. If you are not doing syncs or merges or otherwise dealing with independently generated keys, then I can’t think of a good reason to use 128-bit UUIDs at all; just use small integer autogenerated keys from SQLite.

Assuming you are doing syncing and merging, what I tend to do in these kinds of situations is to create a mapping between the universally unique ID (UUID) and a small integer ID that is unique in the local database - call the latter the RID. The RID is just an integer and can be your INTEGER PRIMARY KEY for very fast lookups. Looking up a record by INTEGER PRIMARY KEY is always twice as fast as looking up the same record by any other key, and because of high fanout can potentially be much faster if you have a cold cache. So I use the RID for joins or other internal operations and only use the UUID for lookups from external data.

For example, your schema might look something like this:

CREATE TABLE album(
  aid INTEGER PRIMARY KEY,        -- Internally unique album ID
  uuid TEXT UNIQUE,               -- Universally unique album ID
  ...
);
CREATE TABLE track(
  tid INTEGER PRIMARY KEY,        -- Internally unique track ID
  uuid TEXT UNIQUE,               -- Universally unique track ID
  aid INTEGER REFERENCES album,   -- Album containing this track
  ...
);
CREATE INDEX track_album ON track(aid);

A typical query might be to find all tracks of an album:

SELECT * FROM track WHERE aid=(SELECT aid FROM album WHERE title=?)

And queries like this will run much faster using INTEGER PRIMARY KEYS rather than UUIDs.
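A sketch of that RID pattern against roughly the schema above, in Python with sqlite3; the sample data and the way the UUID arrives from outside are invented. The idea is to resolve an externally supplied UUID to the small integer key once, then join on integers everywhere else:

import sqlite3
import uuid

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE album(aid INTEGER PRIMARY KEY, uuid TEXT UNIQUE, title TEXT);
    CREATE TABLE track(tid INTEGER PRIMARY KEY, uuid TEXT UNIQUE,
                       aid INTEGER REFERENCES album, title TEXT);
    CREATE INDEX track_album ON track(aid);
""")

album_uuid = str(uuid.uuid4())
con.execute("INSERT INTO album(uuid, title) VALUES (?, ?)", (album_uuid, "Example Album"))
aid = con.execute("SELECT aid FROM album WHERE uuid = ?", (album_uuid,)).fetchone()[0]
con.execute("INSERT INTO track(uuid, aid, title) VALUES (?, ?, ?)",
            (str(uuid.uuid4()), aid, "Example Track"))

# An external caller hands us a UUID; translate it to the local RID once...
aid = con.execute("SELECT aid FROM album WHERE uuid = ?", (album_uuid,)).fetchone()[0]
# ...and the hot path joins on the INTEGER PRIMARY KEY, not the UUID.
for (title,) in con.execute("SELECT title FROM track WHERE aid = ?", (aid,)):
    print(title)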

All that said, for even the largest music collection, your database is unlikely to have more than a few thousand albums and a few tens of thousands of tracks, and with such a small database running on a modern workstation, probably just about anything you do is going to be fast enough. The optimizations described above are useful if you have tables with millions of entries, or if you are doing thousands of queries per second, or if you are running on some woefully underpowered portable music player. But for a personal music library running on a workstation at human-interaction speed, use whatever schema makes it easiest for you to type in correct and working code.

– D. Richard Hipp <[hidden email]>


Looking up a record by INTEGER PRIMARY KEY is always twice as fast as looking up the same record by any other key


You always have an INTEGER PRIMARY KEY on every table. It’s part of SQLite’s storage mechanism.

If you don’t explicitly declare one, it’s still there under the name ROWID or OID (and a few other synonyms). By explicitly declaring it, you simply give the same column yet another name.

So no, you won’t gain anything by trying to avoid this column - it is always there whether you declare it or not.
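A quick way to see this aliasing, assuming a throwaway table:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t(id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO t(name) VALUES ('example')")

# The declared column and the built-in synonyms all read the same key.
print(con.execute("SELECT id, rowid, oid, _rowid_ FROM t").fetchone())
# -> (1, 1, 1, 1)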


source: http://wtanaka.com/node/8106

Most compact way to store UUIDs in sqlite

Submitted by wtanaka on Mon, 2014-04-28 01:22. If you want to store a UUID in SQLite and use the minimal possible amount of disk space, it seems like the best solution may be to represent each UUID as a 16-byte binary string and store it in a BLOB column. I tested inserting 100,000 UUIDs into a table using each of the following strategies with SQLite 3.7.9, and the resulting file sizes were:

  • 2447360 bytes (46% smaller than canonical form): One BLOB column to store the 16-byte binary string
  • 2568192 bytes (43% smaller than canonical form): Two INTEGER columns to store the most significant and least significant 64 bits in the UUID, respectively
  • 4112384 bytes (9% smaller than canonical form): One TEXT or BLOB column storing the minimal hex string (e.g. ca7b9ad52afa4c78bad98d0d751fee33)
  • 4498432 bytes: One TEXT column storing the canonical form (e.g. ca7b9ad5-2afa-4c78-bad9-8d0d751fee33)
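For reference, a sketch of how those representations can be derived from one another in Python (the table names here are invented, and exact file sizes will vary with SQLite version and page size):

import sqlite3
import uuid

u = uuid.uuid4()

canonical = str(u)     # 36-char form, e.g. 'ca7b9ad5-2afa-4c78-bad9-8d0d751fee33'
minimal_hex = u.hex    # 32-char hex string without dashes
as_blob = u.bytes      # the raw 16 bytes
# Two signed 64-bit halves, since SQLite INTEGERs are signed 64-bit values.
hi = int.from_bytes(u.bytes[:8], "big", signed=True)
lo = int.from_bytes(u.bytes[8:], "big", signed=True)

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE uuid_blob(u BLOB)")
con.execute("CREATE TABLE uuid_ints(hi INTEGER, lo INTEGER)")
con.execute("INSERT INTO uuid_blob VALUES (?)", (as_blob,))
con.execute("INSERT INTO uuid_ints VALUES (?, ?)", (hi, lo))

# Round trip from the blob form back to the canonical string:
print(uuid.UUID(bytes=con.execute("SELECT u FROM uuid_blob").fetchone()[0]))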