Primary Key/Clustered Index

Mike_R

Okay,

I generally don't post much worthwhile stuff, but here's one. Something
I find a bit irksome about SQL Server, In the table designer (yes, I
use it, so what!) if you select a field to be your primary key, it
creates a clustered index. My understanding of clustered indexes was
that they should be used to reduce the number of records that need to
be scanned, for example (to make it relavent to message boards), you
have a table of posts. You might have a clustered index on the posts
table using TopicID as its key, rather than the default PostID. What
this means is that the rows will no longer be physically arranged by
their primary key, but rather by their topic (i.e. when you list the
posts in a topic, you narrow the number of rows you have to look at,
using another index on a column like PostTime to sort the (much
shorter) list)

My question is: What is the reason for SQL Server using a clustered
index on a primary key, and are there situations where this is a strong
advantage? (Yes, it should be indexed, thats not in question...)

apapadimoulis

First (and you seemed to expect this), you should get in the habbit of writing DDL instead of using Enterprise Manager. As the system grows, it becomes migration hell to move between environments (Dev, Test, UAT, Prod, etc) without DDL scripts and change management. Another advantage is that you have control and can explicitly see everything done.

Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Usage of the proper ensures a strong mental model of the domain; think mixing up your derivatives and integrals in calculus.

That said, a clustered index will physically order the rows as they are inserted/updated based on the column(s) indexed. This also happens to be why only one clustered index is permitted per table. SQL Server defaults to a clustered index on the primary key because PK's are unique (therefore ensuring an even spread of rows across pages with no "hotspots") and because PK's don't change very often (or shouldn't, at least).

As you indicated, a TopicID is a better candidate for your clustered index because you'll often be pulling an entire Topic worth of posts. Since they're close together physically, it saves a lot of IO. But if you don't have such a requirement, you're usually best sticking your clustered index on the PK so you've got a good random distribution across pages.

Mike_R

@Alex Papadimoulis said:

First (and you seemed to expect
this), you should get in the habbit of writing DDL instead of using
Enterprise Manager. As the system grows, it becomes migration hell to
move between environments (Dev, Test, UAT, Prod, etc) without DDL
scripts and change management. Another advantage is that you have
control and can explicitly see everything done.

It's a hard habit to break, when everyone around you is in the same habit, but it tends to really come back to haunt you when you need to build scripts for changes to tables... At prior employers they had a rule against using anything but DDL scripts, becuase the script that you used to change the schema had to be added to source control, and also needed to be combined later with other scripts when an update was sent to the data center.

Second, fields do not exist in a relational database. Columns do. There is a fundamental difference between the two terms. Ditto for rows versus records. Usage of the proper ensures a strong mental model of the domain; think mixing up your derivatives and integrals in calculus.

I don't know why, but I've always had difficulty with this terminology (I tend to get it mixed up frequently, More along the lines of "I know what I mean", hence the inconsistent use in my own post.)

That said, a clustered index will physically order the rows as they are inserted/updated based on the column(s) indexed. This also happens to be why only one clustered index is permitted per table. SQL Server defaults to a clustered index on the primary key because PK's are unique (therefore ensuring an even spread of rows across pages with no "hotspots") and because PK's don't change very often (or shouldn't, at least).

As you indicated, a TopicID is a better candidate for your clustered index because you'll often be pulling an entire Topic worth of posts. Since they're close together physically, it saves a lot of IO. But if you don't have such a requirement, you're usually best sticking your clustered index on the PK so you've got a good random distribution across pages.

Yep, Makes sense.

dhromed

Second, fields do not exist in a relational database. Columns
do. There is a fundamental difference between the two terms. Ditto for
rows versus records. Usage of the proper ensures a strong mental model
of the domain; think mixing up your derivatives and integrals in
calculus.

I've never been formally schooled in databasing, nor have the kind of
in-depth knowledge displayed here so often, but in my experience I've
come to define Field as the intersection of a row and column in a
database, and a Record as a Row With Data. Synonyms mostly, but Field
and Record because it's a database table, not yer average table within
whatever context or environment.

The derivatives vs integrals comparisin means nothing to me. 8-|

apapadimoulis

@dhromed said:

I've never been formally schooled in databasing, nor have the kind of in-depth knowledge displayed here so often, but in my experience I've come to define Field as the intersection of a row and column in a database, and a Record as a Row With Data. Synonyms mostly, but Field and Record because it's a database table, not yer average table within whatever context or environment.

The derivatives vs integrals comparisin means nothing to me. 8-|

A lot of (non-database) people say "rows/records, fields/columns, tomato/tomatah". But there really are fundamental differences between the two. They're not synonymous at all. A non-calc example would be comparing a "type" (or "class") with an "object". Similar, but fundamentally different.

Correction for your terminology:

The intersection of a Row and a Column is a Value.
A Field, on the other hand, is an application-specific part of a Record.
A Row cannot, by definition, exist without data. Tables need not have rows, but a row must contain data.

And as an FYI, here's some more differences on Records vs Rows, Fields vs Columns. I wrote this in some other post here ....

Records are definied within the application (think COBOL or C-structs reading from binary files). Rows are definied in the database.

Records do not have consistant structure. In file systems, flags within a file record determine what data follow All Rows within a table are the same.

Records are ordered. Rows have no order.

An empty file (where you store records) contains nothing: zero bytes on disk. An empty table still has columns, permissions, constraints, etc.

Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows known as GUID or IDENTITY)

Fields are ordered, columns are not

Fields cannot be "null" (what combination of bytes would you use to represent an integer null?); they can only contain a "dummy value" known by the application (999999999).

Fields can contain other records. Columns are always scalar (many programmers also make the mistake of the "CSV" column)

bmschkerke

@Alex Papadimoulis said:

Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows known as GUID or IDENTITY)

Ah, my head.. :)

In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule? Or, what about a self referencing constraint? You have an Offices table, for instance, with a "Central Office" column. The CO column requires a value that exists within the Office table.

I guess I'm asking what you consider a "pointer" in terms of the database.

Thanks.

@bmschkerke said:

@Alex Papadimoulis said:

Most importantly, records contain other records (or a pointer to another record in another file). Rows cannot contain other rows and *should not* contain pointers to other rows (many programmers make this mistake by putting pointers in their rows known as GUID or IDENTITY)

Ah, my head.. :)

In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule? Or, what about a self referencing constraint? You have an Offices table, for instance, with a "Central Office" column. The CO column requires a value that exists within the Office table.

I guess I'm asking what you consider a "pointer" in terms of the database.

Thanks.

I was going to ask the same thing.

Also, I appreciate your discussion about the differences between "field" and "column," "record" and "row," etc. Because I wanted to know what the hell you were going on about, I read up on the subject and feel that I have really deepened my knowledge of the relational model. I saw things that brought back horrible, terrible memories from some of my more theoretical math classes in college. So, again, thanks for spurring me on to learning!

That said, I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row." With absolutely no intention of offending you, it seems a little odd. Like if you were sitting next to someone, reviewing some poorly written code, and he said, "what do you supposed he means to be doing with this class here" as he very clearly pointed to an object. Chances are, he's not a total idiot and he realizes the difference between an object and a class, so there's no need to scream, "That's an object, not a class! They really are very different things!" every time he slips.

Anyway, like I said, I sincerely am not trying to be rude or condescending. It's just that I find your behaviour in this instance puzzling, that's all. Also, please realize that I realize that it's proabably my own lack of experience that's causing your behavior to seem puzzling to me. I'm willing to bet I'll understand when I've got more experience.

apapadimoulis

@bmschkerke said:

I guess I'm asking what you consider a "pointer" in terms of the database.

A pointer is something that does not exist in the real-world problem domain and are not exposed to end-users of the system.

@bmschkerke said:

In reference to the above - if you're storing addresses in an Addresses table, with the primary key being an autoincrementing identity column, does using that identity column in the Orders table to locate the related address violate the above rule?

Possibly. Why do you need an IDENTITY column? Addresses are standardized and will usually a DUN numbers. If not, it's really a problem for the business to solve. Generally, they solve this with (Cust_No + Address_Seq).

apapadimoulis

@UncleMidriff said:

I don't understand why you make such a big deal whenever someone slips up and says "record" when they meant "row" ... I find your behaviour in this instance puzzling, that's all.

Programming is one of the few professions where any one can grab a compiler and make a living writing code. Plumbers, barbers, and even insurance CSRs are required to maintain a licence to do their day-to-day job. But unlike a bad haircut or a leaky faucet, bad programming can cause massive financial loss and even death due; I've personally seen the former and have heard quite a few anecdotes about the latter.

Over the years, I've come to realize that bad programming (at the level I mentioned) is rarely done by a lazy or incompetent programmer; it's done by programmers who have absolutely no idea what they are doing yet eagarly do it anyway. Worse still, they don't even realize that they don't know what they're doing.

To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathmeticians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.

And usually, I've find that when people don't know the basics, they are still doing it anyway as a professional. Myself included. I can only be thankful that I didn't cause too much financial loss (only an unmaintainable system here and there). I only wish someone "yelled" at me earlier!

@Alex Papadimoulis said:

@UncleMidriff said:
I don't
understand why you make such a big deal whenever someone slips up and
says "record" when they meant "row" ... I find your behaviour in this
instance puzzling, that's all.

Programming is one of the few professions where any one can grab a compiler and make a living writing code. Plumbers, barbers, and even insurance CSRs are required to maintain a licence to do their day-to-day job. But unlike a bad haircut or a leaky faucet, bad programming can cause massive financial loss and even death due; I've personally seen the former and have heard quite a few anecdotes about the latter.

Over the years, I've come to realize that bad programming (at the level I mentioned) is rarely done by a lazy or incompetent programmer; it's done by programmers who have absolutely no idea what they are doing yet eagarly do it anyway. Worse still, they don't even realize that they don't know what they're doing.

To my point, when I see people use the wrong terminology, it implies that they really don't understand the basics, yet are trying to do it anyway. Mathmeticians don't mix up sines and integrals; carpenters don't mix up biscuit joints and bevels; and database programmers shouldn't mix up fields and columns.

And usually, I've find that when people don't know the basics, they are still doing it anyway as a professional. Myself included. I can only be thankful that I didn't cause too much financial loss (only an unmaintainable system here and there). I only wish someone "yelled" at me earlier!

I see..that makes sense. Thanks for the explanation.

About pointers: If you have the time and the inclination, could you give an example of a few tables with rows that contain pointers and what you would do to make them not suck? Even though you posted about it twice, I'm still having trouble understanding exactly what it is that you mean. It almost sounds to me like you're saying that meaningless primary keys are a bad thing, but I figure if that's what you meant you would have come right out and said it already. If that is what you mean, I don't know what to think, because I've heard convincing arguments for both sides of that argument.

apapadimoulis

@UncleMidriff said:

It almost sounds to me like you're saying that meaningless primary keys are a bad thing ... I've heard convincing arguments for both sides of that argument.

Yes, that's pretty much another way of putting it. Non-domain keys are by definition meaningless and functionaly act as row pointers. Pointers (according to the information rule of RDMS) are non-relational.

When I first started hacking around in a database, I had no idea what I was doing and ended up adding a sequence on every table (this was back in the day when DBs did not easily support AUTOID columns). It just seemed the logical way to do things.

After studying relational theory and it's predecessors (CODASYL, IMS, etc), it became pretty clear how the relational model is a much better way of doing things. If you're interested in the topic, Elsmari and Nathe have a good textbook that goes in much detail about non-relational models.

You'll find that those who advocate pointer-based data modeling (IDENTITY, GUID, etc) are not seasoned data modelers nor do they have an understanding of the history of data modeling. If they were, they would have run into the same problem with their methods that the creators of the RM had back in the late 60's.

Mike_R

@Alex Papadimoulis said:

To my point, when I see people use the
wrong terminology, it implies that they really don't understand the
basics, yet are trying to do it anyway. Mathmeticians don't mix up
sines and integrals; carpenters don't mix up biscuit joints and bevels;
and database programmers shouldn't mix up fields and
columns.

I'm sure a lot of the confusion spurs form the programming side. Take,
for example, ADO, which allows interaction between the application and
a RDBMS. ADO refers to a set of rows as a recordset, and a set of
values as a list of fields.

I'm defending myself in this instance, I understand rows and columns,
and actually understand some of the finer differences of rows vs
records and columns vs fields, now thanks to you, but with APIs still
refering to things using the wrong terminology, it's very easy to get
tripped up.

Your opinion on IDENTITY fields is very interesting, and I have often
thought about just that, because it seems to be pretty much a habit of
the IT industry in general: Have a row of information? tack an ID on it
so we can get to it later. Its almost a mantra, though in some cases,
some kind of ID number would be useful: Take the example of a table to
customers. Using first and last name as the identifying piece of
information will not work, becuase there's always more than one John
Smith. Another piece of information is necessary, such as a phone
number. In this case, it could be simpler to select by an arbitrary
assigned value to look up a particular customer record. Also, in
tables, such as something like order history, the foriegn key would
require 3 pieces of information. It seems more efficient to store an
IDENTITY value, or some other unique value such as a GUID or hash of
the 3 pieces of identifying information (though a hash can, by its very
nature, collide...), In this case, it seems better just to go with
IDENTITY values. Am I wrong?

apapadimoulis

@Mike R said:

I'm sure a lot of the confusion spurs form the programming side.

Absolutely. The relational model just doesn't "make sense" from a procedural programming perspective. This is why it took 20 years for them to evolve out of earlier models.

@Mike R said:

ADO refers to a set of rows as a recordset, and a set of values as a list of fields.

I wondered that, too. I'm glad to see ADO.NET changed it, but perhaps one reason behind the original naming conventions was the generalized nature of ADO. Anyone could write a provoider from any source (from AD to to MAPI) and in many of those sources, tables are not the mechanism by which data is stored.

Another possibility is that "field" sounds more "techie" than "column". Tables are, afterall, those things in Word and HTML.

@Mike R said:

IDENTITY ... seems to be pretty much a habit of the IT industry in general

Some one else mentioned that to me a while a go, and I think it's an exposure thing. On big systems developed by experienced, data specialists, you'll never see an autonumber or equivilent as the system. These same experienced data folks don't write in general publications; they have their own set of magazines that us non-data specialists would never read. The people who do write in the general publications don't fall in the category of specialist; they seem to be more generalist. And you know the jack of all / master of none rule ...

@Mike R said:

It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...), In this case, it seems better just to go with IDENTITY values. Am I wrong?

Exactly; it "seems" better, and "seems" more efficient. This is why the early data models did it this way.

The nearest "cousin" to the relational model is CODASYL (aka "network database"), with the most popular platform being IDMS. I think Oracle has/had a CODASYL db as well. Anyway, all records a CODASYL database have a DBKEY, which was a system generated hash based on their physical storage. Any record in the database could be retreived simply by knowing it's DBKEY; yoou didn't even need to know it's type (Customer, Product, etc). One of the huge advantages to this approach is that it was rediculously fast. The reason we don't have CODASYL databases anymore are because the of the inherent problems with pointer-based models: data integrity, pointer-chains, etc.

When you use pointers (i.e. GUID, IDENTITY, HASH, etc) in a relational database, you are essentially playing CODASYL. Worse, you're not using a CODASYL database to do do it, so you have to try to reinvent all the features they stuck in CODASYL databases to make pointer-based models work. As the system grows, it gets horribly ugly and the data begins to corode. And when it comes time for HIPPA/SOX/etc audits, whooo-boy: you don't even want to know how many hundreds-of-dollars an hour and how many hours it takes for data consultants charge to fix your system and bring it to complaince.

One of my favorite "massive financial loss" stories comes from a procedures system developed by a reputable consulting companies (for <$100k) that used the "ID everwhere" model (they obviously didn't bring in any data specialists). When the company failed an ISO audit, they had to pay over $500,000 to fix the data in this system. The data corrosion came from an incredibly tiny bug in the procedural code that, had they done the database right, would have only caused an error message to popup instead of data corruption.

Whenever I see people advocating for pointer-models, I think back to Santayana: Those who cannot remember the past are condemned to repeat it.

@Alex Papadimoulis said:

@Mike R said:
I'm sure a lot of the confusion spurs form the programming side.

Absolutely. The relational model just doesn't "make sense" from a procedural programming perspective. This is why it took 20 years for them to evolve out of earlier models.

@Mike R said:

ADO refers to a set of rows as a recordset, and a set of values as a list of fields.

I wondered that, too. I'm glad to see ADO.NET changed it, but perhaps one reason behind the original naming conventions was the generalized nature of ADO. Anyone could write a provoider from any source (from AD to to MAPI) and in many of those sources, tables are not the mechanism by which data is stored.

Another possibility is that "field" sounds more "techie" than "column". Tables are, afterall, those things in Word and HTML.

@Mike R said:

IDENTITY ... seems to be pretty much a habit of the IT industry in general

Some one else mentioned that to me a while a go, and I think it's an exposure thing. On big systems developed by experienced, data specialists, you'll never see an autonumber or equivilent as the system. These same experienced data folks don't write in general publications; they have their own set of magazines that us non-data specialists would never read. The people who do write in the general publications don't fall in the category of specialist; they seem to be more generalist. And you know the jack of all / master of none rule ...

@Mike R said:
It seems more efficient to store an IDENTITY value, or some other unique value such as a GUID or hash of the 3 pieces of identifying information (though a hash can, by its very nature, collide...), In this case, it seems better just to go with IDENTITY values. Am I wrong?

Exactly; it "seems" better, and "seems" more efficient. This is why the early data models did it this way.

The nearest "cousin" to the relational model is CODASYL (aka "network database"), with the most popular platform being IDMS. I think Oracle has/had a CODASYL db as well. Anyway, all records a CODASYL database have a DBKEY, which was a system generated hash based on their physical storage. Any record in the database could be retreived simply by knowing it's DBKEY; yoou didn't even need to know it's type (Customer, Product, etc). One of the huge advantages to this approach is that it was rediculously fast. The reason we don't have CODASYL databases anymore are because the of the inherent problems with pointer-based models: data integrity, pointer-chains, etc.

When you use pointers (i.e. GUID, IDENTITY, HASH, etc) in a relational database, you are essentially playing CODASYL. Worse, you're not using a CODASYL database to do do it, so you have to try to reinvent all the features they stuck in CODASYL databases to make pointer-based models work. As the system grows, it gets horribly ugly and the data begins to corode. And when it comes time for HIPPA/SOX/etc audits, whooo-boy: you don't even want to know how many hundreds-of-dollars an hour and how many hours it takes for data consultants charge to fix your system and bring it to complaince.

One of my favorite "massive financial loss" stories comes from a procedures system developed by a reputable consulting companies (for <$100k) that used the "ID everwhere" model (they obviously didn't bring in any data specialists). When the company failed an ISO audit, they had to pay over $500,000 to fix the data in this system. The data corrosion came from an incredibly tiny bug in the procedural code that, had they done the database right, would have only caused an error message to popup instead of data corruption.

Whenever I see people advocating for pointer-models, I think back to Santayana: Those who cannot remember the past are condemned to repeat it.

But how does using an indentity column or other form of meaningless primary key on a table lead to data corruption? I'm just not seeing it. The only thing in my admittedly limited experience that I have seen cause problems is when someone decides not to use a meaningless primary key. That has inevitably led to severe problems because whatever meaningful bit of data they chose for the primary key needed to be changed in some way later on in the life of the project. Granted, just because a few yahoos before me couldn't pick a proper, meaningful primary key doesn't mean that meaningful primary keys are bad, and I'm not trying to imply that. But I just can't see how meaningless primary keys, by their very nature, lead to data corruption.

apapadimoulis

@UncleMidriff said:

I just can't see how meaningless primary keys, by their very nature, lead to data corruption.

They realized long ago that, no matter how good the programmers are, you still will end up with defects in procedural "middle-tier" code. If it's the occasional error message or crash, it's rarely a big deal. When the defect causes data corruption (especially historical/complaince data), you have serious problems.

One of the biggest innovations with the RDBMS was that data integrity is handled in the database. Instead of relying on the middle tier, the RDBMS was responsbile for enforcing business rules on whether or not data is valid: everything from birthdays occuring at least eighteen years ago to orders having existing products.

When you use pointers to relate data instead of the data iteself, you lose the ability to maintain referential integrity. It's hard to see it in smaller systems. One of the simplist example comes from the case I mentioned (the $500k bug) and was a direct result of the inability to have DI at the database:

Everytime a procedure was changed, a new revision of the procedure was created and the old revision became inactive. Inactive procedures could not be changed and a trigger was used to enforce this ( IF EXISTS(SELECT * FROM deleted WHERE Status='Inactive') ROLLBACK ).

Procedures had user-maintinable discrete domains (picklists, dropdowns) of various things. For example, Key_Output_Variables := (Id int, Variable varchar(15), Active_Ind char(1)). The procedure table referenced a KOV with a foreign key constraint.

If the user wanted to rename a KOV, the system would check if any procedures used the KOV and, if so, create a new KOV and inactive the old one, thereby ensuring all old KOV remain in the system. However, there was a bug in this maintence module that caused it to update the KOV anyway (in some cases, if the user did it in a certain way). There were a handful of test cases to test this but none of them caught the peculiar way of changing the KOV.

After five years of being in production, auditors discovered this by comparing paper and electronic copies of inactive revisions. The only way the could correct the problem (and pass ISO) was to take three years worth of tape backups, paper copies, etc, hire an army of temps and try to reconcile the thousands upon thousands of procedures.

Had they not used an ID pointer, the trigger would rejected the update before it snowballed into half a million. This is pretty much how it usuaully happens: a minute bug in the Middle tier slowly corrupting historical data.

All in all, if you use these non-relational techniques, you end up trying to have to put in gobs of work-around code (the bandaid for this problem was to put in a trigger that queried the meta-tables and used dynamic sql to check what tables used that the value) which (since it's lots of code) will be inherently prone to defects. But the true nightmare begins when you try to build a data warehouse.

Again, you see these problems in large systems. But, a lot of times, systems start out small and become large.

bmschkerke

I don't need an identity column, but didn't have a suitable primary key
replacement. Duplicate addresses are to be accepted, if
entered. We have an entity that has up to five addresses.
Rather create columns of StreetAddress1, StreetAddress2, City, State,
ZipCode five times I moved it into a separate table. Since each
entity can have multiple addresses I cannot use the entity table's
primary key as an address primary key. We deal primarily with
individuals and so don't track DUN numbers.

The customer number and sequence would work, possibly. However, I
have linked this table to other tables, not all of which share a
similar logical structure. (Not all addresses are related to
customers, for example.)

Should each table have a separate Addresses table, so that it can be
managed using a parent table's keys? Or denormalize it if they're
so closely linked, even if the information might often be null for the
optional addresses?

@Alex Papadimoulis said:

@UncleMidriff said:

I just can't see how meaningless primary keys, by their very nature, lead to data corruption.

They realized long ago that, no matter how good the programmers are, you still will end up with defects in procedural "middle-tier" code. If it's the occasional error message or crash, it's rarely a big deal. When the defect causes data corruption (especially historical/complaince data), you have serious problems.

One of the biggest innovations with the RDBMS was that data integrity is handled in the database. Instead of relying on the middle tier, the RDBMS was responsbile for enforcing business rules on whether or not data is valid: everything from birthdays occuring at least eighteen years ago to orders having existing products.

When you use pointers to relate data instead of the data iteself, you lose the ability to maintain referential integrity. It's hard to see it in smaller systems. One of the simplist example comes from the case I mentioned (the $500k bug) and was a direct result of the inability to have DI at the database:

Everytime a procedure was changed, a new revision of the procedure was created and the old revision became inactive. Inactive procedures could not be changed and a trigger was used to enforce this ( IF EXISTS(SELECT * FROM deleted WHERE Status='Inactive') ROLLBACK ).

Procedures had user-maintinable discrete domains (picklists, dropdowns) of various things. For example, Key_Output_Variables := (Id int, Variable varchar(15), Active_Ind char(1)). The procedure table referenced a KOV with a foreign key constraint.

If the user wanted to rename a KOV, the system would check if any procedures used the KOV and, if so, create a new KOV and inactive the old one, thereby ensuring all old KOV remain in the system. However, there was a bug in this maintence module that caused it to update the KOV anyway (in some cases, if the user did it in a certain way). There were a handful of test cases to test this but none of them caught the peculiar way of changing the KOV.

After five years of being in production, auditors discovered this by comparing paper and electronic copies of inactive revisions. The only way the could correct the problem (and pass ISO) was to take three years worth of tape backups, paper copies, etc, hire an army of temps and try to reconcile the thousands upon thousands of procedures.

Had they not used an ID pointer, the trigger would rejected the update before it snowballed into half a million. This is pretty much how it usuaully happens: a minute bug in the Middle tier slowly corrupting historical data.

All in all, if you use these non-relational techniques, you end up trying to have to put in gobs of work-around code (the bandaid for this problem was to put in a trigger that queried the meta-tables and used dynamic sql to check what tables used that the value) which (since it's lots of code) will be inherently prone to defects. But the true nightmare begins when you try to build a data warehouse.

Again, you see these problems in large systems. But, a lot of times, systems start out small and become large.

So, in this system, the values in the Variable column of the KOV table are supposed to be unique, and thus it could have been used as a primary key. Also, the users can change the value in the Variable column of a given row whenever the heck they please. Provided that these assumptions are correct, here's what I would have done:

Since the value of Variable for a given KOV is expected to change, and since I wouldn't want to change every row in every table that referenced a particualr KOV everytime someone changed the value of Variable for that KOV, I would have used an Id column as the primary key. Then I would have added a trigger that would whine if someone tried to add or change a KOV with the same Variable value as an existing KOV. To keep a history of changes, I would have added another table, named something like "KOVChangeLog," in which I would store the Id and old Variable values of the changed KOV, as well as the user who changed it and the time at which it was changed.

When someone changes a KOV, the system would check to see if any procedures reference that KOV. If so, the KOVChangeLog table would have added to it the Id and old Variable value of that KOV along with the user responsible for and the time of the change, and then that KOV would be updated. If not, that KOV would just be updated. The trigger would ensure that the user wouldn't be able to change that KOV to a KOV that already exists, and the meaningless Id column used as the primary key would ensure that every row in every table referencing that KOV wouldn't have to be updated everytime that KOV might be changed.

I apologize if the above is a huge WTF for any reason, and I offer the excuse that I haven't been at this very long.

Another question: how is the use of pointers "non-relational" according to the Information Rule? I'm assuming here that you are talking about Codd's 12 Rules.

@Wikipedia.org said:

The information rule:

All information in the database to be represented in one and only one way, namely by values in column positions within rows of tables.

Perhaps I'm just dense, but I don't see how using meaningless primary keys violates this rule.

apapadimoulis

@bmschkerke said:

The customer number and sequence would work, possibly. However, I have linked this table to other tables, not all of which share a similar logical structure. (Not all addresses are related to customers, for example.)

Should each table have a separate Addresses table, so that it can be managed using a parent table's keys? Or denormalize it if they're so closely linked, even if the information might often be null for the optional addresses?

You're on the right track by putting it in a (single) addresses table. Two things to think about are orphaned addresses (an address without a Customer, Vendor, etc) and shared addresses (an address shared between a Customer and a Vendor). Both may or may not be problematic depending on your problem domain.

apapadimoulis

@UncleMidriff said:

I would have used an Id column as the primary key. Then I would have added a trigger that would whine if someone tried to add or change a KOV with the same Variable value as an existing KOV. To keep a history of changes, I would have added another table, named something like "KOVChangeLog," in which I would store the Id and old Variable values of the changed KOV, as well as the user who changed it and the time at which it was changed.

This is almost identical to the approach that was used. The problem came with implementing this the requirement "Renaming a KOV: if an inactive procedure exists that uses the original name, then a new KOV is created and assigned to all active/pending procedures that used the original. If no inactive procedure exists, the original KOV is renamed." The actual requirement and system was a bit more complicated, but suffice it to say that a loophole existed that allowed KOV to change despite having procedures that could not change. The procedure didn't know the KOV changed because the pointer to it remained the same.

As far as triggering the KOV table, the general practice with DI is to have the referencing table complain about changes, not the referenced table. There are good reasons for this, the strongest being it's much less prone to errors and scalability. You don't want to have to modify a table everytime another table wants to reference it.

Also, a thing to note about audit trails: they should be stored externally to the database. It's the same principle as keeping backups and originals on the same disk.

@UncleMidriff said:

Since the value of Variable for a given KOV is expected to change, and since I wouldn't want to change every row in every table that referenced a particualr KOV everytime someone changed the value of Variable for that KOV

Why not? That's *exactly* what you want to happen, because that's *exactly* what is happening in "real world". If the user changes something here and expects it to change there, then it effectively changes in both places. DRI actions make this so much easier to accomplish.

@UncleMidriff said:

Another question: how is the use of pointers "non-relational" according to the Information Rule? I'm assuming here that you are talking about Codd's 12 Rules.

Whoops, actually, I was thinking of another one, the "Guaranteed Access rule." And now that I look up the "12 rules" myself, I don't think it's very clear reading it as a bullet point either:

- All data should be accessible without ambiguity. This can be accomplished through a combination of the table name, primary key, and column name.

However, the no pointers thing is really a fundamental component of the relational model. Unlike heirarchical and network databases (the two predecessors to relational), data are referenced by their value, not pointers. That was actually one of the largest critisisms of the relational model back in its day, especially due to the limited disk and processing power of computers.

unordained

[Crap -- posted this as a new topic, too. Someone please delete that one? My apologies.]

First of all, if you're going to tell people to use correct vocabulary, you should get it right yourself. Columns, rows, and tables are all terms borrowed from the spreadsheet world. The only one you avoided using was 'cell', and used 'field' instead. You are correct that there is no order to them, but in spreadsheets, there is -- which is why re-using the same terms is a bad idea.

Database values, database variables, database domains (headers).
Relation values, relation variables, relation domains (headers), relation body.
Tuple values, tuple variables, tuple domains (headers).
Attributes, domains, values.

Those are the terms you should be using. A database value is a named set of relation variables. A relation variable always has a value that is a set of tuples whose headers all match the relation variable's type's header. And so forth.

But do we care? No.

As to autonumber fields, Alex, you're again wrong. They're called surrogate keys -- that is, candidate keys added to a relation in order to simplify things. You may have any number of candidate keys on a relation header, each with any number of fields. Each one is its own unique constraint on the tuples of the relation's body. One of them is picked (usually, but optionally) as the primary key. If you do pick one as a primary key, you'll use those same attributes in all referencing relations. But you're not required to. If you have two different unique ways of addressing a row, you can use either of them, interchangibly, anywhere in the database. An autonumber field is just one example of defining a surrogate key, that is, an invented candidate key.

Codd's rules require that any value in the database be addressable by the relation variable name and values for at least one candidate key, plus the attribute name for which you want to retrieve the value. The idea was that everything was "out in the open". You can see autonumber values, you can manipulate them. They are not the same thing as pointers, which are hidden and opaque.

In CODASYL, pointers were external to the tuple, not internal. When you had a tuple, you didn't have the pointer -- the pointer was not part of the information. It was physical, not logical. You had to follow pointers around to get rows. That's not what autonumbers are about. When you get a tuple from a relation that uses autonumbers, you have the number. It's part of the information. Nothing is hidden. You can change that number without breaking pointers -- most database systems will automatically update all FK fields for you, though that's not required. What is required is merely that the database system enforce constraints, and in the case of FK's, that should be "values of these attributes must match values of those attribtes". That is, a DBMS is only required to reject a COMMIT if you change the PK of a row currently referenced elsewhere. The rest is kindness.

In CODASYL, you had to navigate the records by following pointers around between rows in various tables. A master row would maintain a pointer into its first child, the children would have pointers to each other, and then there'd be another pointer back to the master row. That's what pointers are about. Order, physical, opaque. With autonumbers, we're talking about "find where this is true" -- no order, no physical aspect, and it's not opaque. It's a key, not a pointer.

The problem you described wasn't a problem of autonumbers at all. The problem was the rest of the code. You failed to maintain a proper set of candidate keys on your lookup tables -- one for the surrogate (autonumber) attribute, yes, but another for the text attribute that was being changed. Don't blame autonumbers for your mistakes. Autonumbers don't naturally lead to data corruption as such. If you forget to properly model the rest of the data, then yes, there can be problems -- but that'd be the case regardless. (Actually, someone else reading your description understands you to have been making entire copies of the lookup table, under a different name each time? If so, you were -way- off course, and I'm not sure I can properly explain how much.)

While speaking of misunderstanding about the relational model, we should also address the CSV point you brought up earlier. CSV attributes are, in fact, allowed by the relational model. The requirement is merely that they be treated atomically. That is, the value "A,B,C" is not the same as the value "A,B", and any tuples containing such a value are just one tuple each, not two or three. The relational model does not, however, prevent you from defining functions such as IS_SUBSET_OF that would allow you to do joins between relations across CSV columns, if you should so desire. You should be able to do, for example, SELECT A.*, B.* FROM A INNER JOIN B ON IS_SUBSET_OF(B.FIELD_B, A.FIELD_A);. The requirement merely prohibits you from treating a single tuple as if it were several, automatically breaking up a CSV attribute into non-CSV values for several virtual tuples. In fact, a lot of work has gone into making sure the relational model is properly understood to allow complex types, not just 'text', 'date', and 'integer'. Arrays, sets, images, polygon descriptions, etc. are all theoretically allowed, so long as you define proper functions for manipulating them. CSV is just one such example, albeit probably a slow one. (CSV is mainly human-readable -- you could come up with slightly better general-purpose set or array types to use.)

Autonumbers are:
a) faster, in many cases, than comparing text values
b) smaller, less footprint
c) allowed by relational theory
d) single field vs. multi-field FK's, thus faster
e) allow you to change uniqueness constraints in the referenced relation without changing the headers for every referencing relation -- particularly useful when requirements change slightly or previously-unknown outlier cases are discovered

You should always remember to add proper constraints to the referenced relation, of course. But there's more to constraints than merely using a primary key built out of existing attributes; making sure sums remain positive, that things balance, that certain dates always maintain a certain relationship between them (date_of_death > date_of_birth, for example), etc.

I've seen people try to get by without autonumbers when they had improperly identified candidate keys. Relations with 16 fields making up the primary key -- and they were lucky, this particular system allowed NULL states for attributes in the primary key (this is normally a bad thing, as NULL != NULL, which should make such rows impossible to address.) Despite this, we still had cases where we needed to enter two tuples with the same PK value, as distinct tuples. And it creates a mess -- every relation referencing this one now has 16 fields devoted just to that. That's a lot of data, and it's still wrong.

You fear autonumbers because of pointers, but you misunderstand the distinction. Come to think of it, if autonumbers are so evil ... why do we use, instead, autonumbers provided to use by the government? Social security numbers, federal tax id's, etc. are all just forms of autonumbers in their database. And on top of that, theirs aren't necessarily unique. (There have been documented cases of two people having the same SSN and DOB, yet being entirely unrelated otherwise. Don't trust SSN to be unique.)

Grab a copy of 'Foundation for Object/Relational Databases -- The Third Manifesto' (C.J. Date and Hugh Darwen) if you haven't already. They'll clear up your misconceptions for you, though they'll add a few too (their opinions on inheritance should be ignored, for example.) The relational model has fewer requirements than many people believe, and is therefore much more flexible than they've thought.