We need duplicate tables all over the database



  • Disclameir: I don't understand much about databases so TRWTF might be my lack of knowledge.

    So there's a need for a sign up system for applicants and despite we're the information security team we're going to develop this (possibly another WTF, I guess, but I'm new here so what do I know). So according to my boss there's going to be a "user" table, and then an "applicant" table. The data from the "user" table (such as name, address, ID number, etc etc etc) is going to be copied* into the "applicant" table. Not referenced, but rather copied.

    *(there will be more data on the "applicant" table along with the copied data)

    The reason for this? The data will be used for rather serious and official stuff, and my boss wants to avoid situations like, let's say, hipothetically:

    [hypothetical]

    • applicant signs up,
    • official application papers are sent to the address provided,
    • applicant logs in and changes address,
    • applicant sues the institution because the official application papers were sent to the "wrong" address,
    • upon further inspection, the applicant address present on the database is different from the address the papers were sent to,
    • chaos ensues.

      [/hypothetical]

    (Other possible cases are name changes after marriages, etc).

    The solution to this? User fills a form when signing up, double-check his personal data, clicks "OK" and then cannot change it. Personal data gets saved to "user" table. This table will be used only for login purposes. Then the registered user can pick an open position (or more) and apply for it. For each new application, user has to re-enter his personal data -- name, address, all the same stuff again (I'm guessing it will be auto-populated from the user table) -- then double-check it again, clicks OK, and then, guess what, cannot change this set of data too. If the applicant ever needs to make changes or corrections, then he's got to manually fill a form and phisically deliver it, in person, with attached copies of personal ID's or some other bureacracy like that.

    Althouhg I can understand the bureacratic need for following the process, and my boss's needs for avoiding future f#$%-ups, is that how things are properly done? As I said, I don't know much about databases, but isn't the proposed approach just creating lots of duplicate, dereferenced, potentially inconsistent data? Is this the only way, or the best way, for dealing with this?



  • Lots of different options for tracking changes in address. I would probably do something along the lines of:

    User table: UserID, Name, etc.
    Application Table: UserID, ActiveFlag, positionAppliedFor, dateApplied
    Address Table: UserID, ActiveFlag, Added_Date, SeqNum, Type (Home, Business, etc), Street, etc
    Phone Number Table: UserID, ActiveFlag, Added_Date, Type (Cell, Home, etc)

    Make each UserID only have one "Active" application, address, phone number, etc.

    I've also seen "ActiveFrom" and "ActiveTo" fields used... change ActiveTo field to yesterdays date, add new record that's active today through 4001-01-01:

    AddressTable: UserID, ActiveFromDate, ActiveToDate, type, Street, etc



  •  There are places where data normalization needs to be relaxed. A common one is the 'invoicing application' that is taught in school. You know, items table, customers table, linked together through 'invoices' and 'line items' tables. To view an invoice you pull customers info from the customers table, and descriptions and prices from the th items table, right? Wrong! prices change, but invoices don't, so you have to copy the current price into each 'line items' entry. In some situations, the need to be able to produce a perfect copy of that invoice later means that you may need to copy alot of data into the invoices tables. 

    Perhaps a better way would be to maintain historical records: [customer.current_address]->[addresses.id], old addresses are retained, never altered (edits blocked by permissions or, maybe even constraints?), indexed on addresses.user_id, and letters are linked to both customers and addresses. That strikes me as cleaner, but I, like you, do not work in the database field.

     



  • @atipico said:

    Disclameir: I don't understand much about databases so TRWTF might be my lack of knowledge.

    There's the first clue: you're smart enought to assume you might be the problem. The realy jucy WTFs tend to come from people who don't ponder such things.

    If you need to keep historical records of address changes, why not do so in a history table? User ID links into the user's history of address changes. Every time the user changes address (or whatever other items need to be tracked) the "old" info is entered in the history table as a new entry. Much cleaner, doesn't involve massive duplication, etc. Ideally the date/time of change would be included in the history table, so in the hypothetical situation, there's a record allowing the institution to show they sent the papers to whatever address was current.

    I'm not a DB guy either, but I did stay at a Holiday Inn Express last night!

    (edit: boy I type slow in the evening...)


  • Considered Harmful

    Yes, the proper way is to have "versioned" tables. Addresses (for example) are never edited in place, but a new version record is created and referenced. You can store the date when each version became active, giving you a historical record of what address was current at such-and-such time, or you can have things like invoices, packing slips, audit tables etc reference a specific (then-current, but possibly not now-current) version of the address.

    You maintain referential integrity, and also audit-ability, and avoid redundant storage of data (a normalization no-no, one of the worst data sins is being able to ask the same question 2 ways and get 2 conflicting answers).

    Joins can get to be a pain in this model, but views can help with that.



  • @atipico said:

    The solution to this? User fills a form when signing up, double-check his personal data, clicks "OK" and then cannot change it. Personal data gets saved to "user" table. This table will be used only for login purposes. Then the registered user can pick an open position (or more) and apply for it. For each new application, user has to re-enter his personal data -- name, address, all the same stuff again (I'm guessing it will be auto-populated from the user table) -- then double-check it again, clicks OK, and then, guess what, cannot change this set of data too.

    You don't explain exactly what happens here: an insert of the new records? an update with last_update date?

    I've found my self in this situation several times and the "best" solution have been to keep a historic table doing something like: insert into historic_user, update user set active = 0, insert into user new_user, delete from user old_user

    Why is it best? Basically because this historic table will be used only at very specific times and you don't want all that unused data in a table that's going to be hit very hard every day. And no, it's not a WTF since the requirement is very specific, you've got the tools to do it properly and it's a not uncommon case.



  • The fact that your boss is thinking of the table that records the applications as an "applicant" table rather than an "applications" table says to me that WTF is on its way.

    Given your need for accountability and auditability, it seems to me that making a hard and fast rule that DB rows shall only ever get added, never updated or deleted, should push the design in the correct direction. That, along with consistent use of artificial primary keys generated strictly sequentially, should let you build something nicely normalized where it's still easy to query for the latest available data.



  • @ubersoldat said:

    Basically because this historic table will be used only at very specific times and you don't want all that unused data in a table that's going to be hit very hard every day.

    Is leaving unused historical data in a heavily used table really that much of an issue? I would have thought any halfway decent DBMS would have caching that works well enough to deal with 99% of it. I would also have thought that applicants are not actually very likely to change their contact information frequently enough to bloat the tables all that much. The separate historical table looks like premature optiimization to me.



  •  CREATE TRIGGER ON Users FOR UPDATE, DELETE AS {copy record to historical table here}


  • Fake News

    @flabdablet said:

    I would have thought any halfway decent DBMS would have caching that works well enough to deal with 99% of it. I would also have thought that applicants are not actually very likely to change their contact information frequently enough to bloat the tables all that much. The separate historical table looks like premature optiimization to me.
    Agreed on all fronts. Also, as long as the table had a version number field that was added to the PK, proper indexing could easily take care of any performance issues.

     


  • Fake News

    @Sir Twist said:

    CREATE TRIGGER ON Users FOR UPDATE, DELETE AS {copy record to historical table here}
    In general, triggers are fucking retarded. They solve the wrong problem while also introducing slowdowns. A properly designed system would eliminate the need for triggers.

     



  • @flabdablet said:

    Is leaving unused historical data in a heavily used table really that much of an issue? I would have thought any halfway decent DBMS would have caching that works well enough to deal with 99% of it.
     

    Yes, they optimize, no, it's not enough. They'll still spread your data all over the disk, slowing some kinds of access. If those changes are enough of an issue, the correct thing to do is to create another table.

    @flabdablet said:

    I would also have thought that applicants are not actually very likely to change their contact information frequently enough to bloat the tables all that much. The separate historical table looks like premature optiimization to me.

    Now, I have the same impression here. I'd go with a history table and a view, and change to a current table with a trigger only if needed.

     



  • @lolwhat said:

    In general, triggers are fucking retarded. They solve the wrong problem while also introducing slowdowns. A properly designed system would eliminate the need for triggers.
     

    In general the Universe is composed of dark energy and dark matter. Also, in general any random person is already dead.



  • @atipico said:

    So there's a need for a sign up system for applicants and despite we're the information security team we're going to develop this (possibly another WTF, I guess, but I'm new here so what do I know).

    Since it's a project which deals with manipulating all kinds of highly sensitive data, your team should be heavily involved in the design and review all code anyway. Getting you to write the code as well doesn't seem too bad, as long as the team's big enough that people won't be reviewing their own code.



  • @lolwhat said:

    @Sir Twist said:

    CREATE TRIGGER ON Users FOR UPDATE, DELETE AS {copy record to historical table here}
    In general, triggers are fucking retarded. They solve the wrong problem while also introducing slowdowns. A properly designed system would eliminate the need for triggers.

     

    [Citation needed]



  • @lolwhat said:

    Also, as long as the table had a version number field that was added to the PK, proper indexing could easily take care of any performance issues.

    Given that user details won't be updated on anything like a regular schedule, surely a per-user version number field would serve only to allow that user's rows to be retrieved in historical order; the actual numbers wouldn't mean anything. What could an explicit version number achieve that a simple sequential integer PK could not?


  • Fake News

    @flabdablet said:

    What could an explicit version number achieve that a simple sequential integer PK could not?
    Use the creation date of the new record.

     



  • @lolwhat said:

    @flabdablet said:

    What could an explicit version number achieve that a simple sequential integer PK could not?
    Use the creation date of the new record.

    As part of the PK? Why?


  • Fake News

    @flabdablet said:

    As part of the PK? Why?
    Why does one use a PK?



  • Say, atipico, you don't happen to work for a company which has "talent" on its name, do you?



  • @lolwhat said:

    Why does one use a PK?

    Not sure if this is a trick question, so I'll bite. As I understand it, a PK exists to allow for unambigious identification of a given table row, so as to make it possible to join it reliably to related rows in other tables. It's also usually indexed, allowing fast lookup of rows by their PK. If the PK is an integer, and known to have been created by a sequence generator, it also lets you determine the order in which rows were added to the table, so you can do stuff like looking up the last-added row for any given user ID. So I'm still completely unclear on why you seem to be suggesting it would be a good idea to make a composite PK including an explicit per-user version number and/or timestamp, as opposed to a single sequential integer artificial PK. What am I missing here?


  • Considered Harmful

    I'm sure this will infuriate someone, but I'd go so far as to say a timestamp or datetime value is almost always a poor choice of PK. (Date without time might make sense in some uncommon circumstances.)

    Time in general is a pain-in-the-ass to work with properly, because of things like timezones, daylight savings, precision (seconds? milliseconds? microseconds? ticks?), etc. You don't want to have to think about those things every time you're doing a join.

    More importantly, you're just asking to run into concurrency problems and race conditions.



  • @joe.edwards said:

    Time in general is a pain-in-the-ass to work with properly, because of things like timezones, daylight savings, precision (seconds? milliseconds? microseconds? ticks?), etc. You don't want to have to think about those things every time you're doing a join.

    Filed under: Would 5PM CST join successfully to 6PM EST?

    If it's a primary key then you either configure the database to local time or GMT and stick with that. Do timezone conversions in the presentation layer. Let the database use whatever internal format it likes at whatever precision it likes. Just don't change it after going live with production data. I've seen one example where a global company had a data centre in Chicago but no actual operations in Chicago or that entire time zone. The database used Chicago time and that was what you had to use for joins.

    The strange part was the backup data centre is in San Diego. I never worked out if it used local time or Chicago time. I expect that the people who looked after it had actually tested it.

    To answer the original question, I'd like to ask a few more questions: How big is this? (How many thousands of applications?) and How long do you expect it to be active? (Just until the next election?) That would help you decide to do the detailed history-logging or keep it simple with duplicated data. I've seen these sorts of problems for real (financial papers sent to the ex-wife's house) so it is certainly a real problem to be addressed. The "invoices" answer is probably the right one. Keep a copy of what was sent out, even if it's a paper copy in a physical file.



  • @lolwhat said:

    A properly designed system would eliminate the need for triggers.
     

    Surely a properly-designed system wouldn't experience slowdown when using triggers.

    They're the most effective method I know of to add change auditing to a RDBMS.


  • Considered Harmful

    @Qwerty said:

    Chicago time

    I've noticed operating systems ask me for the nearest major city instead of my timezone; I've always wondered why this is the case. I definitely have to think a little harder to figure out which city is geometrically closest to me (I live in a major city but it's rarely on their list), than I would to just pick my timezone. What's up with that?</tangent>



  • @joe.edwards said:

    I've noticed operating systems ask me for the nearest major city instead of my timezone; I've always wondered why this is the case. I definitely have to think a little harder to figure out which city is geometrically closest to me (I live in a major city but it's rarely on their list), than I would to just pick my timezone. What's up with that?</tangent>

    Because "timezones" are vertical stripes on the map but different cities (usually states, but it doesn't always follow state boundaries) have different daylight-savings rules. Australian eastern time is split into 3 zones by the daylight-savings rules and there's at least 2 other countries with the same timezone - not even in the same hemisphere.

    And most users won't know what timezone they're in but they do know what city they're near.



  • @atipico said:

    Althouhg I can understand the bureacratic need for following the process, and my boss's needs for avoiding future f#$%-ups, is that how things are properly done?
     

    No. Fundamentally it sounds like you want to track changes to data so that although the most recent record is displayed, historical changes to that record can also be displayed for auditing purposes.

    @atipico said:

    isn't the proposed approach just creating lots of duplicate, dereferenced, potentially inconsistent data? Is this the only way, or the best way, for dealing with this?

    It sounds like your boss has arrived at a solution without properly explaining (or defining) the problem. TRWTF is that he should really explain what the issue is and why it's an issue then leave it to the architects to propose a solution or two, explain the benefits and consequences of selecting one over the other than allow him to decide which should be implemented.

    Most databases have some feature to record changes to data - as someone mentioned, using a trigger is one method. Flicking on FLASHBACK in Oracle allows transaction auditing is another, but it builds up large (internal) tables of all transactions so could be a performance hit. One question should really be: how often is historical data queried, and why? If it's rarely then chucking the old record into a history table along with identifiable information (user, IP, timestamp, etc) should be sufficient to report against changes and show evidence that someone's taking the piss.

    But it does sound like the current method is going to cause everyone to do things in triplicate in the rare case that one piece of paper gets manually amended.



  •  @lolwhat said:

    @flabdablet said:
    As part of the PK? Why?
    Why does one use a PK?

    To uniquely identify the row.. but that doesn't explain the purpose of using the data types you propose.

    I'd settle with a sequential number PK and another column showing a timestamp when it changed.


  • BINNED

    @Sir Twist said:

     CREATE TRIGGER ON Users FOR UPDATE, DELETE AS {copy record to historical table here}

    Sir, I'm gonna have to ask you to step away from the keyboard...Your admin rights are revoked, effective immediately.



  •  If the official applications were sent by registered mail it would solve the problem of tracking when the applications were mailed. More cost, but less hassle down the road.

    A sane policy applicatants would need to agree to would also reduce future hassle...

     


Log in to reply