The Daily WTF: Curious Perversions in Information Technology
Welcome to TDWTF Forums Sign in | Join | Help
in Search

Phone book of 1 000 000 entries in 1Mb

Last post 12-30-2008 3:12 PM by m0ffx. 33 replies.
Page 1 of 1 (34 items)
Sort Posts: Previous Next
  • 11-21-2008 11:08 AM

    Phone book of 1 000 000 entries in 1Mb

    Hello everybody.

    I'm new here but I would like to submit this question I had once in an Interview : 

    "How would you store 1 000 000 entries of phone numbers (with 8 digits, that was the rule in France until 1996) in 1 Mega byte ?"

    I must admit I failed this one. Would you have succeeded ?...

     

    Regards,

    Glat'

  • 11-21-2008 11:30 AM In reply to

    • PJH
    • Top 10 Contributor
    • Joined on 02-14-2007
    • Newcastle, UK
    • Posts 1,256

    Re: Phone book of 1 000 000 entries in 1Mb

    I'd hire Earl Bradley of WEB Technology to do it for me. Or maybe Jules Gilbert.

    Abstinence makes the Church grow fondlers.

    - unknown
  • 11-21-2008 11:32 AM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Interesting... it depends very much on the full details of the problem; for example "0100 0000 to 0199 9999" is a million 8-digit numbers in 22 bytes, but not very useful!

    More details please...
    Linux is not a code base. Or a distro. Or a kernel. It's an attitude. And it's not about Open Source. It's about a bunch of people who still think vi is a good config UI.

    Notice: Phorm, and its agents including ISPs collecting data on Phorm's behalf, are specifically forbidden from performing any processing or monitoring of the content of the above post. Hence, under the Regulation of Investigatory Powers Act 2000 any such attempt to profile this page by Phorm or its agents is illegal.
  • 11-21-2008 11:51 AM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Well, it's been a while... If I remember correctly, it must be searchable (for example : give me the phone number #666 of 1 000 000 :)

    I can't recall the answer they gave me, but it wasn't about strings, I guess. Otherwise it's simple : stringify and cat all the number entries together and zip it with some efficient tool (Huffman code or whatsoever). Then, hope (or prove) that it doesn't run beyond the limit. I tried with all the possibilities I know of, but in vain. I guess one must use bit arrays, but I don't know how to go under this limit of 1Mb...

  • 11-21-2008 12:33 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    glatapoui:
    "How would you store 1 000 000 entries of phone numbers (with 8 digits, that was the rule in France until 1996) in 1 Mega byte ?"
    We're talking about this in IRC (#tdwtfmafia on irc.slashnet.org if you want to hop in).

    Each 8-digit number can take up 8 bits per character if you store them as strings, but if you sort them and store the differences between the numbers instead, you'll only need 7 bits per number, on average. 7 bits per number for 1 million numbers is 7/8ths of a megabyte, but that doesn't take into account delimiting the differences (since they can have variable size).

    Other than that, compression is always an options, but requires more than a megabyte of working memory to scan through. The store-the-differences option can't handle direct lookups, but it can calculate the nth phone number from the previous n-1 numbers in under a megabyte of space.

    But it's a pretty crazy question in the first place.

    My first instinct was to try storing the values in a trieve with a specialization of only dealing with the characters 0-9 (so 4 bits per character). Using this method, you would theoretically need to store only 1 111 111 characters, each at 4 bites, meaning it would take up slightly more than half of a megabyte. This one wouldn't allow you to search numbers by index, but given a number it would let you know whether it exists or not. If searching is a requirement then this one isn't an option.

    Then again if searching is required the store-the-differences method isn't an option either since it requires sorting.

  • 11-21-2008 2:30 PM In reply to

    • Nelle
    • Top 100 Contributor
    • Joined on 11-08-2007
    • graz.at.earth.milkyway.universe
    • Posts 285

    Re: Phone book of 1 000 000 entries in 1Mb

    Welbog:
    This one wouldn't allow you to search numbers by index
     

    why not ? 

     

  • 11-21-2008 2:54 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Nelle:
    Welbog:
    This one wouldn't allow you to search numbers by index
    why not ?
    I guess I'm using the wrong word for it. The correct term is "trie" instead of "trieve". That's what I get for knowing the origin of the word rather than the word itself. The implementation details answer your question trivially:

    http://en.wikipedia.org/wiki/Radix_tree

    http://en.wikipedia.org/wiki/Trie

  • 11-21-2008 3:20 PM In reply to

    • PJH
    • Top 10 Contributor
    • Joined on 02-14-2007
    • Newcastle, UK
    • Posts 1,256

    Re: Phone book of 1 000 000 entries in 1Mb

    Welbog:
    We're talking about this in IRC...

    Welbog:
    specialization of only dealing with the characters 0-9 (so 4 bits per character).

    And the phrase 'binary coded decimal' (BCD) didn't rear it's head?

    Abstinence makes the Church grow fondlers.

    - unknown
  • 11-21-2008 3:29 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    PJH:

    Welbog:
    We're talking about this in IRC...

    Welbog:
    specialization of only dealing with the characters 0-9 (so 4 bits per character).

    And the phrase 'binary coded decimal' (BCD) didn't rear it's head?

    I honestly don't understand the point of this comment.  Are you trying to show off that you know the formal name of such a storage method?  Because we all know that, too.  Or are you trying to imply that BCD is somehow bad for this?  Because when storage constraints are so strict, you'd be insane to waste as much space as you use.
  • 11-21-2008 3:44 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    ASCII generally compresses down to about 1/10 its original size when zipped [citation needed] so I say take the <8MB ASCII file and zip it. 

    You might have better luck using PACKED-DECIMAL (yes I'm using the COBOL term for it) but I don't know anything about the compression rates for that.  It would start at <4MB and need to be compressed less.

    Or whatever PJH decided to call it.

    SpectateSwamp exposing aliens. Obviously the World needs SSDS


    [10:07] <fatdog> so from now on.. be sure to wear nice clean underwear
    [10:07] <mps> fatdog: That is simply not going to happen
  • 11-21-2008 3:58 PM In reply to

    • PJH
    • Top 10 Contributor
    • Joined on 02-14-2007
    • Newcastle, UK
    • Posts 1,256

    Re: Phone book of 1 000 000 entries in 1Mb

    bstorer:
    I honestly don't understand the point of this comment.
    OP mentioned they had a talk in IRC(!), they mentioned something similar to BCD, but didn't mention BCD.
    bstorer:
    Because we all know that, too.
    Given this particular cicumstance, I'd suggest 'we' don't.

    Feel free to correct me if you were on this IRC session.

    Abstinence makes the Church grow fondlers.

    - unknown
  • 11-21-2008 4:01 PM In reply to

    • PJH
    • Top 10 Contributor
    • Joined on 02-14-2007
    • Newcastle, UK
    • Posts 1,256

    Re: Phone book of 1 000 000 entries in 1Mb

    belgariontheking:
    Or whatever PJH decided to call it.

    BCD. I didn't decide the name. Before my time.

    2 digits per byte. Insufficient given the criteria.

    Abstinence makes the Church grow fondlers.

    - unknown
  • 11-21-2008 4:03 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    PJH:
    OP mentioned they had a talk in IRC(!)
    Is there some new definition of OP that doesn't mean "The guy (or post) that started this thread?"

    SpectateSwamp exposing aliens. Obviously the World needs SSDS


    [10:07] <fatdog> so from now on.. be sure to wear nice clean underwear
    [10:07] <mps> fatdog: That is simply not going to happen
  • 11-21-2008 4:10 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    PJH:

    bstorer:
    I honestly don't understand the point of this comment.
    OP mentioned they had a talk in IRC(!), they mentioned something similar to BCD, but didn't mention BCD.
    bstorer:
    Because we all know that, too.
    Given this particular cicumstance, I'd suggest 'we' don't.

    Feel free to correct me if you were on this IRC session.

    I still don't get what your point is other than flexing your big, sexy brain.  Who cares what it's called?  It doesn't matter if it's got a name or not, the math still comes up the same.  For the record, though, I was in the channel at the time, but not participating in the conversation.  The term "BCD" did not come up.  I know the term, but even had I been in the conversation, I wouldn't have bothered with it.  Welbog has since informed that he wasn't familiar with the term, which surprised me, because it's not an especially esoteric term.  But his point was the same: "I've probably heard the term. Like I said, terms aren't important to me. The fact that you need 4 bits to store 10 values is what's important, regardless of what it's called."

  • 11-21-2008 8:12 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    The only way I can think of is like this: there are 10^8 C 10^6 = 10^8! / (10^6! * (10^8 - 10^6)!) possible ways of choosing 1 million distinct 8-digit numbers in, say, ascending order. That's a number that requires slightly more than 8 million bits (Python and me make it 8079302.3053085813), less than a megabyte (okay, 'mebibyte'). So there's a unique 1-megabyte number for every possible list, and a unique list for (nearly) every 1-megabyte number. Of course, you'd have to do a bit of processing to actually search for the n-th phone number in a list encoded this way. Strange interview question though. How much time did they allow you?
  • 11-21-2008 9:25 PM In reply to

    • Nelle
    • Top 100 Contributor
    • Joined on 11-08-2007
    • graz.at.earth.milkyway.universe
    • Posts 285

    Re: Phone book of 1 000 000 entries in 1Mb

    glatapoui:
    I can't recall the answer they gave me, but it wasn't about strings, I guess.

    if you need to save 1000000 phone numbers in 1 Mb RAM, that gives you approx. 1 byte per 8 digit number...

    however to save one 8 digit number you need at least 3 bytes + change ...

    it all boils down to the entropy of the list, but in the worst case scenario (IMHO) it is impossible ...


    you could save the differences as welbog sugested, but I would modify his approach and require a presorted list in order to save the differences of the entire telephone numbers...

    In 99 milion possibles and 1 milion numbers, with the presorted list the largest difference is 99 which easily fits in one byte and there are couple of bits more which can be used to store keyframes. the downslide is that you always have to start from the begining (or the last keyframe) in order to return the exact number.

  • 11-22-2008 2:07 AM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Going back to the whole "sort the list and store only differences" idea, let's think of this a different way: an eight digit decimal number can store a result from 0 to 99,999,999. We're guaranteed there are 1 million phone numbers, which means that the greatest difference that could occur between two numbers is 98,999,999. That can be stored in 27 bits, but this is impossible - processors aren't designed to work with odd numbers of bits. What's needed is some way to use arbitrary-sized numbers and make sure the program knows where the dividing lines are.

    32 bits will store any number in our range, so it's the maximum. Then we could have three bytes, two bytes, or one byte as the other possible dividing lines. Let's suppose we precede the file with a series of 1 million two-bit flags indicating the length of the corresponding term. This adds up to 250 kilobytes, which could be squeezed into the memory of a modest computer. Then it just runs through the list, more or less.

    The problem is that this approach could never actually get below a megabyte. But it's a start. Any other ideas?

  • 11-22-2008 6:53 AM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Well, try this:

    * Sort the numbers

    * Convert to a list of differences from the previous number

    * Huffman code this list

    The chances are that most of the differences will be 1, which should code as a single bit. The entropy of the whole list should easily be less than 8 bits.

    Searching would be slow as you would need to scan from the start, but on any modern CPU that would still be very quick.

    Linux is not a code base. Or a distro. Or a kernel. It's an attitude. And it's not about Open Source. It's about a bunch of people who still think vi is a good config UI.

    Notice: Phorm, and its agents including ISPs collecting data on Phorm's behalf, are specifically forbidden from performing any processing or monitoring of the content of the above post. Hence, under the Regulation of Investigatory Powers Act 2000 any such attempt to profile this page by Phorm or its agents is illegal.
  • 11-22-2008 6:01 PM In reply to

    • Nelle
    • Top 100 Contributor
    • Joined on 11-08-2007
    • graz.at.earth.milkyway.universe
    • Posts 285

    Re: Phone book of 1 000 000 entries in 1Mb

    glatapoui:
    I guess one must use bit arrays, but I don't know how to go under this limit of 1Mb...

    glatapoui:
    If I remember correctly, it must be searchable (for example : give me the phone number #666 of 1 000 000 :)
     

    First of all thanks for posting this problem, this one was a really nice headbreaker ... 

    Presort the list  and store it in the following format :

    struct telephonebook_block {
    keyframe : 27;
    diff1 : 7;
    ...
    diff27 : 7;
    }

    one such block can hold 28 telephone numbers and requires 27 bytes to store them ...

    to store 1.000.020 numbers, you need 35 715 blocks and take up  964305 bytes ...

    the lookup is pretty fast since you only do 27 diff. additions in the worst case scenario ...

    the tricky part is getting the number since you can not create an array of bitfields (at least in C) so the solution would go something like this :

    struct strkeyframe
    {
    unsigned keyframe : 27;
    };

    struct strdiff
    {
    unsigned diff : 7;
    };

    uint32 get_number_at ( uint32 index, uchar8 * telephonebook )
    {
    uint32 return_value = 0;
    uint32 keyframe_index = index / 27;
    uint32 number_index = index % 27;
    uchar8 * block_start = (telephonebook + keyframe_index * 27); // --->
    uchar8 * diff_start = block_start + 2; // +++>

    // get keyframe
    strkeyframe * pkeyframe = (strkeyframe *) blockstart;
    return_value = (uint32) pkeyframe->keyframe;

    // got lucky
    if (!number_index) return return_value;

    // get the diffs
    for (uint32 diff_idx = 0; diff_idx < number_index; diff_idx++) {
    uint32 byte_step = (7 * diff_idx + 3) / 8;
    uint32 numblock = *( (uint32 *) (diff_start + byte_step) );
    numblock = numblock << ((7 * diff_idx - 4) % 8);

    return_value += (uint32) ((strdiff *) & numblock)->diff;
    }

    return return_value;
    }
  • 11-22-2008 8:18 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Nelle:
    one such block can hold 28 telephone numbers and requires 27 bytes to store them
    You only have 7 bits to store the difference. What if the first number is 00 000 000 and the second number is 00 000 129?

    If you're using a different keyframe for 00 000 129, then you're using 27 bytes to store 00 000 000. (27 bits for the keyframe 00 000 000 and all 27 7-bit fields have to be 0 (indicating this frame is over). All you need is 37038 (37038 numbers taking 27 bytes of space each > 1 megabyte) consecutive phone numbers differing by 129 and you're over 1 megabyte. 37038 * 129 = 4 777 902, which is well within the domain of 8-digit phone numbers.

    You made the assumption somewhere that no two consecutive can differ by more than 99, when the difference can be as much as 99 000 001 (first number is 00 000 000, second is 99 000 001, remaining numbers count up until 99 999 999). The average difference will be 100, but that's not a maximum. That's why I had brought up the fact that to store the differences you'd have to use a delimited storage mechanism to account for the variability in differences.

  • 11-22-2008 11:17 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Here's a way that's guaranteed to work, if you don't mind violating a few patents.

    Make a bitmap of all 100000000 possible phone numbers. 99% are not present, 1% are present. Now use arithmetic coding to compress the bitmap, which will store each "not present" in -log2(0.99) = .0144995696951151 bits, and each "present" in -log2(0.01) = 6.64385618977472 bits.

    In total, this will take 99000000 * -log2(0.99) + 1000000 * -log2(0.01) = 8079314 bits, which is 1009915 bytes or about 0.96 megabytes.

  • 11-23-2008 12:02 PM In reply to

    • Nelle
    • Top 100 Contributor
    • Joined on 11-08-2007
    • graz.at.earth.milkyway.universe
    • Posts 285

    Re: Phone book of 1 000 000 entries in 1Mb

    Welbog:
    You made the assumption somewhere that no two consecutive can differ by more than 99, when the difference can be as much as 99 000 001 (first number is 00 000 000, second is 99 000 001, remaining numbers count up until 99 999 999).
     

    you're right ... i wanted to avoid scaning from the begining, but it seems impossible ...

    Welbog:
    That's why I had brought up the fact that to store the differences you'd have to use a delimited storage mechanism to account for the variability in differences.

     Well GettinSadda and Goplat suggested Huffman coding and aritmetic coding, the question now is only how large would the decoding table be or would the encoded data + table fit in the 1 Mb ...


     

  • 11-23-2008 1:07 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Nelle:
    Well GettinSadda and Goplat suggested Huffman coding and aritmetic coding, the question now is only how large would the decoding table be or would the encoded data + table fit in the 1 Mb ...
    I can't speak for Huffman, but Goplat's idea looks completely feasible.

    Goplat:
    In total, this will take 99000000 * -log2(0.99) + 1000000 * -log2(0.01) = 8079314 bits, which is 1009915 bytes or about 0.96 megabytes.
    We have .04 megabytes to store the fact that [0-.99) -> 0 and [.99,1) -> 1. That seems like a trivially easy thing to store.

    Unless we're counting the size of the algorithm required to decode the numbers, I think this is a safe bet.

    The Huffman idea will probably cut down to under a megabyte in the average cases, but if there's a wide variety of distinct differences between the phone numbers, the dictionary is going to get large and the encodings themselves will get too large.

    I cast my vote for arithmetic encoding, except it doesn't preserve initial order because it requires what amounts to sorting. That wasn't part of the specification of the problem so hopefully it's OK.

  • 11-24-2008 5:19 AM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Yup, it seems that arithmetic coding will, just, manage it:

    Zero ratio = 99000000:1000000

    Random data generated in 9.48 seconds
    Encoder test completed in 4.74 seconds
    Decoder test completed in 5.20 seconds

    Used 1009922 bytes for coding 100000000 symbols

    Linux is not a code base. Or a distro. Or a kernel. It's an attitude. And it's not about Open Source. It's about a bunch of people who still think vi is a good config UI.

    Notice: Phorm, and its agents including ISPs collecting data on Phorm's behalf, are specifically forbidden from performing any processing or monitoring of the content of the above post. Hence, under the Regulation of Investigatory Powers Act 2000 any such attempt to profile this page by Phorm or its agents is illegal.
  • 11-24-2008 8:02 AM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Welbog:
    We're talking about this in IRC (#tdwtfmafia on irc.slashnet.org if you want to hop in).

    Thank you for the invite, but I was at work when I posted this. I couldn't just hop in. Anyway with the quality of the answers I saw here, I don't know if I would have been of any help.

    Welbog:
    But it's a pretty crazy question in the first place.

    Yes, I do agree. But I think it was more to check if I'm capable of seeing the problems that could come with such constraints and if I can program in a way that do not consummate too much resources (it was for a Game Developer position)

    PJH:

    Welbog:
    specialization of only dealing with the characters 0-9 (so 4 bits per character).

    And the phrase 'binary coded decimal' (BCD) didn't rear it's head?


    Thanks for the acronym, I guess I learned it somewhen but I could not remember of it. That was exactly what I started to use. But it wasn't enough.

    Autonuke:
    The only way I can think of is like this: there are 10^8 C 10^6 = 10^8! / (10^6! * (10^8 - 10^6)!) possible ways of choosing 1 million distinct 8-digit numbers in, say, ascending order. That's a number that requires slightly more than 8 million bits (Python and me make it 8079302.3053085813), less than a megabyte (okay, 'mebibyte'). So there's a unique 1-megabyte number for every possible list, and a unique list for (nearly) every 1-megabyte number. Of course, you'd have to do a bit of processing to actually search for the n-th phone number in a list encoded this way.

    Nice demonstration. I am not very used to handle probabilities, so these types of proof always make me think : "Oh yes, that sounds right, but how come did he find which number to use ?" ;-)

    Autonuke:
    Strange interview question though. How much time did they allow you?

    I think we talked about it for 20 minutes. But it was rather a question/answer discussion than a coding test.

    Nelle:
    First of all thanks for posting this problem, this one was a really nice headbreaker ...

    You're welcome. But I would rather like to thank YOU all, because of your responsiveness, your brilliant ideas, and your enthusiastic welcome !

    I'm sorry I can't answer all your posts, but the ideas and the concepts mentionned here made a good gymnastic to my brain :) If I knew some of them, I discovered others (or variants of some I knew) and I find that just fantastic. Again, thank you all.

     Glat'.

     

     

  • 11-27-2008 3:13 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Nelle:

    it all boils down to the entropy of the list, but in the worst case scenario (IMHO) it is impossible ...

    I think this has to be the correct answer. Without knowing more about the data set, you can't guarantee that it will fit into x number of bytes. Or maybe the real point of the question is actually: what compression scheme could you devise that would be more efficient than just zipping the file?
  • 11-27-2008 3:36 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    savar:
    I didn't read the thread at all
    Then why post in it?

  • 11-28-2008 1:22 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Welbog:

    savar:
    I didn't read the thread at all
    Then why post in it?

    I did read the entire thread... what's your point? One person mentioned entropy... every other idiot in the thread tried to invent clever compression algorithms that quite simply can't guarantee the desired result.
  • 11-28-2008 2:22 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    The original problem only specifies that you have 1,000,000 numbers that are each 8 digits long that need to be looked up in order. (E.g. what is the 500th number in the list?) So... you can't sort the numbers unless you also create some table that keeps track of the position for each number in the original list. Knowing that, there is no autocorrelation in this sequence of numbers. It could be 00000000, 99999999, 3928978, etc. So storing the differences between numbers doesn't seem very effective either. The entropy for an 8 digit number is 30 bits. This means that a perfect compression scheme will compress each phone number to exactly 30 bits -- not counting any lookup tables required to perform the decompression. 30 bits = 3.75 bytes 3.75 bytes * 1 million phone numbers = 3.75 million bytes 3,750,000 bytes / (1024^2) =~ 3.58 MB Without any further assumptions, an arbitrary list can't be compressed to less than 3.58 MB. IOW, no matter what compression scheme you define, there exists a list of phone numbers that will still use at least 3.58 MB. If the interviewer was looking for a concrete answer or --shudder-- pseudo code, then the question is poor and the company is a WTF. In this case, the "ideal" candidate makes some unfounded assumptions and generates an algorithm that works on a limited set of inputs but fails dramatically on a large range of inputs which they never tested. Most programmers have this mindset. If the point was to see how a person approaches a problem, then my original answer still stands. In that case, the question is reasonable for an interview since it challenges the interviewee's ability to grok requirements and refine the problem specification, or else make a reasoned explanation why the client's request cannot be done.
  • 11-28-2008 3:14 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    savar:
    I did read the entire thread... what's your point?
    My point is someone came up with a solution that is guaranteed to work. I suggest reading the thread again, this time paying special attention to

    Goplat:
    Here's a way that's guaranteed to work, if you don't mind violating a few patents.

    Make a bitmap of all 100000000 possible phone numbers. 99% are not present, 1% are present. Now use arithmetic coding to compress the bitmap, which will store each "not present" in -log2(0.99) = .0144995696951151 bits, and each "present" in -log2(0.01) = 6.64385618977472 bits.

    In total, this will take 99000000 * -log2(0.99) + 1000000 * -log2(0.01) = 8079314 bits, which is 1009915 bytes or about 0.96 megabytes.

    And
    GettinSadda:
    Yup, it seems that arithmetic coding will, just, manage it:

    Zero ratio = 99000000:1000000

    Random data generated in 9.48 seconds
    Encoder test completed in 4.74 seconds
    Decoder test completed in 5.20 seconds

    Used 1009922 bytes for coding 100000000 symbols

  • 11-29-2008 3:43 PM In reply to

    • PJH
    • Top 10 Contributor
    • Joined on 02-14-2007
    • Newcastle, UK
    • Posts 1,256

    Re: Phone book of 1 000 000 entries in 1Mb

    savar:
    Welbog:

    savar:
    I didn't read the thread at all
    Then why post in it?

    I did read the entire thread
    Either QOOC is in play, or you can't make up your mind.

     Which is it?

    Abstinence makes the Church grow fondlers.

    - unknown
  • 12-01-2008 12:08 AM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    PJH:

    savar:
    Welbog:

    savar:
    I didn't read the thread at all
    Then why post in it?

    I did read the entire thread
    Either QOOC is in play, or you can't make up your mind.

     Which is it?

    Neither. Welbog intentionally misquoted me to be funny. Read back to what I actually wrote: I proposed that the best answer is actually "not enough information".

    Welbog:
    My point is someone came up with a solution that is guaranteed to work. I suggest reading the thread again

    Thanks for the recap, but again you're missing my point. I read the entire thread, I even read and considered the posts you're referring to, and as you commented yourself:

    Welbog:
    I cast my vote for arithmetic encoding, except it doesn't preserve initial order because it requires what amounts to sorting. That wasn't part of the specification of the problem so hopefully it's OK.

    The problem spec is brief, but one of the constraints that is explicitly mentioned (in the 2nd post), is:

    glatapoui:
    Well, it's been a while... If I remember correctly, it must be searchable (for example : give me the phone number #666 of 1 000 000 :)

    Moreover, nowhere in the problem does it even say that the phone numbers are unique, which would shoot the bitmap to shit anyway.

    That's a requirement that everybody assumed because the original problem used the term "phone book"... but given that a list of 1,000,000 unlabeled and unordered numbers doesn't resemble any kind of phone book I've ever heard of, I didn't make the same assumption.

    The arithmetic encoding is a cool solution, but it solves a different problem. Maybe there was some refining of the requirements going on in IRC which I'm not privy too, but I stand by my original statement. I don't think the problem is solvable in its current form.

    So I still think the best answer to this question (if I was the interviewer) would be for the applicant to say, "Well as you state it, it's not possible... here's why... but if we make some reasonable assumptions like this and that, then the following solution would work."

    As somebody who manages other developers, I'm constantly frustrated by how often programmers create their own assumptions to a problem and engineer a solution that doesn't work, instead of simply pointing out that the requirements didn't make sense in the first pace.

  • 12-01-2008 12:22 AM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    Here's my previous breakdown, posted above but re-formatted here since the forums apparently don't turn line breaks into paragraph tags...

    The original problem only specifies that you have 1,000,000 numbers that are each 8 digits long that need to be looked up in order. (E.g. what is the 500th number in the list?) So... you can't sort the numbers unless you also create some table that keeps track of the position for each number in the original list.

    Knowing that, there is no autocorrelation in this sequence of numbers. It could be 00000000, 99999999, 3928978, etc. So storing the differences between numbers doesn't seem very effective either.

    The entropy for an 8 digit number is 30 bits. This means that a perfect compression scheme will compress each phone number to exactly 30 bits -- not counting any lookup tables required to perform the decompression.

    30 bits = 3.75 bytes

    3.75 bytes * 1 million phone numbers = 3.75 million bytes

    3,750,000 bytes / (1024^2) =~ 3.58 MB

    Without any further assumptions, an arbitrary list can't be compressed to less than 3.58 MB. IOW, no matter what compression scheme you define, there exists a list of phone numbers that will still use at least 3.58 MB.

    If the interviewer was looking for a concrete answer or --shudder-- pseudo code, then the question is poor and the company is a WTF. In this case, the "ideal" candidate makes some unfounded assumptions and generates an algorithm that works on a limited set of inputs but fails dramatically on a large range of inputs which they never tested.

    Most programmers have this mindset. If the point was to see how a person approaches a problem, then my original answer still stands. In that case, the question is reasonable for an interview since it challenges the interviewee's ability to grok requirements and refine the problem specification, or else make a reasoned explanation why the client's request cannot be done.

  • 12-30-2008 3:12 PM In reply to

    Re: Phone book of 1 000 000 entries in 1Mb

    If the numbers can be sorted (and have no duplicates), then how many distinct lists are possible? I'm not sure how to go about calculating that. If and only if there are fewer than 2^23 such lists, then the problem is possible. Otherwise, it is impossible.

     

    TRWTF is Community Server
Page 1 of 1 (34 items)
Powered by Community Server (Non-Commercial Edition), by Telligent Systems