Converting a raw list of IDs to a consolidated array of ids [closed]

Problem

I feel I have overcomplicated the process. What can I do to improve the following code?

ids = []

File.open(filePath, "r").each_line do |line|
  i += 1

  parts = line.split(',')

  parts[1] = parts[1].to_i
  parts[2] = parts[2].to_i

  # If the id doesn't exist in our ids array then create it and add the
  # second id else delete the second id and merge it into the new id
  unless defined? ids[parts[2]]
    # Instantiate the array first
    ids[parts[1]] = [] unless defined? ids[parts[1]]

    # Add the new id to the array
    ids[parts[1]].push(parts[2])
  else
    # Move over older ids that matched these to the new one
    if ids[parts[1]].nil?
      unless ids[parts[2]].nil?
        ids[parts[1]] = ids[parts[2]]
      else
        ids[parts[1]] = [parts[2]]
      end
    else
      unless ids[parts[2]].nil?
        ids[parts[2]].each do |id|
          ids[parts[1]].push(id)
        end
      end
    end

    # Insert the old id into the new array
    ids[parts[1]].push(ids[parts[2]])

    # Delete the old one
    ids.delete(parts[2]) unless ids[parts[2]].nil?
  end
end

The file I am processing is being generated by a deduping library, Duke.

I am trying to consolidate the matches it has generated into a grouped list of matched ids, so all common ids are grouped together. The file itself is 2.4G.

The point of consolidating this is that these are many purchases that have been run through a deduplication process to produce a single customer. I feel it would be easier to consolidate this list than to build up the customer and continually add purchases to it. This is a one-time process.

Here are the first 100 lines of the file I’m consolidating. The format is:

MATCH (-/+), ID1, ID2, PROBABILITY %

+,6,7,0.9997260166738423
+,3,4,0.9997260166738423
+,8,9,0.9997260166738423
+,9,8,0.9997260166738423
+,7,6,0.9997260166738423
+,6,10,0.9997260166738423
+,4,3,0.9997260166738423
+,10,6,0.9997260166738423
+,8,296,0.9986244841815681
+,6,39,0.9983391066412223
+,8,299,0.9986244841815681
+,7,10,0.9997260166738423
+,6,40,0.9983391066412223
+,9,296,0.9986244841815681
+,8,1101,0.9986244841815681
+,10,7,0.9997260166738423
+,6,1081,0.9983391066412223
+,6,1083,0.9983391066412223
+,10,39,0.9983391066412223
+,10,40,0.9983391066412223
+,10,1081,0.9983391066412223
+,10,1083,0.9983391066412223
+,8,1125,0.9997260166738423
+,8,1128,0.9997260166738423
+,9,299,0.9986244841815681
+,8,1132,0.9997260166738423
+,9,1101,0.9986244841815681
+,9,1125,0.9997260166738423
+,9,1128,0.9997260166738423
+,7,39,0.9983391066412223
+,8,1144,0.9986244841815681
+,8,1149,0.9986244841815681
+,7,40,0.9983391066412223
+,9,1132,0.9997260166738423
+,7,1081,0.9983391066412223
+,7,1083,0.9983391066412223
+,9,1144,0.9986244841815681
+,9,1149,0.9986244841815681
+,12,24781,0.9997260166738423
+,11,16,0.9999872532595235
+,17,15,0.9997260166738423
+,16,11,0.9999872532595235
+,7,36,0.9977532413823246
+,6,36,0.9977532413823246
+,10,36,0.9977532413823246
+,15,17,0.9997260166738423
+,18,560,0.99892382632337
+,59,56,0.9997260166738423
+,37,36,0.9997260166738423
+,37,1333,0.9997260166738423
+,37,1341,0.9997260166738423
+,37,12479,0.9997260166738423
+,37,19462,0.9997260166738423
+,37,19466,0.9997260166738423
+,70,64,0.9997260166738423
+,70,106,0.9997260166738423
+,27,8200,0.9999217037269025
+,28,397,0.9981390956560382
+,27,8229,0.9999217037269025
+,49,145,0.9991644138608996
+,49,19596,0.9998336350736409
+,49,250925,0.9991644138608996
+,64,70,0.9997260166738423
+,64,106,0.9997260166738423
+,26,22,0.9999217037269025
+,27,66061,0.9990737249986892
+,27,69613,0.9990737249986892
+,27,69617,0.9990737249986892
+,27,70011,0.9990737249986892
+,23,613,0.9999217037269025
+,27,70885,0.999849094020817
+,23,1186,0.9999217037269025
+,77,87,0.9997260166738423
+,23,1274,0.9999217037269025
+,22,26,0.9999217037269025
+,23,7603,0.9999217037269025
+,27,70946,0.9990737249986892
+,23,7759,0.9996066089693157
+,23,7766,0.9996066089693157
+,95,100,0.9997260166738423
+,95,12510,0.9997260166738423
+,23,12437,0.9996066089693157
+,23,12455,0.9996066089693157
+,23,32083,0.9999217037269025
+,39,40,0.9997260166738423
+,39,1081,0.9997260166738423
+,40,39,0.9997260166738423
+,39,1083,0.9997260166738423
+,40,1081,0.9997260166738423
+,40,1083,0.9997260166738423
+,39,36,0.9996291897994778
+,39,37,0.9996291897994778
+,40,36,0.9996291897994778
+,39,1333,0.9996291897994778
+,40,37,0.9996291897994778
+,39,1341,0.9996291897994778
+,40,1333,0.9996291897994778
+,39,1352,0.9997851370603088
+,114,108,0.9997260166738423
+,40,1341,0.9996291897994778

Solution

Your code is very, very broken. Technically, that means it’s not a question for CodeReview. However, I’ll review it anyway, only because it’s, well, it’s fascinatingly broken.


First, name your variables. Right now, it’s really difficult to keep track of things, because parts[1] and parts[2] are pretty opaque. They’re obviously integer IDs, but what’s their relation to each other?

Reading it felt something like this:

There’s an ID and an ID and an array, and if the ID – yeah, that one – is in the array, then add the ID – no, that one – but if the ID isn’t… wait

A comment like this doesn’t really help:

# If the id doesn't exist in our ids array then create it and add the
# second id else delete the second id and merge it into the new id

I’m sorry, but what? What is “the id”, and “the second id”, and “the new id”? All I have is parts[1] and parts[2].
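As a rough sketch of what naming could look like: the names below (parse_match, first_id, second_id) are placeholders of mine, based only on the MATCH, ID1, ID2, PROBABILITY format given in the question. Better names would say how the two ids actually relate.

```ruby
# Hypothetical helper: splits one line of the match file into its two ids.
# The leading match flag and trailing probability are ignored here.
def parse_match(line)
  _match, first_id, second_id, _probability = line.chomp.split(',')
  [first_id.to_i, second_id.to_i]
end

first_id, second_id = parse_match("+,6,7,0.9997260166738423\n")
# first_id is 6 and second_id is 7
```

With names like these, a comment can talk about first_id and second_id instead of “the id” and “the second id”.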

Secondly, don’t use unless...else. Ruby’s unless is great for single conditional statements like doStuff unless x, but if you’re going to have an else branch, just use a regular old if...else. Otherwise the else branch reads like a double negative: “If not, do this; else if not not, do this”. It gets confusing in a hurry.
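To illustrate, here is a small sketch of a nil check written as a plain if...else. The groups array and the group_status method are made up for this example and are not taken from the question’s code.

```ruby
groups = []        # hypothetical stand-in for an array indexed by id
groups[6] = [7]

def group_status(groups, id)
  # Positive condition first, so the else branch reads naturally.
  if groups[id].nil?
    "#{id}: no group yet"
  else
    "#{id}: grouped with #{groups[id].join(', ')}"
  end
end
```

The modifier form stays useful for one-liners such as groups[6] = [] unless groups[6].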

Now, as to why it’s actually broken, and not just hard to read:

defined? doesn’t do what you think it does. If you’ve got an array a = [1,2,3], and you try defined?(a[21315]) what do you get? The string "method". Which is truthy. So the entire first branch never runs. Checking for nil? is what you want to be doing, and you do that in other places – which, even if both approaches worked, also makes the code inconsistent.
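You can check this yourself in irb. The variable names below are only for the demonstration:

```ruby
a = [1, 2, 3]

defined_result = defined?(a[21315])  # "method": it only reports that a[...] is a method call
element        = a[21315]            # nil: there is nothing stored at that index

# Because "method" is a non-empty string, it is truthy,
# so `unless defined?(a[21315])` never runs its body.
body_ran = false
body_ran = true unless defined?(a[21315])
```

Here body_ran stays false, while element.nil? is true, which is the check the code actually needs.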

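And here’s why it’s really broken: Array#delete. It doesn’t do what you think it does either. ids.delete(x) removes every element that is equal to x; it does not remove the element at index x. Your elements are arrays (or nil), so an integer never matches, and the “old” entry is never actually deleted. The method that removes by index is Array#delete_at, and that would be even worse: it shifts all following elements down one place, and you’re relying on ids being equal to array indices. See the problem? You may have assigned something to ids[23], but after ids.delete_at(10), what you previously assigned as ids[23] is now ids[22]. And the next time you look up ID 23, you’ll get what used to be ids[24].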
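For reference, here is how the two deletion methods behave in plain Ruby: Array#delete(x) removes elements equal to x, while Array#delete_at(i) removes the element at index i and shifts everything after it down. The array contents are invented for the demonstration.

```ruby
ids = []
ids[2] = [20]
ids[3] = [30]
ids[4] = [40]
# ids is now [nil, nil, [20], [30], [40]]

by_value = ids.dup
deleted_value = by_value.delete(3)     # nil: no element equals 3, nothing is removed

by_index = ids.dup
deleted_item = by_index.delete_at(3)   # [30]: the element at index 3 is removed...
shifted      = by_index[3]             # [40]: ...and the old ids[4] now sits at index 3
```

Either way, an array whose indices double as ids cannot safely have entries removed like this; assigning nil to the slot keeps the other indices stable.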

So even if the code somehow worked, and chewed through your file, you’d get very, very wrong results.

Oh, and by the way, what is i doing? Why are you incrementing it?

Overall, I can’t tell you what to do instead, because frankly I’m not even sure what it’s supposed to do right now.
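That said, if I take the question’s description at face value (every id linked by a match, directly or through other ids, ends up in one group), one possible direction is a disjoint-set (union-find) structure. This is only a sketch under that assumption: the class IdGroups and its methods are names I made up, it treats matches as transitive, it ignores the probability column, and it has not been run against the real 2.4G file.

```ruby
# Disjoint-set (union-find) keyed by integer id. A Hash is used instead of
# an Array so that sparse ids such as 250925 do not create huge gaps.
class IdGroups
  def initialize
    @parent = {}
  end

  # Returns the representative id of the group that contains id.
  def find(id)
    @parent[id] ||= id
    root = id
    root = @parent[root] until @parent[root] == root
    # Path compression: point every id on the walked path at the root.
    until @parent[id] == root
      next_id = @parent[id]
      @parent[id] = root
      id = next_id
    end
    root
  end

  # Merges the groups containing a and b.
  def union(a, b)
    root_a = find(a)
    root_b = find(b)
    @parent[root_b] = root_a unless root_a == root_b
  end

  # Returns an array of groups, each an array of ids.
  def groups
    @parent.keys.group_by { |id| find(id) }.values
  end
end

sample = ["+,6,7,0.99", "+,7,10,0.99", "+,3,4,0.99", "+,4,3,0.99"]
id_groups = IdGroups.new
sample.each do |line|
  match, first_id, second_id, _probability = line.chomp.split(',')
  id_groups.union(first_id.to_i, second_id.to_i) if match == '+'
end
consolidated = id_groups.groups
```

For the real file, the sample array could be replaced with File.foreach(filePath), which yields one line at a time instead of loading the whole file. Whether every id fits in memory as a Hash key depends on how many distinct ids there are, which the question doesn’t say.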
