Extracting address information from non-formatted text

Posted on

Problem

I’m pretty new to large projects, and usually my code is neat and manageable. But now that I’m about a month deep, things are starting to seem sloppy (or at least not aesthetically pleasing).

Are there any good methods/things I should keep in mind when designing a function that is compact and neat?

Function is question: ‘postal’, ‘streettypecheck’, and ‘namecheck’ are regex

function addresscheck(serverpackage) {
    var i;
    var streetnumber = [];
    var postalcode = [];
    var potentialname = [];
    var streettype = [];
    var highlightdata = serverpackage.split(' ');
    for (i = 0; i <= highlightdata.length; i++) {
        if(!isNaN(highlightdata[i])) {
            streetnumber.push(highlightdata[i]);
        } else if (postalcheck.test(highlightdata[i])) {
            if (highlightdata[i].indexOf('-') !== -1 || highlightdata[i].indexOf(" ") !== -1 && highlightdata[i].length == 7) {
                postalcode.push(highlightdata[i]);
            } else if (highlightdata[i].indexOf('-') == 1 || highlightdata[i].indexOf(' ') && highlightdata[i].length == 6) {
                postalcode.push(highlightdata[i]);
            }
        } else if (streettypecheck.test(highlightdata[i])) {
            streettype.push(highlightdata[i]);
        } else if (namecheck.test(highlightdata[i])) {
            potentialname.push(highlightdata[i]);
        }
    }
    var addressinfo={
        postalcode: postalcode,
        streetnumber: streetnumber,
        streettype: streettype,
        potentialname: potentialname
    }
    return addressinfo;
}

Consider that all text is non formatted.

Sample data1: Street st jksh7dyasf B7Z-8U0 aaa

Sample data2: 166623gggdgdgdgdf 123 88f8f88d88dydydy B7Z-8U0 Street st 8888 aaaa

Solution

Loop variables

The first thing you can do to improve both the readability and the performance of your code is to use a loop variable instead of highlightdata[i]. If you’re developing for more modern browsers (= after IE11), you can use a for .. of loop:

var highlightdata = serverpackage.split(' ');
for (var token of highlightdata) {
    if (!isNan(token)) {
        // ...etc.

If you need to remain backwards compatible, you can just declare the variable inside the loop. Also note, you should declare the index variable (i) in the loop header as well.

var highlightdata = serverpackage.split(' ');
for (var i = 0; i <= highlightdata.length; i++) {
    var token = highlightdata[i];

    if (!isNaN(token)) {
        // ...etc.

Smaller functions

Ideally, you’ll want your functions to be as self-explanatory as possible. In the case of longer functions that perform multiple steps, this often means “outsourcing” the individual steps to smaller functions whose names describe what happens in that step. In your case, the main function could be rewritten as:

function checkAddress(serverPackage) {
    var streetNumber = [];
    var postalCode = [];
    var potentialName = [];
    var streetType = [];

    for (var token of serverPackage.split(' ')) {
        if (isStreetNumber(token)) {
            streetNumber.push(token);
        }
        else if (isPostalCode(token)) {
            postalCode.push(token);
        }
        else if (isStreetType(token)) {
            streetType.push(token);
        }
        else if (isName(token)) {
            potentialName.push(token);
        }
    }

    return {
        postalCode: postalCode,
        streetNumber: streetNumber,
        streetType: streetType,
        potentialName: potentialName,
    };
}

…and then you can define the individual functions isStreetNumber, isPostalCode, isStreetType and isName, all of which expect a single string arugment, and return a boolean value:

function isStreetNumber(token) {
    return !isNan(token);
}

function isPostalCode(token) {
    return (
        postalCheck.test(token)
        && (
            (token.indexOf('-') !== -1 || token.indexOf(" ") !== -1 && token.length == 7)
            ||
            (token.indexOf('-') == 1 || token.indexOf(' ') && token.length == 6)
        )
    );
}

function isStreetType(token) {
    return streetTypeCheck.test(token);
}

function isName(token) {
    return nameCheck.test(token);
}

(side note: are you sure your logical operators are grouped correctly? && has precedence over || much like multiplication has precedence over addition, so a || b && c is evaluated as a || (b && c), rather than (a || b) && c. It’s generally a good idea to parenthesize compound logical expressions, just to be safe.)

Naming conventions

You will also notice that I’ve changed most of your variable names. In JavaScript, we generally use CamelCase; this has absolutely no impact on your code’s performance, but it’s generally more legible than all lowercase. If you’re going to collaborate with other developers, they’ll most likely expect you to follow this convention. I’ve also renamed your main function from addresscheck to checkAddress; again, the convention is to start function names with the verb.

BUG

First you split by space and then you test if those chunks contain it. The answer will always be no. From the fact that you still test for it however, I guess that you actually expect them to be there, so some valid postal codes will not be detected.

Mistakes

  • You iterate one too many times,
  • In … || highlightdata[i].indexOf(' ') && … you probably forgot == 1,
  • Due to precedence of && over || and lack of parentheses your logical statement works differently than you probably intended,

Remarks

  • Use of for…of would benefit readability and performance,

  • No need to define 4 array variables that will later end up in an object — create an object with 4 arrays right away and push there directly,

  • Use strict equality === operator — it performs no type conversion and you avoid potential pitfails,

  • Use const and let rather than var — they won’t exist in the entire execution context and enable performance optimization,

  • highlightdata[i].indexOf('-') == 1 will at most go through the entire string — use highlightdata[i].charAt(1) === '-' instead,

  • !highlightdata[i].includes('-') is more idiomatic and concise than highlightdata[i].indexOf('-') !== -1,

  • No need to declare var i before the loop,

  • Naming convention: use camelCase,

  • 2 space and line break after } are more popular formatting choices,

  • Be consistent with the kind of quotes you use.

Rewrite

Effectively it works like the original code. Therefore it is broken as well.

const checkAddress = serverPackage => {
  const info = {postal: [], streetNumber: [], streetType: [], potentialName: []};

  for (const token of serverPackage.split(' ')) {
    if(!isNaN(token)) {
      info.streetNumber.push(token);
    }
    else if (postalCheck.test(token)) {
      if (token.charAt(1) === '-' || token.length === 6) {
        info.postal.push(token);
      }
      else if (!token.includes('-') || token.length === 7) {
        info.postal.push(token);
      }
    }
    else if (streetTypeCheck.test(token)) {
      info.streetType.push(token);
    }
    else if (nameCheck.test(token)) {
      info.potentialName.push(token);
    }
  }

  return info;
};

Leave a Reply

Your email address will not be published. Required fields are marked *