Getting the Twitter account from various input formats

Problem

I have an input field where users can enter a Twitter account in any of several formats, and I want to extract the account name from it.

For example:

twitters = [
    "www.twitter.com/twitteruser1",
    "@twitteruser2",
    "twitteruser3",
    "https://twitter.com/twitteruser4",
    "https://www.twitter.com/twitteruser5",
    "www.twitter.com/twitteruser6",
    "http://www.twitter.com/twitteruser7",
    "http://www.twitter.com/twitteruser8",
    "twitter.com/twitteruser9"
]

The script that I’ve written to extract the data is the following:

twitters.each do |twitter|
    # URL form, e.g. "https://twitter.com/name"
    twitter_user = twitter.match(/twitter\.com\/([^\/.]*)$/)
    if twitter_user != nil
        puts twitter_user[1]
        next
    end

    # "@name" form
    twitter_user = twitter.match(/^@([^\/.]*)$/)
    if twitter_user != nil
        puts twitter_user[1]
        next
    end

    # if we get here, neither pattern matched, so print the input unchanged
    puts twitter
end

It actually works, outputting the following:

twitteruser1
twitteruser2
twitteruser3
twitteruser4
twitteruser5
twitteruser6
twitteruser7
twitteruser8
twitteruser9

But as I’m new to Ruby, I wanted to check for possible improvements.

Solution

I would put the regular expressions into a list:

TWITTER_PATTERNS = [
  /twitter\.com\/([^\/.]*)$/,  # "www.twitter.com/twitteruser1"
  /^@([^\/.]*)$/,              # "@twitteruser2"
]

This function can iterate over that list:

def twitter_user(twitter)
  TWITTER_PATTERNS.each do |pattern|
    return $1 if twitter =~ pattern
  end
  twitter
end

This clearly separates the “policy” (what patterns are used to extract twitter users) from the “mechanism” (the loop we use to apply the policy). A change to the function’s implementation won’t require a change to the patterns, and vice versa.
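As a rough illustration of that separation, supporting one more input format only means appending a pattern; the function body is untouched. The "twitter:name" format below is a made-up example, not something from the question, and the function uses MatchData instead of $1 purely as a stylistic choice.

```ruby
# Sketch only: the third pattern is a hypothetical extra format.
TWITTER_PATTERNS = [
  %r{twitter\.com/([^/.]*)$},   # "www.twitter.com/name"
  /^@([^\/.]*)$/,               # "@name"
  /^twitter:([^\/.]*)$/         # "twitter:name" (assumed, for illustration)
].freeze

# Returns the first capture of the first matching pattern,
# or the input unchanged when nothing matches.
def twitter_user(twitter)
  TWITTER_PATTERNS.each do |pattern|
    match = pattern.match(twitter)
    return match[1] if match
  end
  twitter
end

puts twitter_user("https://www.twitter.com/twitteruser5")
puts twitter_user("@twitteruser2")
puts twitter_user("twitter:twitteruser10")
puts twitter_user("twitteruser3")
```

The first pattern is written with %r{...} so the slash does not need escaping.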

You can also use a Ruby case statement to check the input against several regexes. A when branch runs only if its pattern matches, so the explicit nil check is no longer needed.

twitters.each do |twitter|
  case twitter
    # URL form
    when /twitter\.com\/([^\/.]*)$/
      puts $1

    # "@name" form
    when /^@([^\/.]*)$/
      puts $1

    # neither pattern matched, so print the input unchanged
    else
      puts twitter
  end
end

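When Ruby performs a regex match, the capture groups are assigned to the special variables $1, $2, $3, and so on. Despite the $ prefix, these are local to the current method and thread rather than truly global. See this question on StackOverflow for more details about the mechanism.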
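A small sketch of that behaviour, using the @ pattern from the question; the variable names are only illustrative. It also shows that a failed match resets the capture variables.

```ruby
handle = "@twitteruser2"

if handle =~ /^@([^\/.]*)$/
  first_capture = $1                    # set by the most recent successful match
  same_capture  = Regexp.last_match(1)  # the same value through the method form
end

# A match that fails clears the match data, so $1 becomes nil.
"plain" =~ /^@(.*)$/
after_failure = $1

puts first_capture
puts same_capture
puts after_failure.inspect
```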

Note on your regex: if you want the capture group to always contain something, change the asterisk to a plus: ([^/.]+). With the asterisk, “www.twitter.com/” matches the URL pattern with an empty capture. With the plus, that string falls through to the default case.
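The difference can be checked directly. This is a minimal sketch; the constant names STAR and PLUS are made up for the comparison.

```ruby
STAR = %r{twitter\.com/([^/.]*)$}   # zero or more characters after the slash
PLUS = %r{twitter\.com/([^/.]+)$}   # at least one character after the slash

input = "www.twitter.com/"

star_result = input[STAR, 1]  # the pattern matches and captures an empty string
plus_result = input[PLUS, 1]  # no match at all, so the result is nil

puts star_result.inspect
puts plus_result.inspect
```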

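Another option is to chain the lookups with ||, using String#[] so that each step yields the captured name (or nil) rather than a MatchData object:

twitters.each do |twitter|
    puts twitter[/twitter\.com\/([^\/.]*)$/, 1] ||
         twitter[/^@([^\/.]*)$/, 1] ||
         twitter
end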

It’s shorter, and I’d say at least equally readable. A chain of || operators returns the first truthy value.
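One detail worth keeping in mind with this approach: in Ruby only nil and false are falsy, so an empty string counts as a hit. The sketch below wraps the chain in a method (the name first_truthy_handle is hypothetical) and uses the plus quantifier so that an empty capture cannot win.

```ruby
# Returns the first non-nil capture, or the input itself as a fallback.
def first_truthy_handle(input)
  input[%r{twitter\.com/([^/.]+)$}, 1] ||
    input[/^@([^\/.]+)$/, 1] ||
    input
end

puts first_truthy_handle("twitter.com/twitteruser9")
puts first_truthy_handle("@twitteruser2")
puts first_truthy_handle("twitteruser3")

# An empty string is truthy, so || does not skip over it.
empty_wins = "" || "fallback"
puts empty_wins.inspect
```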
