Optimize cURL requests to Google API

Posted on

Problem

I have a table in MySQL with approx. 20 million rows and the following columns:

id | word_eng | word_indic

I have to translate the English word (the value in column word_eng) into Indian language (the value in column word_indic) using the Google Translate API.

I have written PHP code which spawns multiple cURL requests and fetches data from API and updates it into the table. But this process is quite slow; it takes about 100 to 200 words per second.

I am using RollingCurl for multiple cURL requests. I am running this as a cron job.

What’s the best way to make it as fast as possible?

<?php
include_once('db.php');
include_once('functions.php');
include_once('rolling-curl-master/RollingCurl.php');

$table = $argv[1];

$q = "SELECT * from $table where word_indic is null limit 500000";
$result = $conn->query($q); $n = 0;

$urls = array();

while ($row = $result->fetch_assoc())
{
    $id = $row['id']; 
    $word = rawurlencode(getName($row['name_eng']));

//getName is a simple function which does some trimming and cleaning up of string

    $url = 'https://www.google.com/inputtools/request?text='.rawurlencode($word).'&ime=transliteration_en_te&id='.rawurlencode($id);
    array_push($urls, $url);
}
//print_r($urls);

unset($url);


$rc = new RollingCurl("request_callback");

// the window size determines how many simultaneous requests to allow.  
$rc->window_size = 300;

foreach ($urls as $url) 
{
    // add each request to the RollingCurl object
    $request = new RollingCurlRequest($url);
    $rc->add($request);
}
$rc->execute();

function request_callback($response, $info) 
{
    // parse the page title out of the returned HTML
    if (preg_match("~<title>(.*?)</title>~i", $response, $out)) {
        $title = $out[1];
    }

    //echo "<b>$title</b><br />";
    //print_r($info);
    $parts = parse_url($info['url']); 
    parse_str($parts['query'], $query);
    $id = $query['id'];
    $text = $query['text'];

    //echo "<hr>";
    $trans = json_decode($response)[1][0][1][0];    

    global $conn; global $table; global $urls; global $n;   
    if ($trans != '' and !preg_match('/[a-z]/', $trans)) 
    { 
        $conn->query("update $table set word_indic='$trans' where id='$id'"); $n++;
    }
} 

?>

Table structure:

CREATE TABLE `words` (
 `id` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
 `word_eng` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
 `word_indic` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
 PRIMARY KEY (`id`),
 FULLTEXT KEY `word_eng` (`word_eng`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

Solution

Url building

$id = $row['id']; 
$word = rawurlencode(getName($row['name_eng']));
//getName is a simple function which does some trimming and cleaning up of string
$url = 'https://www.google.com/inputtools/request?text='.rawurlencode($word).'&ime=transliteration_en_te&id='.rawurlencode($id);

Could be simplified with the use of http_build_query() with encoding type PHP_QUERY_RFC3986 to replace the rawurlencode:

$row['text'] = getName($row['name_eng']);
//unset $row['name_eng'] or select it as "text" in the SELECT statement
$row['ime'] = 'transliteration_en_te'; //or move into SELECT fields in query?
$url = 'https://www.google.com/inputtools/request?'.http_build_query($row, '', '&', PHP_QUERY_RFC3986);

Obviously that will likely result in extra parameters tacked onto the URL (e.g. word_indic, word_eng). You could update the select query to specifically select those fields:

$q = "SELECT id, name_eng as text from $table where word_indic is null limit 500000";

And you could select that ime parameter as a hard-coded value, so as to not need to assign it to each row in the while, but then that would make the result set larger…

Global variables

I see this line:

global $conn; global $table; global $urls; global $n;  

In general it isn’t a good habit to be referencing variables globally. Obviously this is in the context of a callback function but there are ways to avoid that. I would suggest creating a wrapper class and storing the connection, table and other info in member/instance variables while calling the method using the array syntax.

The Singleton Pattern could be used to create a wrapper class on the database code and call a static method to get a connection to the database, instead of referencing $conn globally.

Also, $urls and $n don’t appear to be used after that line anyway…

Leave a Reply

Your email address will not be published. Required fields are marked *