Problem
I have a table in MySQL with approx. 20 million rows and the following columns:
id | word_eng | word_indic
I have to translate the English word (the value in column word_eng) into an Indian language (the value in column word_indic) using the Google Translate API.
I have written PHP code which spawns multiple cURL requests, fetches data from the API, and updates the table. But this process is quite slow: it only processes about 100 to 200 words per second.
I am using RollingCurl for the multiple cURL requests, and I am running this as a cron job.
What’s the best way to make it as fast as possible?
<?php
include_once('db.php');
include_once('functions.php');
include_once('rolling-curl-master/RollingCurl.php');

$table = $argv[1];

$q = "SELECT * from $table where word_indic is null limit 500000";
$result = $conn->query($q);
$n = 0;

$urls = array();
while ($row = $result->fetch_assoc()) {
    $id = $row['id'];
    $word = rawurlencode(getName($row['name_eng']));
    //getName is a simple function which does some trimming and cleaning up of string
    $url = 'https://www.google.com/inputtools/request?text='.rawurlencode($word).'&ime=transliteration_en_te&id='.rawurlencode($id);
    array_push($urls, $url);
}
//print_r($urls);
unset($url);

$rc = new RollingCurl("request_callback");
// the window size determines how many simultaneous requests to allow.
$rc->window_size = 300;

foreach ($urls as $url) {
    // add each request to the RollingCurl object
    $request = new RollingCurlRequest($url);
    $rc->add($request);
}
$rc->execute();

function request_callback($response, $info)
{
    // parse the page title out of the returned HTML
    if (preg_match("~<title>(.*?)</title>~i", $response, $out)) {
        $title = $out[1];
    }
    //echo "<b>$title</b><br />";
    //print_r($info);
    $parts = parse_url($info['url']);
    parse_str($parts['query'], $query);
    $id = $query['id'];
    $text = $query['text'];
    //echo "<hr>";
    $trans = json_decode($response)[1][0][1][0];
    global $conn; global $table; global $urls; global $n;
    if ($trans != '' and !preg_match('/[a-z]/', $trans)) {
        $conn->query("update $table set word_indic='$trans' where id='$id'");
        $n++;
    }
}
?>
Table structure:
CREATE TABLE `words` (
  `id` varchar(50) COLLATE utf8_unicode_ci NOT NULL,
  `word_eng` varchar(100) COLLATE utf8_unicode_ci NOT NULL,
  `word_indic` varchar(100) COLLATE utf8_unicode_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `word_eng` (`word_eng`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
Solution
URL building
$id = $row['id'];
$word = rawurlencode(getName($row['name_eng']));
//getName is a simple function which does some trimming and cleaning up of string
$url = 'https://www.google.com/inputtools/request?text='.rawurlencode($word).'&ime=transliteration_en_te&id='.rawurlencode($id);
This could be simplified by using http_build_query() with encoding type PHP_QUERY_RFC3986, which replaces the rawurlencode() calls:
$row['text'] = getName($row['name_eng']);
//unset $row['name_eng'] or select it as "text" in the SELECT statement
$row['ime'] = 'transliteration_en_te'; //or move into SELECT fields in query?
$url = 'https://www.google.com/inputtools/request?'.http_build_query($row, '', '&', PHP_QUERY_RFC3986);
Obviously that will likely result in extra parameters being tacked onto the URL (e.g. word_indic, word_eng). You could update the SELECT query to fetch only the fields you need:
$q = "SELECT id, name_eng as text from $table where word_indic is null limit 500000";
And you could select that ime parameter as a hard-coded value in the SELECT fields, so you don't need to assign it to each row in the while loop, though that would make the result set larger…
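If you did go the hard-coded-ime route, the URL-building section could shrink to something like the sketch below. This is only a sketch: it reuses the name_eng column and the getName() function from your code, and assumes the rest of the script (RollingCurl setup, callback) stays as it is.

$q = "SELECT id, name_eng as text, 'transliteration_en_te' as ime from $table where word_indic is null limit 500000";
$result = $conn->query($q);

$urls = array();
while ($row = $result->fetch_assoc()) {
    // clean the word first, then let http_build_query() handle all the encoding
    $row['text'] = getName($row['text']);
    $urls[] = 'https://www.google.com/inputtools/request?'.http_build_query($row, '', '&', PHP_QUERY_RFC3986);
}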
Global variables
I see this line:
global $conn; global $table; global $urls; global $n;
In general it isn’t a good habit to reference variables globally. Obviously this is in the context of a callback function, but there are ways to avoid that. I would suggest creating a wrapper class, storing the connection, table and other info in member/instance variables, and passing the callback method using the array callable syntax.
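A minimal sketch of what that could look like is below. The class and method names are my own, and it assumes RollingCurl invokes its callback via call_user_func(), so an array callable such as array($this, 'requestCallback') is accepted; the callback body mirrors your existing one.

class Transliterator
{
    private $conn;
    private $table;
    private $n = 0;

    public function __construct(mysqli $conn, $table)
    {
        $this->conn = $conn;
        $this->table = $table;
    }

    public function run(array $urls)
    {
        // pass the method as an array callable instead of a global function name
        $rc = new RollingCurl(array($this, 'requestCallback'));
        $rc->window_size = 300;
        foreach ($urls as $url) {
            $rc->add(new RollingCurlRequest($url));
        }
        $rc->execute();
    }

    public function requestCallback($response, $info)
    {
        parse_str(parse_url($info['url'], PHP_URL_QUERY), $query);
        $trans = json_decode($response)[1][0][1][0];
        if ($trans != '' and !preg_match('/[a-z]/', $trans)) {
            // same update as before, but using instance state instead of globals
            $this->conn->query("update {$this->table} set word_indic='$trans' where id='{$query['id']}'");
            $this->n++;
        }
    }
}

Usage would then be along the lines of: $worker = new Transliterator($conn, $table); $worker->run($urls);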
The Singleton Pattern could be used to create a wrapper class on the database code and call a static method to get a connection to the database, instead of referencing $conn globally.
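A bare-bones sketch of that idea, assuming a mysqli connection like the one db.php presumably creates (the class name and credentials here are placeholders):

class Database
{
    private static $instance = null;

    // returns a shared mysqli connection, creating it on first use
    public static function connection()
    {
        if (self::$instance === null) {
            // placeholder credentials; reuse whatever db.php currently configures
            self::$instance = new mysqli('localhost', 'user', 'password', 'dbname');
        }
        return self::$instance;
    }

    private function __construct() {}
    private function __clone() {}
}

// anywhere a connection is needed, instead of "global $conn":
$conn = Database::connection();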
Also, $urls and $n don’t appear to be used after that line anyway…