Tokenizing a large document

Problem

I’m currently trying to process a corpus of a million patent text files, which contain about 10k non-unique words on average. My current data pipeline works as follows:

  1. Load the patent texts as strings from a local sqlite database table
  2. Tokenize¹ each document and save the results in a new table
  3. Create a dictionary containing all unique words in the corpus
  4. Train a tf-idf model with the tokenized documents

¹Tokenization means taking a document text (string) as input and returning a list of every word in the document (duplicates allowed). Words tend to be separated by spaces, special characters, numbers, etc.; the regex in my code has served this purpose quite well.

In my data pipeline, I identified the tokenize function as my bottleneck; the relevant part is provided in my MWE below:

import re
import urllib.request
import time

url='https://raw.githubusercontent.com/mxw/grmr/master/src/finaltests/bible.txt'
doc=urllib.request.urlopen(url).read().decode('utf-8')

PAT_ALPHABETIC = re.compile(r'[^\W\d]+')  # runs of word characters that are not digits

def tokenize(text):
    matches=PAT_ALPHABETIC.finditer(text)
    for match in matches:
        yield match.group()

def preprocessing(doc):
    tokens = [token for token in tokenize(doc)]
    return tokens


start_time = time.time()
preprocessing(doc)
print("--- %s seconds ---" % (time.time() - start_time))

Solution

A small performance gain can be had by yielding directly from the iterator instead of having a for loop do it, using the yield from keywords, and by using list() instead of a list comprehension:

def tokenize2(text):
    # note: this yields re.Match objects rather than strings;
    # call .group() on each match if you need the token text itself
    yield from PAT_ALPHABETIC.finditer(text)

def preprocessing2(doc):
    return list(tokenize2(doc))

For the given example document this gives about a 15% speed-up:

In [15]: %timeit preprocessing(doc)
335 ms ± 2.29 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [16]: %timeit preprocessing2(doc)
287 ms ± 2.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Slightly faster still is dropping the preprocessing function and returning all tokens directly as a list of strings (this avoids one function call and lets the re module build the list internally):

def tokenize3(text):
    # findall returns the matched strings directly, with no Match objects in between
    return PAT_ALPHABETIC.findall(text)

This is about 35% faster than the code in the OP:

In [21]: %timeit tokenize3(doc)
217 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Other than that, we can’t really help you without seeing more of your script. You can probably parallelize this task by scanning multiple documents in parallel, and especially by making downloading and scanning asynchronous, so that you scan a document as soon as it finishes downloading while the next document(s) are already being downloaded in the background.
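As a rough sketch of the first idea (iter_documents below is a hypothetical stand-in for however you pull texts out of your sqlite table), multiprocessing lets you tokenize several documents at once:

import re
from multiprocessing import Pool

PAT_ALPHABETIC = re.compile(r'[^\W\d]+')

def tokenize3(text):
    return PAT_ALPHABETIC.findall(text)

def iter_documents():
    # hypothetical: replace with your sqlite reads
    yield "first patent text ..."
    yield "second patent text ..."

if __name__ == '__main__':
    with Pool() as pool:  # one worker process per CPU core by default
        for tokens in pool.imap_unordered(tokenize3, iter_documents(), chunksize=64):
            pass  # e.g. write the token list to the results table

Whether this pays off depends on document size: each document has to be pickled over to a worker process, so the per-document regex work needs to outweigh that transfer cost.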

I would guess that you’re never going to take another 50% off. You could consider pushing the problem down into C or Haskell or Awk or whatever and then binding that more tightly optimized implementation back into Python; IDK.

As for more incremental improvements, it depends a bit how the thing is going to be used.

You flagged the question for multi-threading. That certainly might help, but you’ll have to figure out how to divvy up the job. Maybe there are natural split-off points that can be identified without actually reading the subject data, in which case any parallelization system will probably work. Maybe you can figure out the size of the subject on disk and open multiple readers against that file starting at evenly spaced offsets; in this case you’ll need to figure out the fence-posting when you combine the results of each of the workers.
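For the single-big-file case, a minimal sketch of that offset idea (chunk_offsets and tokenize_range are illustrative names, and this assumes plain UTF-8 text on disk; each boundary is nudged forward to the next whitespace byte so no token is split between workers):

import os
import re

PAT_ALPHABETIC = re.compile(r'[^\W\d]+')

def chunk_offsets(path, n_chunks):
    # evenly spaced byte offsets, each moved forward to the next whitespace
    size = os.path.getsize(path)
    offsets = [0]
    with open(path, 'rb') as f:
        for i in range(1, n_chunks):
            f.seek(i * size // n_chunks)
            while True:
                byte = f.read(1)
                if not byte or byte.isspace():
                    break
            offsets.append(f.tell())
    offsets.append(size)
    return list(zip(offsets, offsets[1:]))

def tokenize_range(path, start, end):
    # each worker reads and tokenizes only its own slice of the file
    with open(path, 'rb') as f:
        f.seek(start)
        text = f.read(end - start).decode('utf-8', errors='replace')
    return PAT_ALPHABETIC.findall(text)

Each (start, end) pair can then be handed to a worker, e.g. via the Pool shown above.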

I notice that the function preprocessing just wraps the iterable results up in a list. Is that necessary? If you can avoid ever materializing the whole 10k-item list in memory, that will likely help some.
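For example, if the tokens are only consumed once, to count words for the dictionary, the generator can be fed straight into a counter and the list never needs to exist (collections.Counter here is just one way to build a vocabulary):

from collections import Counter

word_counts = Counter(tokenize(doc))  # consumes the generator one token at a time
vocabulary = word_counts.keys()       # every unique word, no full token list needed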

Similarly, you’re reading the whole file (HTTP payload) as a single string. Both file handles and HTTPResponse objects let you read just a chunk at a time; this would let you handle the subject data as a lazy iterable (similar to the way you’re yielding from .finditer()). How much of an advantage this is for you will depend on the underlying implementation, and you’ll have to be careful about fence-posting again. This may also be a good way to break off jobs for parallelization.
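A minimal sketch of that chunked reading (tokenize_stream is an illustrative name; this assumes a text-mode source whose .read(n) returns str, so an HTTP body would need decoding first; the possibly incomplete final token of each chunk is carried into the next one so nothing is lost at the boundaries):

def tokenize_stream(fileobj, chunk_size=2 ** 16):
    # fileobj: anything with .read(n) returning str, e.g. an open text file
    tail = ''
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            yield from PAT_ALPHABETIC.findall(tail)  # flush the last partial token
            return
        text = tail + chunk
        tail = ''
        for match in PAT_ALPHABETIC.finditer(text):
            if match.end() == len(text):
                # the token might continue in the next chunk; carry it over
                tail = match.group()
            else:
                yield match.group()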
