Scrape an infinite-scroll page

Problem

My scraper handles an infinite-scroll page, but it takes too long. It scrolls three times using hard-coded offsets, and I’m wondering whether there is something like a ScrollBottom() call that would remove the need for the repeated code.

Regarding the site in the example: scrolling is handled by the jQuery ScrollExtend plugin (goo.gl/Sq4vVx), which is triggered when the user scrolls past a particular tag. When that happens, a particular class is added to the tag, and it is removed once the pagination is done.

I think there’s room for improvement, both in the code and in performance.

"use strict";

var Xray = require('x-ray');
var phantom = require('x-ray-phantom');

var phantom_opts = {
    webSecurity: false,
    images: false,
    weak: false
};

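// Judging by how they are used, the callback's first argument carries the
// request context and the second is the chainable browser instance, so they
// are named accordingly here.
var x = Xray().driver(phantom(phantom_opts, function (ctx, nightmare) {
    nightmare
        .useragent("Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36")
        .goto(ctx.request.req.url)
        // Three fixed jumps down the page, each followed by a wait
        .scrollTo(4000, 0)
        .wait()
        .scrollTo(8000, 0)
        .wait()
        .scrollTo(12000, 0)
        .wait();
}));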

x('https://www.compraonline.grupoeroski.com/es/supermercado/2059698-Alimentos-Frescos/2059746-Carnes-y-aves/2059753-Pollo/', '.product_list li',
    [{
        name: '.description_1',
        unitPrice: '.description_2',
        image: '.image_line img@src',
        price: '.product_price_cont p',
        url: '.image_line a@href',
        volumen: '.description_1',
        medida: '.description_1'
    }])(function (err, products) {
        // Stop on error; products is not usable in that case
        if (err) {
            console.error(err);
            process.exit(1);
        }

        console.log(products.length);

        process.exit(0);
    });

Solution

I would first want to understand how the infinite scroll is actually implemented.

  • Do you understand which JavaScript events actually trigger new items being added?
  • Does it make more sense to simply trigger those events rather than worry about physically scrolling the browser?
  • Is the content being delivered via AJAX? Can you query the AJAX endpoint directly to get the data you want?
  • Is there anything in the AJAX response that tells you when you have reached the end of the list (no more items to be added)?
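
If the items do turn out to come from an AJAX endpoint, the scraping loop might look roughly like the sketch below. It is only an illustration: `collectAllPages` and `fetchPage` are hypothetical names, and treating an empty page as the end-of-list signal is an assumption that would need checking against the real response.

```javascript
"use strict";

// Collect items page by page until the endpoint reports that nothing is left.
// `fetchPage` is a hypothetical function supplied by the caller: it takes a
// 1-based page number and returns a Promise of an array of items.
async function collectAllPages(fetchPage, maxPages) {
    let items = [];
    for (let page = 1; page <= maxPages; page++) {
        const batch = await fetchPage(page);
        // Assumption: an empty page means the end of the list was reached.
        if (!batch || batch.length === 0) break;
        items = items.concat(batch);
    }
    return items;
}

// Stand-in for the real request, serving 25 fake products 10 at a time.
function fakeFetchPage(page) {
    const all = [];
    for (let i = 0; i < 25; i++) all.push({ name: "product " + i });
    return Promise.resolve(all.slice((page - 1) * 10, page * 10));
}

collectAllPages(fakeFetchPage, 50).then(function (products) {
    console.log(products.length); // prints 25
});
```

The `maxPages` cap is there so that a misread end-of-list signal cannot turn into an endless loop.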

Once you think these through, you might find a better way to approach the problem.
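
If scrolling really is the only option, the repeated scrollTo/wait pairs from the question could be folded into a loop that stops once the page no longer grows. The sketch below is not tied to any driver: `page` is a hypothetical adapter with `getHeight`, `scrollTo` and `wait` methods that would have to be mapped onto whatever the browser driver provides, and "height stopped growing" is an assumed end condition.

```javascript
"use strict";

// Keep scrolling to the bottom until the page height stops growing.
// `page` is a hypothetical adapter, not a real library object. It must provide:
//   getHeight() -> Promise<number>  current document height in pixels
//   scrollTo(y) -> Promise          scroll vertically to offset y
//   wait(ms)    -> Promise          pause so that new items can load
// Resolves with the number of scrolls that were performed.
async function scrollToBottom(page, options) {
    const pauseMs = (options && options.pauseMs) || 1000;
    const maxScrolls = (options && options.maxScrolls) || 50;

    let height = await page.getHeight();
    for (let i = 0; i < maxScrolls; i++) {
        await page.scrollTo(height);
        await page.wait(pauseMs);
        const newHeight = await page.getHeight();
        // Assumption: no growth after a scroll means the list is exhausted.
        if (newHeight <= height) return i + 1;
        height = newHeight;
    }
    return maxScrolls;
}

// Fake page that grows by 4000px on each of the first three scrolls.
function makeFakePage() {
    let height = 4000;
    let loadsLeft = 3;
    return {
        getHeight: function () { return Promise.resolve(height); },
        scrollTo: function (y) {
            if (y >= height && loadsLeft > 0) { height += 4000; loadsLeft--; }
            return Promise.resolve();
        },
        wait: function () { return Promise.resolve(); }
    };
}

scrollToBottom(makeFakePage()).then(function (scrolls) {
    console.log(scrolls); // prints 4: three scrolls that load items, one that does not
});
```

A fixed pause is the simplest thing to wait on but also the slowest. Waiting for the loading class mentioned in the question to disappear would probably be a tighter signal, if the driver allows checking for it.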
