• Personal Blog

    Posted on March 21st, 2014

    Written by Denis Finnegan

    Tesco Food Product Import

    So it’s finally finished. What started out as a pet project was expanded beyond a prototype and turned into a mammoth production project. Using Nick Lansley’s Tesco Grocery API and building on the know-how from my Realfood Scraper, I created a single Python script of about 5,000 lines of code that goes through each department, aisle and shelf on Tesco.com, parses the products and finally loads them into the database along with all their nutritional information.

    Tesco.com Python Web Scraper

    Sounds simple, right? Wrong. Honestly, you really wouldn’t believe the number of variations that exist. I found over 50 variations alone in the way that calorie and kilojoule information is displayed, and having the correct calories is critical, because they are sometimes needed to work out other nutrients or to figure out the serving weight when it is not stated. There were thousands of products missing nutritional data entirely and many more that did not list serving information of any kind. There were huge inconsistencies in attribute names and values: you could have decimals, commas, "gs" for grams or strange characters, e.g. 1g, 1gr, 100mg, 1 gram, and the list goes on.
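
    To give a flavour of the value clean-up described above, here is a minimal sketch of normalising weight strings such as 1g, 1gr, 100mg and 1 gram to grams. It is not the production code (which I can't share). The function name parse_grams, the unit table and the choice to treat a comma as a decimal separator are all assumptions made for illustration.

```python
import re

# Hypothetical unit table: multipliers that convert each spelling to grams.
_UNIT_TO_GRAMS = {
    "g": 1.0, "gr": 1.0, "gram": 1.0, "grams": 1.0,
    "mg": 0.001,
    "kg": 1000.0,
}

# A number (with "." or "," as the decimal mark), optional space, then a unit.
_QUANTITY_RE = re.compile(
    r"^\s*(\d+(?:[.,]\d+)?)\s*([a-z]+)\.?\s*$", re.IGNORECASE
)


def parse_grams(text):
    """Return the quantity in grams, or None if the text is not recognised."""
    match = _QUANTITY_RE.match(text)
    if match is None:
        return None
    number, unit = match.groups()
    factor = _UNIT_TO_GRAMS.get(unit.lower())
    if factor is None:
        return None
    # Assumption: a comma is a decimal separator, so "1,5g" means 1.5 g.
    return float(number.replace(",", ".")) * factor
```

    Returning None for anything unrecognised, rather than guessing, keeps the odd cases visible so they can be reviewed by hand later.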

    There were many more complex problems beyond the above, and things you’d never expect to pop up, like the issue of raw vs cooked products and the serving data for these. If you don’t know what I’m talking about, a few months down this rabbit hole and you will understand. Furthermore, you’d have a huge new level of respect for the kind of work the Data & Nutrition Manager, Dr. Claire Kehoe, does on my team.

    Anyway, at the end of it all, what we ended up with was a robust script that could handle basically any product thrown at it and even do clever things like figure out missing nutrients, either by calculating them from the ones that were available or based on the product type. Let’s take a simple example: water. If many of its nutrients are missing, they can be populated with zeros.
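
    The two kinds of gap filling mentioned above can be sketched roughly as follows. Again, this is an illustration only, not the real script: the nutrient keys, the ZERO_NUTRIENT_TYPES set and the fill_missing function are hypothetical names. The only hard fact used is the standard conversion of 1 kcal = 4.184 kJ.

```python
KJ_PER_KCAL = 4.184  # standard energy conversion factor

# Hypothetical nutrient keys, all expressed per 100 g.
NUTRIENTS = ("energy_kcal", "energy_kj", "fat_g", "carbohydrate_g", "protein_g")

# Hypothetical product types whose missing nutrients can default to zero.
ZERO_NUTRIENT_TYPES = {"water"}


def fill_missing(product_type, nutrients):
    """Return a copy of `nutrients` with gaps filled in where possible."""
    filled = dict(nutrients)

    # Rule 1: calculate one energy unit from the other when only one is given.
    if filled.get("energy_kcal") is None and filled.get("energy_kj") is not None:
        filled["energy_kcal"] = round(filled["energy_kj"] / KJ_PER_KCAL, 1)
    elif filled.get("energy_kj") is None and filled.get("energy_kcal") is not None:
        filled["energy_kj"] = round(filled["energy_kcal"] * KJ_PER_KCAL, 1)

    # Rule 2: for product types like water, anything still missing becomes zero.
    if product_type in ZERO_NUTRIENT_TYPES:
        for name in NUTRIENTS:
            if filled.get(name) is None:
                filled[name] = 0.0

    return filled
```

    Running the calculation rule before the product-type rule means a value that can be derived from real data is never overwritten by a default.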

    I was very proud of what I produced at the end of this project, though I spent perhaps a day or two per week on it for about 3 to 4 months. The data that was produced was of a very high quality and by the end of the process we reached coverage of 91% and accuracy of 76%. To put this in perspective, the Dunnhumby team looked at this data and estimated that only about 2% of the data online was perfectly usable. On our initial runs, after putting in some basic cleanup, we were hitting about 47%, and by the end of the import process we were up to about 63%. After this, we did a lot of extra semi-automated work to bring these scores up even further. My development team were part of this, as four further stages followed: brand matching products and genericizing them so that when you searched for coke, for example, you didn’t get all shapes and sizes, you just got coke. Unfortunately, I can’t share the code or any of the details around the whole process for obvious reasons, but I do just want to acknowledge this moment, as the products have now gone live to customers and, with a barcode scanner in place on our Tesco Health and Wellbeing App, it really has made it all worthwhile.

    SUMMARY

    I think everyone assumes there must be some business out there that has all the product data in the world, or most of it anyway, but there really isn’t. After waiting since 2009 to get a feed of products from Tesco.com, we got sick of waiting and just did it ourselves, as we so often have had to do. That said, I’m immensely proud of this piece of work and I’m hopeful that our customers will be too.
