• Personal Blog

    Posted on May 20th, 2013

    Written by Denis Finnegan

    Tags

    Python

    Back in 2013, I really felt I was losing touch with development. Aside from the Batch and VBScript code I had written in my System Administrator days, I really hadn't done any coding since college, when I did some VB6, Java and C#.Net. At the same time, we had just launched Tesco Health and Wellbeing and wanted to significantly increase the number of products we stored in our database, as well as the number of recipes we had. So I decided: why not see if one problem can't help me solve another?

    I often feel that you need something worthwhile to aim for with these kinds of little projects, and getting data in for the business seemed like a worthwhile endeavour.

    I had a look around and saw quite a bit about web scraping using Python, with plenty of samples of prewritten code. I also read that Python was typically used for a lot of back-end, server-side heavy lifting, which is definitely somewhere I'd be more comfortable given my own background.

    In the end, Python was my choice, and we've been great pals since. I loved that Python is dynamically typed and so very forgiving to code with, which was great for me starting out: I could declare a variable without having to worry about its type. I also loved how easy it was to grab and install libraries, and for my first outing in my web scraping adventures I used a couple of key ones that I think I'm very likely to use again in the future.

    URL Navigation

    I originally started using Mechanize for navigating to and interacting with web pages, and while the ease with which you could select and populate forms was fantastic, I found it still had limitations when you were more heavily reliant on a lot of page navigation. It just didn't have the depth I needed in this space.

    Instead, I moved on to a combination of urllib2 and Beautiful Soup, two fantastic libraries. urllib2 handles all my page navigation, and I then load the pages into Beautiful Soup, which does all the parsing I need. You can navigate using class names, attributes, plain text or CSS selectors, but most importantly you can move to parents, children or siblings within the code, which I found hugely powerful. Of course regex is never too far away from projects like these, so the standard library's re module got used a lot.
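The regex side is mostly small extractions. As a sketch of the kind of pattern the serves logic below relies on (the strings here are made-up examples, not actual page content):

```python
import re

# Pull a millilitre quantity like "1500ml" out of free text, the way the
# scraper's serves logic does with re.search('(\d{1,5})ml', ...).
serves_text = "Makes 1500ml of soup"
portion_text = "Serve in 250ml bowls"

total_ml = int(re.search(r'(\d{1,5})ml', serves_text).group(1))
portion_ml = int(re.search(r'(\d{1,5})ml', portion_text).group(1))

print(total_ml // portion_ml)  # prints 6: the number of servings
```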

    Still, it wasn't a difficult project overall: no database interaction, I just output the results to a pipe-delimited text file, and 300 or so lines of code later I had my little scraper up and running.
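As an aside, the script below imports the csv module but ends up writing the pipe-delimited rows by hand with f.write. The same output could come from csv.writer with a custom delimiter; a minimal sketch with made-up field values:

```python
import csv
import io

# Made-up row values for illustration; the real script collects these per recipe
row = ["Pea Soup", "12%", "245", "8%", "6.2"]

buf = io.StringIO()
writer = csv.writer(buf, delimiter='|', lineterminator='\r\n')
writer.writerow(row)

print(buf.getvalue())  # Pea Soup|12%|245|8%|6.2
```

One advantage of csv.writer over the hand-rolled loop: any field that happens to contain the delimiter gets quoted automatically, so the row stays parseable.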

    So, without further ado, here is my code. I wasn't much of one for commenting, so just shout if you want to ask me questions, or email me at ( d e n i s f i n n e g a n @ g m a i l . c o m ).

    Real Food Web Scraper Code:

    # -*- coding: utf-8 -*-
    import sys, traceback

    # Imported to delete files
    import os

    # Working with CSV files
    import csv

    # Regex processing
    import re

    # URL navigation
    import urllib2

    from bs4 import BeautifulSoup

    # Delete the last output file before starting (if it exists)
    if os.path.exists(r"C:\Python27\RealFood.txt"):
        os.remove(r"C:\Python27\RealFood.txt")

    # To track how many recipes have been processed
    i = 0

    # TESTING | to test, set testMe = "test"
    testMe = ""
    #5# "http://realfood.tesco.com/recipes/cream-of-pea-soup-with-feta.html"
    #5# "http://realfood.tesco.com/recipes/peanut-butter-and-choc-chip-flapjacks.html"
    #4# "http://realfood.tesco.com/recipes/british-asparagus-and-ricotta-pizza.html"
    #3# "http://realfood.tesco.com/recipes/triple-chocolate-brownies-with-raspberry-cream.html"
    #2# "http://realfood.tesco.com/recipes/pink-lemonade.html"
    #1# "http://realfood.tesco.com/recipes/pear-and-celeriac-soup-with-chives.html"
    testHTML = urllib2.urlopen("http://realfood.tesco.com/recipes/cream-of-pea-soup-with-feta.html").read()
    testSoup = BeautifulSoup(testHTML)

    html = urllib2.urlopen('http://realfood.tesco.com/recipes/ingredients/apples-and-pears-recipes.html').read()
    soup = BeautifulSoup(html)

    recipeListofURLs = []
    dropNavListofURLS = []
    listForRowWriting = []

    f = open(r"C:\Python27\RealFood.txt", "ab")
    f.write("Recipe Name|PercentageGDACals|Calories|PercentageGDASugar|Sugar|PercentageGDAFat|Fat|PercentageGDASats|Saturates|PercentageGDASalt|Salt|Serves|Preparation|Total Calories|Total Sugar|Total Fat|Total Saturates|Total Salt|Instructions|Image|Ingredient1|Ingredient2|Ingredient3|Ingredient4|Ingredient5|Ingredient6|Ingredient7|Ingredient8|Ingredient9|\r\n")

    def RepresentsInt(s):
        try:
            int(s)
            return True
        except ValueError:
            return False

    def scrape_recipe_page(soup):

        #############
        # SCRAPING  #
        #############

        # Recipe Name
        recipeName = soup.find('div', class_='breadcrumbs')
        print recipeName.contents[5].text.encode('ascii', 'ignore').replace("  ", " ").replace("recipe", "")
        listForRowWriting.append(recipeName.contents[5].text.encode('ascii', 'ignore').replace("  ", "").replace("recipe", "").replace("   ", ""))

        # Serving Contains

        ## Calories
        try:
            gdaAmountCalories = soup.find('span', attrs={'itemprop': 'calories'})
            print gdaAmountCalories.next_sibling.next_sibling.text + " - PercentageGDACals"
            listForRowWriting.append(str(gdaAmountCalories.next_sibling.next_sibling.text))
            print gdaAmountCalories.text + " - Calories"
            listForRowWriting.append(str(gdaAmountCalories.text))
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "0 - PercentageGDACals"
            listForRowWriting.append("0")
            print "0 - Calories"
            listForRowWriting.append("0")

        ## Sugar
        try:
            gdaAmountSugar = soup.find('span', attrs={'itemprop': 'sugarcontent'})
            print gdaAmountSugar.next_sibling.next_sibling.text + " - PercentageGDASugar"
            listForRowWriting.append(str(gdaAmountSugar.next_sibling.next_sibling.text))

            gdaAmountSugar = gdaAmountSugar.text.replace("g", "")
            print gdaAmountSugar + " - Sugar"
            listForRowWriting.append(str(gdaAmountSugar))
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "0 - PercentageGDASugar"
            listForRowWriting.append("0")
            print "0 - Sugar"
            listForRowWriting.append("0")

        ## Fat Content
        try:
            gdaAmountFat = soup.find('span', attrs={'itemprop': 'fatcontent'})
            print gdaAmountFat.next_sibling.next_sibling.text + " - PercentageGDAFat"
            listForRowWriting.append(str(gdaAmountFat.next_sibling.next_sibling.text))

            gdaAmountFat = gdaAmountFat.text.replace("g", "")
            print gdaAmountFat + " - Fat"
            listForRowWriting.append(str(gdaAmountFat))
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "0 - PercentageGDAFat"
            listForRowWriting.append("0")
            print "0 - Fat"
            listForRowWriting.append("0")

        ## Saturates
        # Also used to get the salt values later.
        # Need to declare global variables or the try/except errors out when it gets to salt.
        global gdaAmountSaltPercentage
        gdaAmountSaltPercentage = "0"
        global gdaAmountSalt
        gdaAmountSalt = "0"

        try:
            gdaAmountSaturates = soup.find('span', attrs={'itemprop': 'saturatedFatContent'})
            # Special case: grab the salt values before reassigning gdaAmountSaturates below
            gdaAmountSaltPercentage = gdaAmountSaturates.parent.next_sibling.next_sibling.contents[5].text.encode('ascii', 'ignore')
            gdaAmountSalt = gdaAmountSaturates.parent.next_sibling.next_sibling.contents[3].text.replace("g", "")
            # OK, continue
            print gdaAmountSaturates.next_sibling.next_sibling.text + " - PercentageGDASats"
            listForRowWriting.append(str(gdaAmountSaturates.next_sibling.next_sibling.text))

            gdaAmountSaturates = gdaAmountSaturates.text.replace("g", "")
            print gdaAmountSaturates + " - Sats"
            listForRowWriting.append(str(gdaAmountSaturates))

        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "0 - PercentageGDASats"
            listForRowWriting.append("0")
            print "0 - Sats"
            listForRowWriting.append("0")

        ## Salt
        try:
            print gdaAmountSaltPercentage + " - PercentageGDASalt"
            listForRowWriting.append(str(gdaAmountSaltPercentage))

            print gdaAmountSalt + " - Salt"
            listForRowWriting.append(str(gdaAmountSalt))

        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "0 - PercentageGDASalt"
            listForRowWriting.append("0")
            print "0 - Salt"
            listForRowWriting.append("0")

        # Serves
        try:
            serves = soup.find('li', class_='recipeServes tt')
            serves = serves.span.string
            if RepresentsInt(serves) is True:
                print "Serves1: " + str(serves)
                listForRowWriting.append(str(serves))
            else:
                # Serves isn't a plain integer (e.g. it's a volume like "1500ml"),
                # so derive it from the quantity in the instructions
                servesFromInstructions = soup.find('span', attrs={'itemprop': 'recipeinstructions'})
                servesFromInstructions = re.search('(\d{1,5})ml', str(servesFromInstructions)).group(1)
                servesNew = soup.find('li', class_='recipeServes tt')
                servesNew = servesNew.span.text
                servesNew = re.search('(\d{1,5})ml', servesNew).group(1)
                serves = int(servesNew) / int(servesFromInstructions)
                print "Serves2: " + str(serves)
                listForRowWriting.append(str(serves))

        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            if soup.find('li', class_='recipeServes tt') is None:
                print "Serves3: 0 (EXCEPTION!)"
                listForRowWriting.append("0")
            elif RepresentsInt(serves) is False:
                serves = str(serves)
                serves = re.search('(\d{1,5})', serves).group(1)
                print "Serves4: " + str(serves)
                listForRowWriting.append(str(serves))
            else:
                print "Serves5: 1 (EXCEPTION!)"
                listForRowWriting.append("1")
            #traceback.print_exc(file=sys.stdout)

        # Preparation
        preparation = soup.find('li', class_='recipeTakes tt')
        try:
            print "Preparation: " + preparation.span.string
            listForRowWriting.append(preparation.span.string)
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "Preparation: 0"
            listForRowWriting.append("0")

        # Totals
        try:
            totCals = float(gdaAmountCalories.text) * int(serves)
            print "total Calories = %1.f" % (totCals)
            listForRowWriting.append(str(totCals))
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "total Calories = 0"
            listForRowWriting.append("0")

        try:
            totSugar = float(gdaAmountSugar) * float(serves)
            print "total Sugar = %1.f" % (totSugar)
            listForRowWriting.append(str(totSugar))
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "total Sugar = 0"
            listForRowWriting.append("0")

        try:
            totFat = float(gdaAmountFat) * float(serves)
            print "total Fat Content = %1.f" % (totFat)
            listForRowWriting.append(str(totFat))
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "total Fat Content = 0"
            listForRowWriting.append("0")

        try:
            totSats = float(gdaAmountSaturates) * float(serves)
            print "total Saturates = %1.f" % (totSats)
            listForRowWriting.append(str(totSats))
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "total Saturates = 0"
            listForRowWriting.append("0")

        try:
            totSalt = float(gdaAmountSalt) * float(serves)
            print "total Salt = %1.f" % (totSalt)
            listForRowWriting.append(str(totSalt))
        except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
            print "total Salt = 0"
            listForRowWriting.append("0")

        # Instructions
        instructions = soup.find('span', attrs={'itemprop': 'recipeInstructions'})

        # Put all the instructions into their own list, then add that to the row list
        instructionsList = []
        for childInstruction in instructions.children:
            try:
                instructionsList.append(childInstruction.text.encode('ascii', 'ignore'))
                print childInstruction.text.encode('ascii', 'ignore')
            except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
                instructionsList.append(childInstruction.string)
                print childInstruction.string
        # Add the contents of the instructions list to the main list
        listForRowWriting.append(instructionsList)

        # Image
        image = soup.find('div', class_='halfCol')
        print image.img['src']
        listForRowWriting.append(image.img['src'])

        # Ingredients
        for eachIngredient in soup.find_all('span', attrs={'itemtype': 'ingredients'}):
            ingredient = eachIngredient.text.encode('ascii', 'ignore').replace('\n', '').replace('\r', '')
            print ingredient
            listForRowWriting.append(ingredient)

        ######################################################

        # To track how many recipes have been processed
        global i
        i = i + 1
        print "Recipe #: " + str(i)

        # Write the collected data to file: one pipe-terminated field per list item
        for j in range(0, len(listForRowWriting), 1):
            f.write(str(listForRowWriting[j]) + "|")
        f.write("\r\n")
        del listForRowWriting[:]

    def scrape_recipe_list(soup):

        ###############################
        # SCRAPING LINKS FOR RECIPES  #
        ###############################

        for eachItem in soup.find_all('a', class_="landscapeThumb"):
            recipeURL = eachItem.get('href')
            recipeListofURLs.append(recipeURL)

        # Next, visit each page
        for eachRecipe in recipeListofURLs:
            newURL = "http://realfood.tesco.com" + eachRecipe
            print newURL
            html1 = urllib2.urlopen(newURL).read()
            soup1 = BeautifulSoup(html1)
            scrape_recipe_page(soup1)

    def scrape_nav_droplist(soup):

        ######################################
        # SCRAPING THE NAVIGATION DROP-DOWN  #
        ######################################

        # Get all the category pages from the drop-down navigation
        dropNavList = soup.find('ul', class_="dropNav")

        for child in dropNavList.children:
            try:
                print child.a['href']
                dropNavListofURLS.append(child.a['href'])
            except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
                print "Exception"

        for eachdroplist in dropNavListofURLS:
            newParentURL = "http://realfood.tesco.com" + eachdroplist
            print newParentURL
            html = urllib2.urlopen(newParentURL).read()
            soup = BeautifulSoup(html)
            # First get all the recipes we need to scrape
            scrape_recipe_list(soup)

    ##################################################################################################################
    ############################# SEARCH LISTING RETRIEVAL FUNCTIONS #################################################
    ##################################################################################################################

    def define_source(passedInURL):

        # Load all the HTML from the page in question into soup to be parsed
        html = urllib2.urlopen(passedInURL).read()
        soup = BeautifulSoup(html)

        get_next_page(soup)
        get_search_result_links(soup)

    def get_next_page(soup):

        global nextPage
        nextPage = soup.find('li', class_="nextPage")
        nextPage = nextPage.a.get('href')
        print nextPage

    def get_search_result_links(soup):

        searchResultsList = soup.find('ul', class_="searchResults")
        temp = '#'

        for link in searchResultsList.find_all('a'):
            if link.get('href') == temp:
                print "skipping duplicate link"
            elif link.get('href') == '#':
                print "skipping dud # link"
            else:
                print link.get('href')
                newURL = "http://realfood.tesco.com" + link.get('href')
                print newURL
                html1 = urllib2.urlopen(newURL).read()
                soup1 = BeautifulSoup(html1)
                scrape_recipe_page(soup1)

            temp = link.get('href')

        else:
            # The for/else runs once the loop has finished this page's results
            print "finished this page's search results"
            try:
                define_source('http://realfood.tesco.com' + nextPage)
            except (ValueError, AttributeError, RuntimeError, TypeError, NameError):
                print "End of all search results"

    #########################################
    # ~ D I R E C T I V E ~ S E C T I O N ~ #
    #########################################

    # Only for single-page testing
    if testMe == "test":
        scrape_recipe_page(testSoup)
    else:
        ###################################
        # GET RECIPES FROM NAVIGATION BAR #
        ###################################
        # Get the nav list
        scrape_nav_droplist(soup)

        ###################################
        # GET RECIPES FROM SEARCH RESULTS #
        ###################################
        # This drives results from the search area of the site
        #define_source('http://realfood.tesco.com/recipes/search.html')
        define_source('http://realfood.tesco.com/recipes/search.html?page=606')

    f.close()
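One Python idiom in get_search_result_links above deserves a note: the else attached to the for loop. A loop's else block runs only when the loop completes without hitting a break, and since that loop never breaks, its else always runs once the page's links are exhausted. A minimal, self-contained illustration with made-up link values:

```python
# for/else: the else block runs only when the loop finishes without a break
def find_link(links, target):
    for link in links:
        if link == target:
            print("found " + target)
            break
    else:
        print("finished this page's search results")

find_link(["/a.html", "/b.html"], "/b.html")  # prints: found /b.html
find_link(["/a.html", "/b.html"], "/c.html")  # prints: finished this page's search results
```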
