Using Windows Presentation Foundation for a Simple Web Scrape

Having spent time delving into Windows Presentation Foundation (WPF) and finding out how things have changed in the last ten years, I thought I should actually get on and create a quick test application based on something I might use the technology for. This one retrieves product information from an Amazon page. That is not really necessary, since Amazon has a data feed, but it is based on something easily recognisable by anyone wishing to try it out for themselves.

This application took me about an hour to build, but that was only because of the learning curve; now I could easily create a similar application in less than ten minutes.

One feature that is new to me is the way WPF applications are laid out. In my previous incarnations, adding controls to an application meant either doing it completely manually (setting X and Y coordinates, height and width) or, more simply, dropping controls from the toolbox in the Visual Studio user interface (UI) onto a form and then setting all the properties. The designer is still there in Visual Studio, but XAML code is generated at the same time as you use it. So I decided I must stop trying to do things the old way and get to know what is going on in the XAML. Ideally, I want the quickest way of adding controls to an application while still allowing for some usability. XAML is incredibly easy to use and edit, in fact much quicker than the methods I have used before. When I first tried, I put the controls straight onto the window, but I quickly found that the Grid panel kept them organised for me.

Here is the XAML, which can be dropped into a new WPF project in the free Visual Studio 2010 Express if you want to try it out for yourself.

<Window x:Class="MainWindow"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    Title="MainWindow" Height="350" Width="525">
    <Grid>
        <Grid.RowDefinitions>
            <RowDefinition Height="Auto"/>
            <RowDefinition Height="Auto"/>
            <RowDefinition Height="*"/>
            <RowDefinition Height="*"/>
            <RowDefinition Height="*"/>
            <RowDefinition Height="Auto"/>
        </Grid.RowDefinitions>
        <Grid.ColumnDefinitions>
            <ColumnDefinition Width="Auto" />
            <ColumnDefinition Width="*" />
        </Grid.ColumnDefinitions>
        <Label Content="Address" Grid.Row="0" Grid.Column="0" />
        <TextBox Name="txtAddress" Grid.Row="0" Grid.Column="1" />
        <Label Content="Pattern" Grid.Row="1" Grid.Column="0" />
        <TextBox Name="txtPattern" Grid.Row="1" Grid.Column="1" />
        <Label Content="Web Page" Grid.Row="2" Grid.Column="0" />
        <WebBrowser Name="wbTest" Grid.Row="2" Grid.Column="1" />
        <Label Content="Source" Grid.Row="3" Grid.Column="0" />
        <TextBox Name="txtSource" Grid.Row="3" Grid.Column="1"/>
        <Label Content="Result" Grid.Row="4" Grid.Column="0" />
        <TextBox Name="txtResult" Grid.Row="4" Grid.Column="1"/>
        <Button Content="Fetch Page" Name="btnFetch" Grid.Row="5" Grid.Column="1" HorizontalAlignment="Left"/>
        <Button Content="GREP It" Name="btnGrep" Grid.Row="5" Grid.Column="1" HorizontalAlignment="Right"/>
    </Grid>
</Window>

I chose the WebBrowser control because I actually wanted to see the results of the page download. With it added to the XAML, I created a quick subroutine to make it navigate to the address I typed in, by adding the following code.

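Private Sub btnFetch_Click(ByVal sender As System.Object, ByVal e As System.Windows.RoutedEventArgs) Handles btnFetch.Click
    ' Point the browser at the address typed into the Address box.
    wbTest.Navigate(New Uri(txtAddress.Text))
End Sub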
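
The Address box accepts any text, so it may be worth checking what was typed before handing it to the browser. Here is a minimal sketch of such a check, written as a small console module so it can be tried on its own. The helper name TryGetWebAddress and the module name AddressCheck are made up for illustration and are not part of the original application.

```vb
Imports System

Module AddressCheck
    ' Hypothetical helper: returns True and fills result only when the typed
    ' text parses as an absolute http or https URL.
    Function TryGetWebAddress(ByVal typedText As String, ByRef result As Uri) As Boolean
        If Not Uri.TryCreate(typedText.Trim(), UriKind.Absolute, result) Then
            Return False
        End If
        Return result.Scheme = Uri.UriSchemeHttp OrElse result.Scheme = Uri.UriSchemeHttps
    End Function

    Sub Main()
        Dim address As Uri = Nothing
        ' A full URL with a scheme is accepted.
        Console.WriteLine(TryGetWebAddress("http://www.amazon.com/", address))
        ' Text without a scheme is not an absolute URL, so it is rejected.
        Console.WriteLine(TryGetWebAddress("www.amazon.com", address))
    End Sub
End Module
```

In the click handler, a check like this could guard the call to wbTest.Navigate, so that a mistyped address does nothing instead of raising an exception.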

I then wanted it to pull out the HTML and place it on screen once the page was loaded. In fact, I wasted twenty minutes when I forgot that you can’t get the document text from the WebBrowser control until the page has completely loaded, so notice that this code is in the LoadCompleted event.

Private Sub wbTest_LoadCompleted(ByVal sender As System.Object, ByVal e As System.Windows.Navigation.NavigationEventArgs) Handles wbTest.LoadCompleted
    ' Document is typed as Object in WPF, so this late-bound call relies on Option Strict Off.
    txtSource.Text = wbTest.Document.Body.InnerHTML
End Sub

OK, I need regular expressions now, so I give my application access to them in the .NET Framework by adding the following to the top of the source file.

Imports System.Text.RegularExpressions

Running the application and going to Amazon, I searched for something generic that I knew would return lots of products: ‘plasma tv’. Pressing Fetch Page on the URL for this search showed me which part of the source to run the regular expression on. In this case you can easily pull out the URL of the product, then an image URL for it, and then a product description.

So I came up with a loose, not really optimised but relatively simple to understand pattern to retrieve the information from the HTML.

<div id=srProductTitle_.*?><a href="(.*?)".*?src="(.*?)".*?>.*?>(.*?)<

With that expression as the pattern, the matching basically follows these steps. It looks complex but it really isn’t; you can use this method on the fly and capture items quickly with it.

  • Starts at the first DIV whose id begins with srProductTitle_ and moves forward, ignoring any characters, until it reaches the string “><a href="” (including the quotation mark)
  • Captures everything until it finds the next " (quotation mark): the URL of the product
  • Ignores everything from there until it finds the string “src="” (including the quotation mark)
  • Captures everything until it finds the next " (quotation mark): the URL of the product image
  • Ignores everything from there until it finds a “>”
  • Ignores everything until it finds another “>”
  • Captures the string until it finds a “<”: the product title
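
To see those steps in action without loading a real page, here is a minimal console sketch that runs the same pattern over a made-up fragment of HTML. The sample markup, the example.com addresses and the module name PatternDemo are assumptions for illustration only; real Amazon markup will differ.

```vb
Imports System
Imports System.Text.RegularExpressions

Module PatternDemo
    Sub Main()
        ' Made-up HTML shaped the way the pattern expects; not real Amazon output.
        Dim html As String =
            "<div id=srProductTitle_1 class=""title""><a href=""http://example.com/p/1"">" &
            "<img src=""http://example.com/img/1.jpg"" alt="""">" &
            "<span>Example 42in Plasma TV</span></a></div>"

        ' The pattern from the article, with quotation marks doubled for a VB string.
        Dim pattern As String =
            "<div id=srProductTitle_.*?><a href=""(.*?)"".*?src=""(.*?)"".*?>.*?>(.*?)<"

        Dim m As Match = Regex.Match(html, pattern,
                                     RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        If m.Success Then
            ' Group 0 is the whole match; groups 1 to 3 are the captures.
            For i As Integer = 1 To m.Groups.Count - 1
                Console.WriteLine(i.ToString() & ": " & m.Groups(i).Value)
            Next
        End If
        ' Prints:
        ' 1: http://example.com/p/1
        ' 2: http://example.com/img/1.jpg
        ' 3: Example 42in Plasma TV
    End Sub
End Module
```

Because every .*? is lazy, each "ignore" step stops at the first place where the next literal can match, which is why the captures come out as the nearest URL, image URL and title.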

Looking around at libraries on the web and writing a bit myself, I added a simple routine to pull out these captured strings and place them into the Result textbox.

Private Sub btnGrep_Click(ByVal sender As System.Object, ByVal e As System.Windows.RoutedEventArgs) Handles btnGrep.Click
    Dim strOutput As String = ""
    Dim iCount As Integer = 0

    Try
        Dim RegexObj As New Regex(txtPattern.Text, RegexOptions.Singleline Or RegexOptions.IgnoreCase)
        Dim MatchResults As Match = RegexObj.Match(txtSource.Text)
        ' Walk every match found in the page source.
        While MatchResults.Success
            iCount += 1
            Dim i As Integer
            ' Group 0 is the whole match, so the captures run from 1 to Count - 1.
            For i = 1 To MatchResults.Groups.Count - 1
                Dim GroupObj As Group = MatchResults.Groups(i)
                If GroupObj.Success Then
                    strOutput += Str(iCount) + " (" + Str(i) + ") " + GroupObj.Value + vbNewLine
                End If
            Next
            MatchResults = MatchResults.NextMatch()
        End While
        txtResult.Text = strOutput
    Catch ex As ArgumentException
        'ONO an error!?! Never!
        ' Thrown when the text in txtPattern is not a valid regular expression.
    End Try
End Sub

That’s it, and it works fine: a no-frills scraper that uses a regular expression pattern to capture three strings per match from the page at a URL. In my case, from Amazon, I got 24 product URLs, product image URLs and product titles.

I think I was most impressed with the speed at which I could put together a screen using XAML. As far as I was concerned it didn’t have to be pretty, just functional, and this particular one you would be able to create in a matter of a couple of minutes. The only reason it looks busy is that I decided to stick all the controls on a Grid.
