Visual Web Ripper Duplicate Check
This information is only relevant when using a version 1 data extractor agent.
When extracting data from websites such as forums, it is often desirable to extract only new data that has been posted since the last time data was extracted. This can be achieved by cancelling data extraction when duplicate data is detected.
Visual Web Ripper saves reference data every time a project is run and the reference data is used to detect duplicate data between project runs.
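The idea of reference data can be pictured with a small sketch. This is only a conceptual model, assuming each extracted row is reduced to a single string key; the `ReferenceData` class and its members are hypothetical and are not part of the Visual Web Ripper API or its storage format.

```csharp
using System.Collections.Generic;

// Hypothetical model of reference data; not Visual Web Ripper's internal format.
public class ReferenceData
{
    private readonly HashSet<string> seenKeys;

    // Start from the keys that earlier project runs stored.
    public ReferenceData(IEnumerable<string> keysFromPreviousRuns)
    {
        seenKeys = new HashSet<string>(keysFromPreviousRuns);
    }

    // True when the key is already stored.
    public bool IsDuplicate(string key)
    {
        return seenKeys.Contains(key);
    }

    // Record a key so that a later run can recognise it.
    public void Remember(string key)
    {
        seenKeys.Add(key);
    }
}
```

A run would call `IsDuplicate` for each new row and `Remember` for every row it checks, so the next run has something to compare against.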
Visual Web Ripper can cancel an entire data table or only the duplicate data row when duplicate data is found. A data table always corresponds to a Visual Web Ripper template, but not all templates create a new data table. FormSubmit templates and templates defining a list create a new data table.
Extracting Data from a Forum
Incremental web scraping is best illustrated with an example. The following example explains how to extract data from www.gaiaonline.com/forum/gaming-discussion/f.4/
We will extract data from all topics on the first 100 pages in the forum and update the data on an hourly basis. Extracting data from 100 pages of topics will take quite a while. If we extract data from all topics each time we update the data, we will extract a lot of duplicate data. We will use the duplicate feature to ensure that we extract only new or updated topics.
First we need to design the project. We create a PageArea template to iterate through all the topics on one page and a PageNavigation template to iterate through all the pages. The PageArea template contains the topic type, title and last post date. The PageArea template also contains a link template that links through to the topic detail page and extracts some data from there.
The topic title is displayed in two different locations, depending on the topic type. We use an alternative content element for the title to make sure both locations are covered.
Now we need to decide how to detect duplicate data. The topic title and last post date can be used to detect duplicate data, so we set the Duplicate Check content option on the "title" and "last post date" content elements.
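Why both fields are needed can be shown with a short sketch. It assumes, purely for illustration, that the checked fields are combined into one comparison key; the `DuplicateKey` helper, the separator character and the trimming are all assumptions and do not describe how Visual Web Ripper compares rows internally.

```csharp
public static class DuplicateKey
{
    // Join the checked fields with a separator that is unlikely to occur in the data.
    // Trimming whitespace is an assumption made for this sketch.
    public static string Build(string title, string lastPostDate)
    {
        return title.Trim() + "\u001F" + lastPostDate.Trim();
    }
}
```

With a key like this, a topic that receives a new reply gets a new last post date and therefore a new key, so it is treated as updated rather than duplicate. A topic whose title and last post date are both unchanged produces the same key as before.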
Next, we need to decide which action to take when duplicate data is detected. The default action is to take no action, so we need to change the Duplicate Action option, which is found in the More Options tab. We need to change the option on the template that generates a new data table, which in this case is the PageArea template. We will change the action to CancelDataTable, which will cancel data extraction when duplicate data is first detected.
The next problem is the sticky topics at the top of the forum. These topics remain at the top regardless of whether they have been updated. Because we have told Visual Web Ripper to cancel data extraction when the first duplicate row is detected, Visual Web Ripper will cancel data extraction immediately without checking to see whether new non-sticky topics have been posted.
If we know approximately how many sticky posts are at the top of the forum, we can use the Min. CancelDataRow Checks option. This option specifies the minimum number of rows Visual Web Ripper will process before cancelling the data table. Duplicate data will still be removed, but Visual Web Ripper will continue to iterate through the data until it has processed this minimum number of data rows.
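The effect of a minimum number of checks can be sketched as a simple loop. This is a simulation under stated assumptions, not Visual Web Ripper code: `MinChecksSketch`, `Extract` and its parameters are hypothetical names, rows are represented by string keys, and a duplicate found within the first `minChecks` rows is assumed to drop only that row.

```csharp
using System.Collections.Generic;

public static class MinChecksSketch
{
    // Returns the keys of the rows that would be kept.
    public static List<string> Extract(IList<string> rowKeys, ISet<string> reference, int minChecks)
    {
        var kept = new List<string>();
        for (int i = 0; i < rowKeys.Count; i++)
        {
            bool isDuplicate = reference.Contains(rowKeys[i]);
            if (!isDuplicate)
            {
                kept.Add(rowKeys[i]);
                continue;
            }

            // The duplicate row is always dropped. The whole table is cancelled
            // only once more than minChecks rows have been processed.
            int rowsProcessed = i + 1;
            if (rowsProcessed > minChecks)
                break;
        }
        return kept;
    }
}
```

With two sticky topics at the top of the list and `minChecks` set to 2, the loop skips both sticky rows, keeps the new topics that follow and stops at the first old topic. With `minChecks` set to 0, it stops at the first sticky topic and keeps nothing.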
If we do not know how many sticky topics the forum will have, or if we want an approach that is more exact and reliable than guessing the maximum number of sticky topics, we can use a script to decide whether Visual Web Ripper should cancel the entire data table or only the data row. In this case, we could create a script that checks whether the topic type contains the text "Announcement:" or "Sticky:". The duplicate script is executed only when duplicate data is detected, and the script must then return the action to be taken.
using System;
using mshtml;
using VisualWebRipper;
public class Script
{
    public static DuplicateAction DecideDuplicateAction(WrDuplicateActionArguments args)
    {
        try
        {
            // Sticky topics and announcements: drop only this row and keep checking later rows.
            if (args.DataRow["type"] == "Sticky:" || args.DataRow["type"] == "Announcement:")
                return DuplicateAction.CancelDataRow;
            return DuplicateAction.CancelDataTable;
        }
        catch (Exception exp)
        {
            return DuplicateAction.NoAction;
        }
    }
}
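The script above compares the topic type for an exact match. If the extracted value might carry extra text around the marker, a containment check may be more forgiving. The helper below is a standalone sketch of that variant; `TopicTypeCheck` and `IsPinned` are hypothetical names, and the sketch assumes the topic type is available as a plain string.

```csharp
public static class TopicTypeCheck
{
    // True when the topic type marks a sticky topic or an announcement.
    public static bool IsPinned(string topicType)
    {
        if (string.IsNullOrEmpty(topicType))
            return false;
        return topicType.Contains("Sticky:") || topicType.Contains("Announcement:");
    }
}
```

A duplicate script could return the row-level action when a check like this is true and the table-level action otherwise.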
Visual Web Ripper saves duplicate reference data every time the project has finished extracting data. Visual Web Ripper loads any existing reference data into memory before it starts extracting data, so it can compare previously extracted data with the new data. After the project has run for a long time, the size of the reference data can become substantial. If we cancel data extraction when we first encounter duplicate data, we normally need reference data only from the last data extraction and not from the very first time we ran the project. We can check Discard old reference data to use reference data from only the last time we ran the data extraction project.
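The difference between keeping and discarding old reference data can be modelled in a few lines. This is a conceptual sketch only: `ReferenceRetention` and `NextReference` are hypothetical names, rows are represented by string keys, and the sketch assumes that every row checked during a run, including the duplicate that triggered cancellation, is stored as reference data for the next run.

```csharp
using System.Collections.Generic;

public static class ReferenceRetention
{
    // Builds the reference data that the next run would load.
    public static HashSet<string> NextReference(
        ISet<string> previous,
        IEnumerable<string> keysCheckedThisRun,
        bool discardOld)
    {
        var next = new HashSet<string>(keysCheckedThisRun);
        if (!discardOld)
        {
            // Keep everything from earlier runs as well; this set grows over time.
            next.UnionWith(previous);
        }
        return next;
    }
}
```

When old data is discarded, the stored set stays roughly the size of one run. When it is kept, the set grows with every run.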