Data from the web is necessary to help make decisions about creating fresh content for a web site. Doing this manually wastes valuable time, and the current scraping and automation tools are both dubious and over-intensive. So a method is needed that harvests this information without causing problems for the sites it draws from.
Thinking about the architecture needed, I have a model forming in my head for downloading the pages. The very basic way of doing this in the Microsoft development world is to use either a WebBrowser control or the WebRequest calls. The WebBrowser control gives visibility of the page inside an application, while the WebRequest calls give control over lower-level details such as cookie management. Either way, the model I am aiming at is something standalone: it accepts requests for pages from any of my applications, new or existing, and returns them according to whatever internal rules it has about when and how each call is made.
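To make the low-level side concrete, here is a minimal sketch of such a fetch, using Python's standard library as a stand-in for the .NET WebRequest calls mentioned above. Cookie management is exactly the kind of lower-level detail those calls expose; the function name and defaults here are illustrative only.

```python
import urllib.request
import http.cookiejar

def fetch(url, timeout=10.0):
    """Fetch a page as text, carrying cookies across requests in one jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)  # lower-level cookie control
    )
    with opener.open(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset)
```

In a real standalone service this function would sit behind the queue described next, never be called directly by the requesting applications.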
This standalone application could accept thousands of requests for pages anywhere on the Internet, but it will ensure, for example, that no particular site is called more than once within a set period of time. Ideally it decides which of its many other queued calls it can make while waiting on the throttled ones. The challenge is to keep this as simple as possible without losing any of that essential functionality. The effect on the calling applications is that they must be written knowing the data they ask for may not arrive for hours or even days, so everything must be asynchronous.
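The per-site rule above can be sketched as a queue that skips over throttled hosts and serves whichever pending request is allowed right now. This assumes "a set period" means a fixed minimum delay per host; the class and constant names are mine, not from the design.

```python
import time
from urllib.parse import urlparse

HOST_DELAY = 30.0  # assumed: minimum seconds between calls to the same host

class PageQueue:
    """Holds pending URLs; pop() returns one whose host is not throttled."""

    def __init__(self, delay=HOST_DELAY):
        self.delay = delay
        self.pending = []      # URLs waiting to be fetched, in arrival order
        self.last_call = {}    # host -> time of the most recent call

    def push(self, url):
        self.pending.append(url)

    def pop(self, now=None):
        """Return the first URL whose host is clear, or None if all wait."""
        now = time.monotonic() if now is None else now
        for i, url in enumerate(self.pending):
            host = urlparse(url).netloc
            if now - self.last_call.get(host, -self.delay) >= self.delay:
                self.last_call[host] = now
                return self.pending.pop(i)
        return None  # every queued host was called too recently
```

A caller would loop on `pop()`, sleeping when it returns None; that loop is where the "which other call can I make while waiting" decision happens.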
So a rules-based queuing system is what is needed. If one doesn't already exist it may be worthy of an Open Source project, though the time pressure of maintaining and evolving it may prevent that.
The other side to this functionality is managing failure. What if the page is not available for any reason? And what do you do if the structure of the page has changed? Obviously there is a need to determine the quickest possible method of recovery.
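For the "page not available" case, one common recovery tactic is retrying with exponential backoff before giving up and reporting the failure back through the queue. The attempt count and delays below are assumed values for illustration.

```python
import time

def fetch_with_retry(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait 1s, 2s, 4s... then retry or give up."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * (2 ** attempt))  # back off before retrying
    raise last_error  # recovery failed; let the queue record it
```

The structure-changed case is harder, since the fetch succeeds but the extraction fails; that failure has to be detected downstream and flagged for a human.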
The bit in the middle of the model that takes the most thinking about is going from the data collected in the database to data prepared and ready for the web. Clearly the means of manipulating the data to form the content needs to be well defined, or possibly even done manually. The ideal result of this system is content and information made available in a form that can be used for writing up fresh content.
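As a rough sketch of that middle step, the collected rows could be rendered into a plain-text draft that a writer then reworks into fresh content. The schema (`title`, `price`, `url`) and the output shape are assumptions for illustration, not part of the actual design.

```python
import sqlite3

def draft_summary(conn):
    """Render collected items as a plain-text draft for a writer to rework."""
    rows = conn.execute(
        "SELECT title, price, url FROM items ORDER BY price"
    ).fetchall()
    lines = [f"- {title} at {price:.2f} ({url})" for title, price, url in rows]
    return "Collected items:\n" + "\n".join(lines)
```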