What is a Web Crawl?
A web crawler is a robot program that systematically browses web pages to learn what each page on a website is about, so that this information can be indexed, updated, and retrieved when a user makes a search query. Web crawlers begin with a specific set of known pages, then follow hyperlinks from those pages to new pages.
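To illustrate the "start from known pages, then follow hyperlinks" idea, here is a minimal sketch in Python. It is not the crawler used for this request. The `LINKS` table and the example.org addresses are made up for illustration and stand in for real pages, so nothing is downloaded.

```python
from collections import deque

# Hypothetical link graph standing in for real pages: page -> links found on it.
LINKS = {
    "https://example.org/": ["https://example.org/about", "https://example.org/blog"],
    "https://example.org/about": ["https://example.org/"],
    "https://example.org/blog": ["https://example.org/blog/post-1"],
    "https://example.org/blog/post-1": [],
}


def crawl(seeds, links):
    """Visit the seed pages first, then every page reachable by following links."""
    visited = []
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        visited.append(page)  # a real crawler would fetch and index the page here
        for target in links.get(page, []):
            if target not in seen:  # never queue the same page twice
                seen.add(target)
                queue.append(target)
    return visited


order = crawl(["https://example.org/"], LINKS)
print(order)
```

The `seen` set is what keeps the crawl from looping forever when pages link back to each other, as the "about" page does here.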
{FYI: Google and Bing crawl the internet continuously on a very large scale, and their indexes determine what you actually see in search results. This crawl will also observe robots.txt.}
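For readers wondering what "observing robots.txt" means in practice, here is a small sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleCrawler` agent name, and the example.org addresses are assumptions for illustration only. The actual crawl relies on the crawler's own built-in robots.txt handling, not on this code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler downloads it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="ExampleCrawler"):
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("https://example.org/index.html"))
print(allowed("https://example.org/private/data.html"))
```

A polite crawler runs a check like this before every fetch and simply skips any URL the site owner has disallowed.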
ThumbCrawl - An Open Source Web Crawl
A straightforward, unbiased web crawl request. {Google cannot say that!}
Details of this request
This ThumbCrawl request will have you enter five specific URLs and two alternate URLs. The five URLs will be entered into the Seed.txt file, and a scripted batch crawl will be started. The results will be captured in a Solr instance. The environment is largely out of the box, using the default configuration files (robots.txt observed).
- The batch script in the request will perform n=3 looping iterations (a crawl depth of 3).
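The seed file and the three-round loop described above can be sketched roughly as follows. This is an illustration only, not the actual batch script: the function names, the `LINKS` table, and the example.org addresses are hypothetical, and the one-URL-per-line seed layout is assumed.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def write_seed_file(urls, path):
    """Write one URL per line, rejecting anything that is not http(s)."""
    for url in urls:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a crawlable URL: {url!r}")
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")


def crawl_rounds(seeds, links, rounds=3):
    """Run a fixed number of rounds; each round fetches the current frontier."""
    seen = set(seeds)
    frontier = list(seeds)
    fetched_per_round = []
    for _ in range(rounds):
        fetched_per_round.append(frontier)
        next_frontier = []
        for page in frontier:
            for target in links.get(page, []):
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier  # links found this round are fetched next round
    return fetched_per_round


# Hypothetical chain of pages: a -> b -> c -> d.
LINKS = {
    "https://example.org/a": ["https://example.org/b"],
    "https://example.org/b": ["https://example.org/c"],
    "https://example.org/c": ["https://example.org/d"],
    "https://example.org/d": [],
}

with tempfile.TemporaryDirectory() as tmp:
    seed_path = Path(tmp) / "Seed.txt"
    write_seed_file(["https://example.org/a"], seed_path)
    seeds = seed_path.read_text(encoding="utf-8").split()

rounds = crawl_rounds(seeds, LINKS, rounds=3)
print(rounds)
```

With three rounds, page "d" is discovered but never fetched. This is why a depth of 3 only reaches pages within a few clicks of the URLs you supply.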
Tools
The crawl will be performed using the open source Apache Nutch tech stack. It uses common open source tools and follows common crawl rules (robots.txt observed). No customization will be made to my search configuration, so the findings returned by these tools will be unbiased and uninfluenced. {Honest results with no results/marketing gaming. Second point: if you are a non-profit, start-up, or small education/civic group, I am happy to waive the fee, at my discretion.}
Result Output
Here are the three items that will be returned from the request: {i.e., a best effort is made on the batch run; bad URLs and robots.txt restrictions are often discovered}
- Screen Crawl Log
- Log file created by the Nutch application
- Solr instance of the crawl (empty if the crawl crashes)
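Once the crawl is in Solr, it can be searched through Solr's standard select endpoint. The sketch below only builds the query URL and does not contact any server. The host `localhost:8983`, the core name `nutch`, and the field names `url` and `title` are assumptions and depend on how the Solr instance and its schema are actually set up.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, query="*:*", fields=("url", "title"), rows=10):
    """Build a Solr select URL; host, core, and field names are assumptions."""
    params = urlencode(
        {"q": query, "fl": ",".join(fields), "rows": rows, "wt": "json"}
    )
    return f"{base_url.rstrip('/')}/solr/{core}/select?{params}"


url = solr_select_url("http://localhost:8983", "nutch", query="title:crawl")
print(url)
```

Opening a URL like this in a browser against a running Solr instance would return matching documents as JSON. The default `*:*` query is a quick way to confirm that the crawl produced any documents at all.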
Your five supplied URLs will be manually checked prior to the crawl to reduce harmful actions (e.g., no stalking, intimidation, underage activity, etc.).
** NOTE: If your URLs are suspicious, the requested crawl will not be activated/run. {funds returned, less the $5 review fee}
Next step: complete the Crawl Order Form