As a web developer, it's often handy to crawl one of your sites and see if any links are broken, or return plain 500 errors because something is broken behind them.
A classic tool for this is Xenu's Link Sleuth. That tool is old though, no longer updated, and it's a pure GUI tool. Since I couldn't find what I wanted as a ready-to-use command-line tool, I sat down and wrote my own. It took a while, but recently it became functional enough to release as a v1 and open up to the world as an open-source tool.
So with this I present, *drumroll*, Sitecrawler: a command-line site crawler (yeah, I know, naming is hard).
What can it do?
- Crawl a site by following all links on a page. It only crawls internal links and HTML content.
- Crawl each link only once. No crawling loops, please.
- Export the crawled links to a CSV file, including the referring pages (handy for tracking down 404s).
- Limit crawl time to a fixed number of minutes for large sites.
- Set the number of parallel jobs to use for crawling.
- Add a delay to throttle requests on slow sites, or sites with rate limits (there's a rough sketch of this crawl loop right after this list).
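
To give an idea of what that boils down to, here's a minimal sketch of such a crawl loop in C# (.NET 6). This is not the actual Sitecrawler code; the example.com start URL, the batch size of four, the fixed delay and the naive regex link extraction are all placeholders, just to illustrate the idea of crawling internal HTML links once, a few pages at a time, with a delay in between.

```csharp
// Illustrative only: a bare-bones crawl loop, not the actual Sitecrawler code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

var root = new Uri("https://example.com/");      // placeholder start URL
var delay = TimeSpan.FromMilliseconds(200);      // throttle for slow or rate-limited sites
var visited = new HashSet<string> { root.AbsoluteUri };
var frontier = new List<(Uri Url, string Referrer)> { (root, "") };
var results = new List<(string Url, int Status, string Referrer)>();

using var http = new HttpClient();

while (frontier.Count > 0)
{
    var next = new List<(Uri Url, string Referrer)>();

    // Fetch the current batch of pages a few at a time ("parallel jobs").
    foreach (var chunk in frontier.Chunk(4))
    {
        var pages = await Task.WhenAll(chunk.Select(async item =>
        {
            var response = await http.GetAsync(item.Url);
            var isHtml = response.Content.Headers.ContentType?.MediaType == "text/html";
            var html = response.IsSuccessStatusCode && isHtml
                ? await response.Content.ReadAsStringAsync()
                : "";
            return (item.Url, item.Referrer, Status: (int)response.StatusCode, Html: html);
        }));

        foreach (var page in pages)
        {
            results.Add((page.Url.AbsoluteUri, page.Status, page.Referrer));

            // Follow only internal links, and each URL only once.
            foreach (Match m in Regex.Matches(page.Html, "href=\"([^\"]+)\""))
            {
                if (Uri.TryCreate(page.Url, m.Groups[1].Value, out var link) &&
                    link.Host == root.Host &&
                    visited.Add(link.AbsoluteUri))
                {
                    next.Add((link, page.Url.AbsoluteUri));
                }
            }
        }

        await Task.Delay(delay); // be polite between batches
    }

    frontier = next;
}

// Dump a simple CSV: URL, status code, referring page (handy for hunting down 404s).
Console.WriteLine("url,status,referrer");
foreach (var (url, status, referrer) in results)
    Console.WriteLine($"{url},{status},{referrer}");
```

The real tool does a lot more than this (proper HTML parsing, time limits, configurable options), so treat this purely as a mental model of the crawl loop.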
It’s written in .NET 6, so it runs on Windows, Mac and Linux. Check it out on GitHub for more details and downloads. It’s proven useful for me already, so I hope it does the same for you.