By convention, websites may maintain a file called robots.txt on their web servers that tells crawlers, such as those run by Google, Bing, and Baidu, not to crawl certain portions of the site. Compliant crawlers follow this advice.
Interestingly, robots.txt files reveal information about the operation of the website itself. They can tell us to what degree a website is dynamic, or which content the website operator considers private or non-indexable, for whatever reason.
In this project, we are going to extend an Internet scanner to detect robots.txt files on web servers. We download the files and analyse them to identify their prevalence, functionality, misconfigurations, and possible security holes.
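As a rough illustration of the analysis step, the sketch below parses a downloaded robots.txt body and groups Allow/Disallow rules per user-agent, loosely following the grouping rules of the Robots Exclusion Protocol (RFC 9309). The function name `parse_robots` and the sample input are illustrative, not part of the project specification.

```python
def parse_robots(text: str) -> dict:
    """Group (allow|disallow, path) rules per user-agent.
    Minimal sketch: ignores wildcards, Sitemap lines, and other fields."""
    rules: dict = {}
    current: list = []    # user-agents of the group being built
    seen_rule = False     # a new User-agent line after rules starts a new group
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # strip comments and whitespace
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:
                current, seen_rule = [], False
            current.append(value)
            rules.setdefault(value, [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current:
                rules[agent].append((field, value))
    return rules

# Hypothetical robots.txt body, as it might be fetched by the scanner
sample = """\
User-agent: *
Disallow: /admin/   # paths like this hint at non-public functionality
Disallow: /tmp/

User-agent: BadBot
Disallow: /
"""
rules = parse_robots(sample)
```

In a study like this, the per-agent rule lists could then be aggregated across many hosts to measure, for example, how often operators disallow paths that look security-sensitive.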
Keywords: Internet scans, network security, data analysis
Advisor(s): Ralph Holz
Suitable for: Capstone