Hacker News

Note that robots.txt is a hint to well-behaved crawlers; it does not block anything by itself.
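To illustrate the advisory nature of robots.txt: a well-behaved crawler parses the file and voluntarily checks each URL before fetching it, e.g. with Python's stdlib `urllib.robotparser`. Nothing enforces the result; a misbehaved crawler simply skips the check.

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (normally fetched from the site root).
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# The parser only advises; honoring the answer is up to the crawler.
print(rp.can_fetch("MyBot", "https://example.com/private/page"))  # False
print(rp.can_fetch("MyBot", "https://example.com/public/page"))   # True
```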

You can block crawlers if you can identify them, but reliably identifying them is hard.



We should probably classify the crawler-identification problem as impossible and move on. Fewer resources wasted, and easier automation for everyone. Assuming every crawler is malicious is narrow-minded.



This helps verify that a bot announcing itself as Googlebot really is Googlebot. It doesn't help identify a bot that pretends to be a regular user/browser.
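The verification being referred to is Google's documented reverse-then-forward DNS check: reverse-resolve the client IP, confirm the hostname is under googlebot.com or google.com, then forward-resolve that hostname and confirm it maps back to the same IP (so a spoofed PTR record alone isn't enough). A minimal sketch, assuming the caller passes the raw client IP; the function names are mine:

```python
import socket

GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def hostname_is_google(hostname: str) -> bool:
    # Pure string check: does the reverse-DNS name fall under a Google domain?
    return hostname.rstrip(".").endswith(GOOGLE_SUFFIXES)

def verify_googlebot(ip: str) -> bool:
    """Reverse DNS, suffix check, then forward-confirm back to the same IP."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)  # reverse lookup
    except OSError:
        return False
    if not hostname_is_google(hostname):
        return False
    try:
        # Forward lookup must include the original IP, or the PTR was spoofed.
        return ip in socket.gethostbyname_ex(hostname)[2]
    except OSError:
        return False
```

The forward-confirmation step is the important one: anyone controlling reverse DNS for their own IP block can make an IP reverse-resolve to `crawl-x.googlebot.com`, but they can't make Google's forward zone point that name back at their IP.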



