
robotparser does not allow some URLs that are allowed #110333

@FedeGerva

Description

Bug report

Bug description:

I was using robotparser to check some websites and noticed strange behavior. I gave it some URLs that I knew could be crawled, and it returned False. After some debugging I realized the URLs were fine, but there seems to be a problem in the function that fetches the robots.txt file.

Line 62 of https://github.com/python/cpython/blob/main/Lib/urllib/robotparser.py:

f = urllib.request.urlopen(self.url)

returned error code 404, even though the robots.txt file was actually present at that URL and I could view it in a browser.
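
For reference, here is a minimal sketch of how I hit this. example.com stands in for the real site, which rejects requests that carry urllib's default "Python-urllib/X.Y" User-Agent:

import urllib.robotparser

# example.com is a placeholder; the real host rejects urllib's
# default "Python-urllib/X.Y" User-Agent.
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()  # internally calls urllib.request.urlopen(self.url)

# Prints False for me, even though the robots.txt I can open in a
# browser allows crawling this path.
print(rp.can_fetch('*', 'https://example.com/some/page'))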

When I replaced that call with the following lines, the error disappeared and the function reported that the URL could be crawled, as I expected:

header = {'User-Agent': '*'}
req = urllib.request.Request(url=self.url, headers=header)
f = urllib.request.urlopen(req)
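
In the meantime, the same effect is possible without patching the stdlib, because RobotFileParser.read() goes through the module-level urllib.request.urlopen(). A sketch of that workaround, assuming a hypothetical crawler name 'MyCrawler/1.0' (any User-Agent string the server accepts works):

import urllib.request
import urllib.robotparser

# Install a global opener whose default headers replace the stock
# "Python-urllib/X.Y" User-Agent; robotparser picks it up because it
# calls urllib.request.urlopen() internally.
opener = urllib.request.build_opener()
opener.addheaders = [('User-Agent', 'MyCrawler/1.0')]  # hypothetical UA
urllib.request.install_opener(opener)

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
print(rp.can_fetch('MyCrawler', 'https://example.com/some/page'))

Note that install_opener() changes urlopen() process-wide, so this is a stopgap rather than a real fix.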

Could you give me some feedback on this? Do you think the fix is correct and could it be applied to the main version?

CPython versions tested on:

3.8, 3.9

Operating systems tested on:

Windows

Metadata

Labels

stdlib: Standard Library Python modules in the Lib/ directory
type-feature: A feature request or enhancement
