Scraping sites with multiprocessing, random proxies and user-agents
https://theodor85.github.io/first_scrap/
First_scrap is a library for multithreaded scraping of websites with random proxies and user-agents.
To get started with first_scrap, create (if necessary) and activate a virtual environment, for example:
python3 -m venv env
source ./env/bin/activate
To install first_scrap, use the pip package manager:
pip install firstscrap
Alternatively, install from the source code on GitHub by running these commands in your console:
git clone http://github.com/theodor85/first_scrap
cd first_scrap
python setup.py develop
Usage example for extracting data from a single web page:
from firstscrap import pagehandler


@pagehandler(parser="BeautifulSoup")
def get_data(url, soup=None):
    # Only your BeautifulSoup code goes here: no requests, proxies, etc.
    span = soup.find(name="span", attrs={"class": "p-nickname vcard-username d-block"})
    text = span.get_text().strip()
    return text


if __name__ == '__main__':
    print(get_data('https://github.com/theodor85'))

# output:
# theodor85
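To illustrate the idea behind a decorator that hands a ready-made soup object to your function, here is a minimal sketch. It is not first_scrap's implementation: the names page_handler_sketch, fake_fetch and fake_parse are hypothetical, and the fetch and parse steps are offline stand-ins so the example runs without a network connection or third-party packages.

```python
import functools


def page_handler_sketch(fetch, parse):
    """Hypothetical decorator factory: fetch the URL, parse it, pass the result as `soup`."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(url):
            html = fetch(url)  # in a real scraper, the network request would happen here
            return func(url, soup=parse(html))
        return wrapper
    return decorator


# Offline stand-ins (assumptions for illustration only).
def fake_fetch(url):
    return "<span>  theodor85 </span>"


def fake_parse(html):
    # A real parser would build a document tree; this one only strips the tags.
    return html.replace("<span>", "").replace("</span>", "")


@page_handler_sketch(fetch=fake_fetch, parse=fake_parse)
def get_data(url, soup=None):
    # The decorated function contains only the extraction logic.
    return soup.strip()


if __name__ == "__main__":
    print(get_data("https://example.com/profile"))
```

The decorated function never touches the network itself, which is the separation of concerns the example above relies on.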
When extracting data from a list of pages, decorate a function that handles one page with listhandler and call it with the list of URLs. Here is an example:
from firstscrap import listhandler

TEST_URLLIST_OLX = [
    'https://www.olx.ua/obyavlenie/spetsialist-po-podklyucheniyu-interneta-IDGnCkB.html',
    'https://www.olx.ua/obyavlenie/menedzher-po-robot-s-klentami-IDGkGK6.html',
]


@listhandler(threads_limit=5, parser='BeautifulSoup')
def get_date_time_from_olx(urllist, soup=None):
    ''' Beautifulsoup code for one page '''
    em = soup.find('em')
    row_text = em.get_text().strip()
    return row_text


if __name__ == '__main__':
    data = get_date_time_from_olx(TEST_URLLIST_OLX)
    for item in data:
        print(item)

# output:
# Добавлено: в 16:49, 26 декабря 2019, Номер объявления: 626235005
# Добавлено: в 16:18, 29 декабря 2019, Номер объявления: 625536978
The program processes each page in a separate thread, and the number of threads running at the same time does not exceed threads_limit.
Each thread makes its request using a random proxy and a random user-agent.
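The thread limit and the per-request random choice of proxy and user-agent can be sketched with the standard library alone. This is an illustration under stated assumptions, not first_scrap's actual code: PROXIES, USER_AGENTS, fake_fetch and process_all are hypothetical names, and fake_fetch replaces the real HTTP request so the sketch runs offline.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# Hypothetical pools (assumptions); a real scraper would load these from somewhere.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
USER_AGENTS = ["agent-a/1.0", "agent-b/1.0"]


def fake_fetch(url, proxy, user_agent):
    """Stand-in for a real HTTP request, so the sketch runs offline."""
    return f"{url} via {proxy} as {user_agent}"


def process_all(urls, threads_limit=5, fetch=fake_fetch):
    """Process every URL in a worker thread, at most threads_limit at a time."""
    def worker(url):
        # Each task draws its own proxy and user-agent at random.
        proxy = random.choice(PROXIES)
        user_agent = random.choice(USER_AGENTS)
        return fetch(url, proxy, user_agent)

    # The pool never runs more than threads_limit workers concurrently;
    # map() yields results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=threads_limit) as pool:
        return list(pool.map(worker, urls))


if __name__ == "__main__":
    urls = ["https://example.com/a", "https://example.com/b"]
    for line in process_all(urls, threads_limit=2):
        print(line)
```

A bounded pool keeps the number of simultaneous connections predictable, while drawing the proxy and user-agent inside each task means consecutive requests rarely share the same pair.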
To run the tests, type in your console:
python -m unittest -v tests/tests.py
Before running the tests, ensure that your internet connection is active.
To contribute, please merge your code into the ‘develop’ branch.
Forks and pull requests are welcome! If you like first_scrap, do not forget to put a star!
To report a bug, please email fedor_coder@mail.ru with the tag “first_scrap bug reporting”.
This project is licensed under the MIT License - see the LICENSE.txt file for details.