🔎 A search tool that crawls the ‘webscraping’ example website to provide country search functionality!
A Python search tool that crawls this example scraping website, parses each page with the BeautifulSoup library, and builds an inverted index of all word occurrences. The user can then find the pages that contain a search term.
➰ Project Duration
April 2020 - May 2020
🎨 Features
- Crawls all the pages of the website
- Tokenizes the parsed page content, removing styling elements and stripping punctuation and whitespace
- Creates the inverted index for the whole website
- Prints the inverted list for a certain word
- Finds pages containing the search terms
- Computes a score for each page when processing a search query
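
The indexing and search steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names (`tokenize`, `build_index`, `find`), the sample pages, and the scoring rule (summed occurrence counts) are all assumptions, and the page text is taken as already extracted from the HTML.

```python
import string
from collections import defaultdict

# Hypothetical sample data: page name -> text already extracted from HTML.
SAMPLE_PAGES = {
    "page-a": "Iceland, Iceland! Capital: Reykjavik",
    "page-b": "Ireland. Capital: Dublin",
    "page-c": "Iceland capital",
}


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


def build_index(pages):
    """Map each word to {page: number of occurrences on that page}."""
    index = defaultdict(dict)
    for page, text in pages.items():
        for word in tokenize(text):
            index[word][page] = index[word].get(page, 0) + 1
    return dict(index)


def find(index, query):
    """Return (page, score) pairs for pages that contain every query word.

    The score is the summed occurrence count of the query words, which is
    an assumed ranking rule chosen for simplicity.
    """
    words = tokenize(query)
    if not words:
        return []
    postings = [index.get(word, {}) for word in words]
    common = set(postings[0]).intersection(*postings[1:])
    scored = [(page, sum(p[page] for p in postings)) for page in common]
    # Highest score first; ties broken by page name for stable output.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


index = build_index(SAMPLE_PAGES)
print(index["iceland"])
print(find(index, "Iceland capital"))
```

Storing per-page counts in the index keeps both the `print` output and the scoring step cheap, since neither needs to revisit the crawled pages.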
🐾 Examples
Command | Explanation | Syntax |
---|---|---|
Build | Crawls the website, builds the inverted index, and saves the resulting index to the file system | build |
Load | Loads the existing index from the file system | load |
Print | Prints the inverted index for a particular word | print <word> |
Find | Looks up a query phrase in the inverted index and returns a list of all pages containing this phrase | find <word1> <word2> <word n> |
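
As a rough sketch of how the four commands in the table might be dispatched, the example below keeps the index in memory and persists it as JSON. The JSON file format, the `run_command` helper, the status messages, and the hard-coded stand-in for crawling are all assumptions made for illustration; the project's real storage format and menu code may differ.

```python
import json
import os
import tempfile


def save_index(index, path):
    """Write the index to disk as JSON (assumed storage format)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(index, fh)


def load_index(path):
    """Read a previously saved index back from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def run_command(line, state, path):
    """Handle one command line; `state` holds the in-memory index."""
    parts = line.split()
    if not parts:
        return "empty command"
    command, args = parts[0].lower(), [a.lower() for a in parts[1:]]
    if command == "build":
        # Stand-in for crawling: a tiny hard-coded index.
        state["index"] = {
            "iceland": {"page-a": 2},
            "capital": {"page-a": 1, "page-b": 1},
        }
        save_index(state["index"], path)
        return "index built and saved"
    if command == "load":
        state["index"] = load_index(path)
        return "index loaded"
    index = state.get("index")
    if index is None:
        return "no index: run build or load first"
    if command == "print" and len(args) == 1:
        return index.get(args[0], {})
    if command == "find" and args:
        # Keep only the pages that contain every query word.
        postings = [index.get(word, {}) for word in args]
        return sorted(set(postings[0]).intersection(*postings[1:]))
    return "unknown command"


# Example session against a temporary index file.
index_path = os.path.join(tempfile.mkdtemp(), "index.json")
session = {}
print(run_command("build", session, index_path))
print(run_command("find iceland capital", session, index_path))
```

Refusing `print` and `find` until an index exists mirrors the table's split between `build` and `load`: either one must run before any query.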
- Command Line menu
1. Built inverted index
2. Print
3. Find
- No match (when the search term doesn’t exist)
- Successful search
📚 Stack
- Python
- Requests - A simple Python library for sending HTTP requests.
- BeautifulSoup - A Python library for navigating, searching, and modifying a parse tree built from HTML and XML files.
⚒ Installation / How to Run
Install the following libraries with pip:
pip install requests
pip install beautifulsoup4
urllib is part of the Python standard library and does not need to be installed.
Running the client:
- python3 "Crawl and Search.py"
📜 License
This project is licensed under the terms of the MIT license.
You can check out the full license here