huazi

huazi

Write a small web crawler tool using Python.

So here's the thing, our company uses a third-party payment tool called ping++, and for work, we need to convert the bank codes and bank names in their documentation into a JSON string for local use. The specific link is here: https://www.pingxx.com/api#bank-code-explanation

Now the question is, how should we handle this bank code table? We can't just copy each one manually, that would be too inefficient and not the way of a programmer. So I asked my iOS colleague for a copy of the JSON data they parsed...hahaha...it's that simple.Yes, I did get it, but I still need to do it myself. I need to be good at using code to solve problems that can be solved with code in life.

Once I figured out what I needed to do, it was easy. The first thing that came to mind was using Python web scraping to parse the webpage and analyze the data.

Key Modules for Building a Web Crawler#

  • URL Manager
    Used to manage the URLs that need to be crawled and the URLs that have already been crawled. You can use a set or other tools for this. In this case, since we are only crawling a single fixed webpage, we can use a string to represent it.
  • HTML Downloader
    Downloads the webpage corresponding to the URL in the URL manager, saves it as a string, and then passes this string to the HTML parser for parsing. Here, we use the built-in urllib module in Python. In Python 2.7, this module is known as urllib2, but in Python 3.0, urllib2 is unified under urllib.request and the usage is slightly different. You can Google it for more information.
  • HTML Parser
    On one hand, it parses out valuable data, and on the other hand, since each page has many links to other pages, after parsing these URLs, they can be added to the URL manager for future crawling. Here, we use the third-party module BeautifulSoup, which needs to be installed. You can install it using the following methods:
$ easy_install beautifulsoup4

or

pip install beautifulsoup4

Downloading the Webpage#

import urllib.request

response = urllib.request.urlopen("https://www.pingxx.com/api#bank-code-explanation")
print(response.read())

That's right, it only takes two lines of code to download a webpage. In fact, one line is enough.

Parsing the Webpage#

The webpage we downloaded above is the source code of an HTML page. To parse it, BeautifulSoup supports multiple parsing methods to obtain the data you want. Here, you need to go to the website and look at the source code to see where the data you need is located. You can use Command+Option+J to inspect the elements and find the available places in the source code for scraping. The image below shows the source code we located.

Paste_Image.png

After inspecting it, we found that the data we need is in the table under a specific div. We need to get the table and then parse it. However, it's not easy to directly get the div, but there is an

<h2 id="bank-code-explanation">Bank Code Explanation</h2>

that is easier to get. So we'll start with that, then get its parent node, and then get the table.
The code is as follows:

import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen("https://www.pingxx.com/api#bank-code-explanation")
soup = BeautifulSoup(response, "html.parser")
table = soup.find("h2", id="bank-code-explanation").parent.find("table").find("tbody")

This way, we have obtained the table. The rest is simple, just need to process the table. First, get all the tr elements, then get the td elements in each tr, and then get the values we need from the td elements to concatenate the JSON string.
The complete code is as follows:

import urllib.request
from bs4 import BeautifulSoup

response = urllib.request.urlopen("https://www.pingxx.com/api#bank-code-explanation")
soup = BeautifulSoup(response, "html.parser")
table = soup.find("h2", id="bank-code-explanation").parent.find("table").find("tbody")

bankJson = "["
for row in table.findAll('tr'):
    if len(row) > 2:
        cells = row.findAll('td')
        bank_code = cells[0].find(text=True)
        bank_name = cells[1].find(text=True)
        bankJson = bankJson + "{" + "\"code\":\"" + bank_code + "\"," + "\"name\":\"" + bank_name + "\"},"

bankJson = bankJson[0:len(bankJson) - 1] + "]"
print(bankJson)

Running the code will print the following result:

[{"code":"0100","name":"China Postal Savings Bank"},{"code":"0102","name":"Industrial and Commercial Bank of China"},{"code":"0103","name":"Agricultural Bank of China"},{"code":"0104","name":"Bank of China"},{"code":"0105","name":"China Construction Bank"},{"code":"0301","name":"Bank of Communications"},{"code":"0302","name":"China CITIC Bank"},{"code":"0303","name":"China Everbright Bank"},{"code":"0304","name":"Huaxia Bank"},{"code":"0305","name":"China Minsheng Bank"},{"code":"0306","name":"China Guangfa Bank"},{"code":"0308","name":"China Merchants Bank"},{"code":"0309","name":"Industrial Bank Co., Ltd."},{"code":"0310","name":"Shanghai Pudong Development Bank"},{"code":"0311","name":"Hengfeng Bank"},{"code":"0313","name":"Linyi Commercial Bank"},{"code":"0316","name":"China Zheshang Bank"},{"code":"0317","name":"Bank of Bohai"},{"code":"0318","name":"Ping An Bank"},{"code":"0328","name":"Shinhan Bank (China)"},{"code":"0329","name":"Hana Bank (China)"},{"code":"0336","name":"Enterprise Bank"},{"code":"0401","name":"Bank of Shanghai"},{"code":"0402","name":"Xiamen Bank"},{"code":"0403","name":"Bank of Beijing"},{"code":"0404","name":"Yantai Commercial Bank"},{"code":"0405","name":"Fujian Haixia Bank"},{"code":"0406","name":"Jilin Bank"},{"code":"0408","name":"Ningbo Bank"},{"code":"0412","name":"Wenzhou Bank"},{"code":"0413","name":"Guangzhou Bank"},{"code":"0414","name":"Hankou Bank"},{"code":"0418","name":"Luoyang Bank"},{"code":"0420","name":"Dalian Bank"},{"code":"0422","name":"Hebei Bank"},{"code":"0423","name":"Hangzhou Commercial Bank"},{"code":"0424","name":"Nanjing Bank"},{"code":"0427","name":"Urumqi Commercial Bank"},{"code":"0428","name":"Shaoxing Bank"},{"code":"0433","name":"Huludao Commercial Bank"},{"code":"0434","name":"Tianjin Bank"},{"code":"0435","name":"Zhengzhou Bank"},{"code":"0436","name":"Ningxia Bank"},{"code":"0438","name":"Qishang Bank"},{"code":"0439","name":"Jinzhou Bank"},{"code":"0440","name":"Huishang Bank"},{"code":"0441","name":"Chongqing Bank"},{"code":"0442","name":"Harbin Bank"},{"code":"0443","name":"Guiyang Bank"},{"code":"0447","name":"Lanzhou Bank"},{"code":"0448","name":"Nanchang Bank"},{"code":"0449","name":"Jinshang Bank"},{"code":"0450","name":"Qingdao Bank"},{"code":"0455","name":"Rizhao Commercial Bank"},{"code":"0456","name":"Anshan Bank"},{"code":"0458","name":"Qinghai Bank"},{"code":"0459","name":"Taizhou Bank"},{"code":"0461","name":"Changsha Bank"},{"code":"0463","name":"Ganzhou Bank"},{"code":"0465","name":"Yingkou Bank"},{"code":"0467","name":"Fuxin Bank"},{"code":"0474","name":"Inner Mongolia Bank"},{"code":"0475","name":"Huzhou Commercial Bank"},{"code":"0476","name":"Cangzhou Bank"},{"code":"0479","name":"Baoshang Bank"},{"code":"0481","name":"Weihai Commercial Bank"},{"code":"0483","name":"Panzhihua Commercial Bank"},{"code":"0485","name":"Mianyang Commercial Bank"},{"code":"0490","name":"Zhangjiakou Commercial Bank"},{"code":"0492","name":"Longjiang Bank"},{"code":"0495","name":"Liuzhou Bank"},{"code":"0497","name":"Laishang Bank"},{"code":"0498","name":"Deyang Bank"},{"code":"0503","name":"Jincheng Bank"},{"code":"0505","name":"Dongguan Commercial Bank"},{"code":"0508","name":"Jiangsu Bank"},{"code":"0513","name":"Chengde Commercial Bank"},{"code":"0515","name":"Dezhou Bank"},{"code":"0517","name":"Handan Commercial Bank"},{"code":"0525","name":"Zhejiang Mintai Commercial Bank"},{"code":"0526","name":"Shangrao Commercial Bank"},{"code":"0527","name":"Dongying Bank"},{"code":"0528","name":"Taian Commercial Bank"},{"code":"0530","name":"Zhejiang Chouzhou Commercial Bank"},{"code":"0534","name":"Ordos Bank"},{"code":"0537","name":"Jining Bank"},{"code":"0547","name":"Kunlun Bank"},{"code":"0554","name":"Xingtai Bank"},{"code":"0556","name":"Luohe Commercial Bank"},{"code":"1401","name":"Shanghai Rural Commercial Bank"},{"code":"1402","name":"Kunshan Rural Credit Cooperative"},{"code":"1403","name":"Changshu Rural Commercial Bank"},{"code":"1404","name":"Shenzhen Rural Commercial Bank"},{"code":"1405","name":"Guangzhou Rural Commercial Bank"},{"code":"1408","name":"Foshan Shunde Rural Commercial Bank"},{"code":"1409","name":"Kunming Rural Credit Cooperative Union"},{"code":"1410","name":"Hubei Rural Credit Cooperative Union"},{"code":"1415","name":"Dongguan Rural Commercial Bank"},{"code":"1416","name":"Zhangjiagang Rural Commercial Bank"},{"code":"1417","name":"Fujian Rural Credit Cooperative Union"},{"code":"1418","name":"Beijing Rural Commercial Bank"},{"code":"1419","name":"Tianjin Rural Commercial Bank"},{"code":"1420","name":"Ningbo Yinzhou Rural Cooperative Bank"},{"code":"1424","name":"Jiangsu Rural Credit Cooperative Union"},{"code":"1428","name":"Wujiang Rural Commercial Bank"},{"code":"1430","name":"Bank of Suzhou"},{"code":"1443","name":"Guangxi Rural Credit Cooperative Union"},{"code":"1446","name":"Yellow River Rural Commercial Bank"},{"code":"1447","name":"Anhui Rural Credit Cooperative Union"},{"code":"1448","name":"Hainan Rural Credit Cooperative Union"},{"code":"1513","name":"Chongqing Rural Commercial Bank"},{"code":"6462","name":"Weifang Commercial Bank"},{"code":"6466","name":"Fudian Bank"},{"code":"6473","name":"Zhejiang Tailong Commercial Bank"},{"code":"6478","name":"Bank of Beibu Gulf"},{"code":"6567","name":"Shangqiu Commercial Bank"}]

Alright, problem solved. To be honest, it was quite fun to write code to solve a real problem I encountered for the first time. When you want to learn something or do something, don't overthink it. For example, don't prepare all the materials in advance, don't worry about whose materials are better, and don't worry about what will happen if you can't learn or do it well, etc. Then your passion will be gone and nothing will be accomplished. What we want is to do what we want to do immediately, just do it.

Loading...
Ownership of this post data is guaranteed by blockchain and smart contracts to the creator alone.