A. Introduction
The FIBA World Cup 2023 [1] men's games are currently being played in Indonesia, Japan and the Philippines from August 25 to September 10. In this post, I will create a Playwright [2] script that crawls the FIBA team pages to get the match results and saves them to csv files. These files can be used to generate an approximate live rating [4] of the FIBA World Ranking, presented by Nike [3].
B. Page URL to crawl
The FIBA page has a url for all teams that participated in this world cup.
The base url is:
https://www.fiba.basketball/basketballworldcup/2023/
The team url has this pattern.
https://www.fiba.basketball/basketballworldcup/2023/team/{country}
Here {country} can be 'Angola', 'Brazil', 'Cape-Verde', and so on.
The page that we are interested in is the games and results tab. For example, the USA team has the following url.
https://www.fiba.basketball/basketballworldcup/2023/team/USA#|tab=games_and_results
C. Playwright code
The script starts with the module docstring and the imports.

    """Crawl match results of each team.

    Results are taken from FIBA men world cup 2023.
    """

    import random
    import os.path
    import time

    from playwright.sync_api import sync_playwright
    import pandas as pd
    ...
We need to install playwright and pandas, and download the Chromium browser that Playwright drives, with:

    pip install playwright
    pip install pandas
    playwright install chromium
We also need a list of country names. Each name is used to build the url of a team page to crawl.

    COUNTRIES = ['Angola', 'Brazil', 'Cape-Verde', 'Cote-d-Ivoire', 'Egypt',
                 'France', 'Germany', 'Iran', 'Japan', 'Latvia', 'Lithuania',
                 'Montenegro', 'Philippines', 'Serbia', 'South-Sudan', 'USA',
                 'Australia', 'Canada', 'China', 'Dominican-Republic',
                 'Finland', 'Georgia', 'Greece', 'Italy', 'Jordan', 'Lebanon',
                 'Mexico', 'New-Zealand', 'Puerto-Rico', 'Slovenia', 'Spain',
                 'Venezuela']
And this is the url builder.

    def build_url(country):
        """Creates a url from the given country."""
        return f'https://www.fiba.basketball/basketballworldcup/2023/team/{country}#|tab=games_and_results'
The entry point loops over the countries and calls the url builder.

    # Entry point
    for c in COUNTRIES:
        url = build_url(c)
        ...
The main function that does the work is 'get_results()'. The results of each country are saved in a csv file under the 'data' folder, named after the country's IOC code. A country that already has a file is skipped.

    # Entry point
    for c in COUNTRIES:
        url = build_url(c)
        cc = CTY_TO_IOC[c]
        fpath = f'./data/{cc}.csv'
        is_file = os.path.isfile(fpath)

        # Do not crawl twice.
        if is_file:
            continue

        df = get_results(url)
        ...
This is the function. We try to visit the page and if there is an error we attempt to visit it again.
    # Define a function to get the matchup and team results
    def get_results(url):
        with sync_playwright() as p:
            # If there is failure, re-crawl.
            repeat = 0
            while True:
                browser = None
                try:
                    browser = p.chromium.launch(headless=False)
                    page = browser.new_page()
                    ua = random.choice(USER_AGENTS)
                    page.set_extra_http_headers({"User-Agent": ua})
                    page.goto(url, timeout=30000)
                    a = page.query_selector('div.schedule_list.gmt')
                    b = a.query_selector('ul')
                    c = b.query_selector_all('div.game_item')
                except Exception as exc:
                    repeat += 1
                    print(f'{repr(exc)}')
                    # The launch itself may have failed, so close only if a browser exists.
                    if browser is not None:
                        browser.close()
                    time.sleep(SLEEP_TIME_REPEAT_INTERVAL)
                else:
                    break
            ...
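The loop above keeps retrying until the page loads, so a page that never loads would keep the script running forever. A minimal sketch of a bounded variant is shown below. The helper name 'with_retries' and its parameters 'max_attempts' and 'delay' are hypothetical and not part of the original script, and a stand-in function replaces the real page visit so the sketch runs without a browser.

```python
import time


def with_retries(action, max_attempts=3, delay=0.0):
    """Call action() until it succeeds or max_attempts is reached.

    Hypothetical helper, not part of the original script.
    """
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:  # same broad catch as the original loop
            last_exc = exc
            print(f'attempt {attempt} failed: {exc!r}')
            if attempt < max_attempts:
                time.sleep(delay)
    raise RuntimeError(f'gave up after {max_attempts} attempts') from last_exc


# Stand-in for the page visit: fails twice, then succeeds.
calls = {'count': 0}


def flaky_visit():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('page did not load')
    return ['game 1', 'game 2']


result = with_retries(flaky_visit, max_attempts=5)
print(result)
```

In the real script, the action would be the block that launches the browser, opens the page and queries the game items, and 'delay' would play the role of SLEEP_TIME_REPEAT_INTERVAL.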
If all elements are present, proceed extracting the result info.
    ...
    while True:
        ...

    team_left, team_right, points = [], [], []
    for g in c:
        # Team name at the left of the game item.
        tl = g.query_selector('table.country.left')
        d = tl.query_selector('div.name')
        team_left.append(d.inner_text())

        # Scores of both teams; empty when the game has no result yet.
        pt = g.query_selector('table.points')
        divs = pt.query_selector_all('div.number')
        pp = []
        for div in divs:
            n = div.inner_text()
            pp.append(n)
        points.append(pp)

        # Team name at the right of the game item.
        tr = g.query_selector('table.country.right')
        d = tr.query_selector('div.name')
        team_right.append(d.inner_text())
    ...
Full code
The full source code below can also be found in my github page crawler [5] repository.
matchup.py

    """Crawl match results of each team.

    Results are taken from FIBA men world cup 2023.
    """

    import random
    import os.path
    import time

    from playwright.sync_api import sync_playwright
    import pandas as pd


    SLEEP_TIME_REPEAT_INTERVAL = 5  # SEC

    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.2227.0 Safari/537.36',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.3497.92 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    ]

    COUNTRIES = ['Angola', 'Brazil', 'Cape-Verde', 'Cote-d-Ivoire', 'Egypt',
                 'France', 'Germany', 'Iran', 'Japan', 'Latvia', 'Lithuania',
                 'Montenegro', 'Philippines', 'Serbia', 'South-Sudan', 'USA',
                 'Australia', 'Canada', 'China', 'Dominican-Republic',
                 'Finland', 'Georgia', 'Greece', 'Italy', 'Jordan', 'Lebanon',
                 'Mexico', 'New-Zealand', 'Puerto-Rico', 'Slovenia', 'Spain',
                 'Venezuela']

    CTY_TO_IOC = {'Angola': 'ANG', 'Brazil': 'BRA', 'Cape-Verde': 'CPV',
                  'Cote-d-Ivoire': 'CIV', 'Egypt': 'EGY', 'France': 'FRA',
                  'Germany': 'GER', 'Iran': 'IRI', 'Japan': 'JPN',
                  'Latvia': 'LAT', 'Lithuania': 'LTU', 'Montenegro': 'MNE',
                  'Philippines': 'PHI', 'Serbia': 'SRB', 'South-Sudan': 'SSD',
                  'USA': 'USA', 'Australia': 'AUS', 'Canada': 'CAN',
                  'China': 'CHN', 'Dominican-Republic': 'DOM',
                  'Finland': 'FIN', 'Georgia': 'GEO', 'Greece': 'GRE',
                  'Italy': 'ITA', 'Jordan': 'JOR', 'Lebanon': 'LBN',
                  'Mexico': 'MEX', 'New-Zealand': 'NZL', 'Puerto-Rico': 'PUR',
                  'Slovenia': 'SLO', 'Spain': 'ESP', 'Venezuela': 'VEN'}


    def build_url(country):
        """Creates a url from the given country."""
        return f'https://www.fiba.basketball/basketballworldcup/2023/team/{country}#|tab=games_and_results'


    # Define a function to get the matchup and team results
    def get_results(url):
        with sync_playwright() as p:
            # If there is failure, re-crawl.
            repeat = 0
            while True:
                browser = None
                try:
                    browser = p.chromium.launch(headless=False)
                    page = browser.new_page()
                    ua = random.choice(USER_AGENTS)
                    page.set_extra_http_headers({"User-Agent": ua})
                    page.goto(url, timeout=30000)
                    a = page.query_selector('div.schedule_list.gmt')
                    b = a.query_selector('ul')
                    c = b.query_selector_all('div.game_item')
                except Exception as exc:
                    repeat += 1
                    print(f'{repr(exc)}')
                    # The launch itself may have failed, so close only if a browser exists.
                    if browser is not None:
                        browser.close()
                    time.sleep(SLEEP_TIME_REPEAT_INTERVAL)
                else:
                    break

            team_left, team_right, points = [], [], []
            for g in c:
                # Team name at the left of the game item.
                tl = g.query_selector('table.country.left')
                d = tl.query_selector('div.name')
                team_left.append(d.inner_text())

                # Scores of both teams; empty when the game has no result yet.
                pt = g.query_selector('table.points')
                divs = pt.query_selector_all('div.number')
                pp = []
                for div in divs:
                    n = div.inner_text()
                    pp.append(n)
                points.append(pp)

                # Team name at the right of the game item.
                tr = g.query_selector('table.country.right')
                d = tr.query_selector('div.name')
                team_right.append(d.inner_text())

            browser.close()

        # Keep only the games that already have a score.
        data = []
        for l, p, r in zip(team_left, points, team_right):
            if len(p):
                data.append([l, p[0], r, p[1]])

        df = pd.DataFrame(data, columns=['C1', 'C1S', 'C2', 'C2S'])
        return df


    # Entry point
    # Make sure the output folder exists before saving to it.
    os.makedirs('./data', exist_ok=True)

    for c in COUNTRIES:
        url = build_url(c)
        cc = CTY_TO_IOC[c]
        fpath = f'./data/{cc}.csv'
        is_file = os.path.isfile(fpath)

        # Do not crawl twice.
        if is_file:
            continue

        df = get_results(url)

        # Save so that we will not attempt to crawl it again later when there is failure.
        df.to_csv(fpath, index=False)
The command line is:
python matchup.py
You can rerun the script; countries that already have a saved file are not crawled again. If there is a new result for a country, delete its saved '.csv' file before running the script.
D. Sample output
ANG.csv

    C1,C1S,C2,C2S
    ANG,67,ITA,81
    PHI,70,ANG,80
    ANG,67,DOM,75
    ANG,76,CHN,83
    ANG,78,SSD,101
The headers are 'C1', the code of the country shown at the left of the page, 'C1S', its score, 'C2', the code of the country shown at the right of the page, and 'C2S', its score.
In the first match, Angola lost to Italy 67-81. In the second game, Angola won against the Philippines 80-70.
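Reading the rows this way can be sketched with Python's standard csv module. The rows are copied from the ANG.csv sample above. The helper name 'winner' is hypothetical, and it assumes that a game never ends in a tie.

```python
import csv
import io

# Sample rows copied from ANG.csv above.
SAMPLE = """C1,C1S,C2,C2S
ANG,67,ITA,81
PHI,70,ANG,80
ANG,67,DOM,75
ANG,76,CHN,83
ANG,78,SSD,101
"""


def winner(row):
    """Return the country code of the higher-scoring team in one row.

    Hypothetical helper; assumes the two scores are never equal.
    """
    return row['C1'] if int(row['C1S']) > int(row['C2S']) else row['C2']


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
winners = [winner(r) for r in rows]
print(winners)  # ['ITA', 'ANG', 'DOM', 'CHN', 'SSD']
```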
The csv files can be combined into a single csv file where all results are saved.

combine.py

    import os
    import pandas as pd

    all_df = []
    directory = './data'

    for filename in os.listdir(directory):
        if filename.endswith('.csv'):
            df = pd.read_csv(f'{directory}/{filename}')
            all_df.append(df)

    df1 = pd.concat(all_df, ignore_index=True)
    df1 = df1.drop_duplicates(keep='last')
    df1 = df1.reset_index(drop=True)
    df1.to_csv('combined_results.csv', index=False)
    print(df1)
Every game is listed in the files of both teams, so the combined data contains duplicates. The script removes them with 'drop_duplicates()' before saving.
The printed dataframe looks like this.
         C1  C1S   C2  C2S
    0   CAN   65  BRA   69
    1   ANG   76  CHN   83
    2   CIV   77  BRA   89
    3   ANG   67  DOM   75
    4   ESP   94  CIV   64
    ..  ...  ...  ...  ...
    75  SLO  100  VEN   85
    76  VEN   75  CPV   81
    77  GEO   70  VEN   59
    78  JPN   86  VEN   77
    79  FIN   90  VEN   75
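The duplicate removal in combine.py can be illustrated without pandas. This is only a sketch: the second list below is an assumption that reuses the Angola vs Italy row from ANG.csv to stand in for the same game as it would be saved in Italy's file.

```python
# Rows as they would be read from two per-team files.
# ita_rows is an assumption: the ANG-ITA game is expected to appear in
# Italy's file as the same row that is saved in ANG.csv.
ang_rows = [('ANG', 67, 'ITA', 81), ('PHI', 70, 'ANG', 80)]
ita_rows = [('ANG', 67, 'ITA', 81)]

seen = set()
unique_rows = []
for row in ang_rows + ita_rows:
    # Keep a row only the first time it is met.
    if row not in seen:
        seen.add(row)
        unique_rows.append(row)

print(unique_rows)
```

This sketch keeps the first copy of a repeated row, while combine.py keeps the last one with keep='last'. The set of remaining rows is the same, only their order can differ.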
E. References
[1]. FIBA World Cup 2023
[2]. Playwright - a library to automate browsers
[3]. FIBA World Ranking presented by Nike
[4]. How to estimate the live FIBA ranking and rating
[5]. Github Page Crawler repository