Get FIBA World Cup 2023 match results using Playwright


A. Introduction

The FIBA World Cup 2023 [1] men's games are currently being played in Indonesia, Japan and the Philippines from August 25 to September 10. In this post, I will create a Playwright [2] script that crawls the FIBA team pages to get the match results and saves them to csv files. These files can be used to generate an approximate live rating [4] of the FIBA World Ranking, presented by Nike [3].

B. Page URL to crawl

The FIBA site has a page url for every team that participates in this world cup.

The base url is:

https://www.fiba.basketball/basketballworldcup/2023/

The team url has this pattern:

https://www.fiba.basketball/basketballworldcup/2023/team/{country}

where {country} can be 'Angola', 'Brazil', 'Cape-Verde', and so on.

The page that we are interested in is the games and results tab. For example, the USA team has the following url.

https://www.fiba.basketball/basketballworldcup/2023/team/USA#|tab=games_and_results

C. Playwright code

The script starts with the module docstring and the following imports.

"""Crawl match results of each team.

Results are taken from FIBA men world cup 2023.
"""


import random
import os.path
import time

from playwright.sync_api import sync_playwright
import pandas as pd

...

We need to install playwright and pandas, and download the Chromium browser that Playwright drives:

pip install playwright
pip install pandas
playwright install chromium

We also need a list of country names, written the way they appear in the team urls. Each name is used to build the url of a page to crawl.

COUNTRIES = ['Angola', 'Brazil', 'Cape-Verde', 'Cote-d-Ivoire', 'Egypt',
             'France', 'Germany', 'Iran', 'Japan', 'Latvia', 'Lithuania',
             'Montenegro', 'Philippines', 'Serbia', 'South-Sudan', 'USA',
             'Australia', 'Canada', 'China', 'Dominican-Republic',
             'Finland', 'Georgia', 'Greece', 'Italy', 'Jordan', 'Lebanon',
             'Mexico', 'New-Zealand', 'Puerto-Rico', 'Slovenia', 'Spain',
             'Venezuela']

And this is the url builder.

def build_url(country):
    """Creates a url from the given country."""
    return f'https://www.fiba.basketball/basketballworldcup/2023/team/{country}#|tab=games_and_results'
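As a quick sanity check, the builder can be run over a few names before any crawling starts. The snippet below repeats 'build_url()' so that it runs on its own; the 'BASE_URL' constant and the three sample names are only there for illustration.

```python
BASE_URL = 'https://www.fiba.basketball/basketballworldcup/2023'


def build_url(country):
    """Creates a url from the given country."""
    return f'{BASE_URL}/team/{country}#|tab=games_and_results'


# Print the page address for a few sample countries.
sample_urls = [build_url(c) for c in ['Angola', 'Cape-Verde', 'USA']]
for u in sample_urls:
    print(u)
```

The last printed line should match the USA url shown in section B.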

The entry point calls the url builder for each country.

# Entry point
for c in COUNTRIES:
    url = build_url(c)

...

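The main function that does the work is 'get_results()'. The results of each country are saved in a csv file under the 'data' folder. The file is named after the country's IOC code, which is looked up in 'CTY_TO_IOC' (defined in the full code).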

# Entry point
for c in COUNTRIES:
    url = build_url(c)
    cc = CTY_TO_IOC[c]

    fpath = f'./data/{cc}.csv'
    is_file = os.path.isfile(fpath)

    # Do not crawl twice.
    if is_file:
        continue

    df = get_results(url)

...
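The "do not crawl twice" check can also be tried on its own, without opening a browser. This is a minimal sketch, assuming a hypothetical helper 'pending_countries()' that is not in 'matchup.py'; it uses a temporary folder instead of './data' and creates the folder if it is missing.

```python
import os
import tempfile


def pending_countries(countries, cty_to_ioc, data_dir):
    """Return the countries that do not have a saved csv file yet."""
    os.makedirs(data_dir, exist_ok=True)  # create the folder on first run
    pending = []
    for c in countries:
        fpath = os.path.join(data_dir, f'{cty_to_ioc[c]}.csv')
        if not os.path.isfile(fpath):
            pending.append(c)
    return pending


# Demo in a temporary folder: pretend Angola was already crawled.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'ANG.csv'), 'w').close()
    todo = pending_countries(['Angola', 'Brazil'],
                             {'Angola': 'ANG', 'Brazil': 'BRA'}, tmp)
    print(todo)  # ['Brazil']
```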

This is the 'get_results()' function. It tries to visit the page, and if there is an error it waits a few seconds and visits the page again.

# Define a function to get the matchup and team results
def get_results(url):
    with sync_playwright() as p:

        # If there is failure, re-crawl.
        repeat = 0
        while True:
            browser = None
            try:
                browser = p.chromium.launch(headless=False)
                page = browser.new_page()
                ua = random.choice(USER_AGENTS)
                page.set_extra_http_headers({"User-Agent": ua})
                page.goto(url, timeout=30000)
                # query_selector() returns None when an element is missing,
                # so the next lookup raises and triggers a retry.
                a = page.query_selector('div.schedule_list.gmt')
                b = a.query_selector('ul')
                c = b.query_selector_all('div.game_item')
            except Exception as exc:
                repeat += 1
                print(f'attempt {repeat} failed: {exc!r}')
                # The browser does not exist if the launch itself failed.
                if browser is not None:
                    browser.close()
                time.sleep(SLEEP_TIME_REPEAT_INTERVAL)
            else:
                break

...
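The loop above retries forever, which is fine while watching the browser but can hang an unattended run. One possible variation is to cap the number of attempts. The following is only a sketch: 'retry()', 'max_attempts' and 'wait_sec' are hypothetical names, and 'flaky_visit()' is a stand-in for the real page visit so the example runs without Playwright.

```python
import time


def retry(action, max_attempts=3, wait_sec=0.0):
    """Call action() until it succeeds or max_attempts is reached."""
    last_exc = None
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:
            last_exc = exc
            print(f'attempt {attempt} failed: {exc!r}')
            if attempt < max_attempts:
                time.sleep(wait_sec)
    raise RuntimeError(f'gave up after {max_attempts} attempts') from last_exc


# Demo with a stand-in for the page visit that fails twice, then succeeds.
calls = {'n': 0}


def flaky_visit():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TimeoutError('page did not load')
    return ['game 1', 'game 2']


games = retry(flaky_visit, max_attempts=5)
print(games)
```

In the crawler, the body of the 'try' block would take the place of 'flaky_visit()', and 'wait_sec' would be set to 'SLEEP_TIME_REPEAT_INTERVAL'.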

If all elements are present, we proceed to extract the result info.

...

while True:

...
      
team_left, team_right, points = [], [], []
for g in c:
    tl = g.query_selector('table.country.left')
    d = tl.query_selector('div.name')
    team_left.append(d.inner_text())
    pt = g.query_selector('table.points')
    divs = pt.query_selector_all('div.number')
    pp = []
    # Use 'num' here so the Playwright object 'p' is not overwritten.
    for num in divs:
        n = num.inner_text()
        pp.append(n)
    points.append(pp)
    tr = g.query_selector('table.country.right')
    d = tr.query_selector('div.name')
    team_right.append(d.inner_text())

...
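The three lists collected above are turned into rows later in the full code. That step can be tried without a browser. This is a minimal sketch with a hypothetical 'to_rows()' helper; it assumes a game without two scores has not been played yet, and that the score text contains only digits so that 'int()' can convert it. The first two sample games come from 'ANG.csv' in section D, the third is a made-up unplayed game.

```python
def to_rows(team_left, points, team_right):
    """Pair each left/right team with its scores, skipping unplayed games."""
    rows = []
    for left, pts, right in zip(team_left, points, team_right):
        if len(pts) >= 2:  # assume fewer than two scores means not played
            rows.append([left, int(pts[0]), right, int(pts[1])])
    return rows


# Two finished games and one illustrative game with no scores yet.
rows = to_rows(['ANG', 'PHI', 'ANG'],
               [['67', '81'], ['70', '80'], []],
               ['ITA', 'ANG', 'DOM'])
print(rows)
```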

Full code

The full source code below can also be found in my GitHub Page Crawler [5] repository.

'matchup.py'
"""Crawl match results of each team.

Results are taken from FIBA men world cup 2023.
"""


import random
import os.path
import time

from playwright.sync_api import sync_playwright
import pandas as pd


SLEEP_TIME_REPEAT_INTERVAL = 5  # SEC


USER_AGENTS = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.2227.0 Safari/537.36',
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.3497.92 Safari/537.36',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36',
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
]

COUNTRIES = ['Angola', 'Brazil', 'Cape-Verde', 'Cote-d-Ivoire', 'Egypt',
             'France', 'Germany', 'Iran', 'Japan', 'Latvia', 'Lithuania',
             'Montenegro', 'Philippines', 'Serbia', 'South-Sudan', 'USA',
             'Australia', 'Canada', 'China', 'Dominican-Republic',
             'Finland', 'Georgia', 'Greece', 'Italy', 'Jordan', 'Lebanon',
             'Mexico', 'New-Zealand', 'Puerto-Rico', 'Slovenia', 'Spain',
             'Venezuela']

CTY_TO_IOC = {'Angola': 'ANG', 'Brazil': 'BRA', 'Cape-Verde': 'CPV', 'Cote-d-Ivoire': 'CIV',
              'Egypt': 'EGY', 'France': 'FRA', 'Germany': 'GER', 'Iran': 'IRI', 'Japan': 'JPN',
              'Latvia': 'LAT', 'Lithuania': 'LTU', 'Montenegro': 'MNE', 'Philippines': 'PHI',
              'Serbia': 'SRB', 'South-Sudan': 'SSD', 'USA': 'USA', 'Australia': 'AUS',
              'Canada': 'CAN', 'China': 'CHN', 'Dominican-Republic': 'DOM',
              'Finland': 'FIN', 'Georgia': 'GEO', 'Greece': 'GRE', 'Italy': 'ITA',
              'Jordan': 'JOR', 'Lebanon': 'LBN', 'Mexico': 'MEX',
              'New-Zealand': 'NZL', 'Puerto-Rico': 'PUR', 'Slovenia': 'SLO',
              'Spain': 'ESP', 'Venezuela': 'VEN'}


def build_url(country):
    """Creates a url from the given country."""
    return f'https://www.fiba.basketball/basketballworldcup/2023/team/{country}#|tab=games_and_results'


# Define a function to get the matchup and team results
def get_results(url):
    with sync_playwright() as p:

        # If there is failure, re-crawl.
        repeat = 0
        while True:
            browser = None
            try:
                browser = p.chromium.launch(headless=False)
                page = browser.new_page()
                ua = random.choice(USER_AGENTS)
                page.set_extra_http_headers({"User-Agent": ua})
                page.goto(url, timeout=30000)
                # query_selector() returns None when an element is missing,
                # so the next lookup raises and triggers a retry.
                a = page.query_selector('div.schedule_list.gmt')
                b = a.query_selector('ul')
                c = b.query_selector_all('div.game_item')
            except Exception as exc:
                repeat += 1
                print(f'attempt {repeat} failed: {exc!r}')
                # The browser does not exist if the launch itself failed.
                if browser is not None:
                    browser.close()
                time.sleep(SLEEP_TIME_REPEAT_INTERVAL)
            else:
                break

        team_left, team_right, points = [], [], []
        for g in c:
            tl = g.query_selector('table.country.left')
            d = tl.query_selector('div.name')
            team_left.append(d.inner_text())

            pt = g.query_selector('table.points')
            divs = pt.query_selector_all('div.number')
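            pp = []
            # Use 'num' here so the Playwright object 'p' is not overwritten.
            for num in divs:
                n = num.inner_text()
                pp.append(n)
            points.append(pp)

            tr = g.query_selector('table.country.right')
            d = tr.query_selector('div.name')
            team_right.append(d.inner_text())

        browser.close()

        data = []
        for left, pts, right in zip(team_left, points, team_right):
            # Skip games that do not have both scores (e.g. not yet played).
            if len(pts) >= 2:
                data.append([left, pts[0], right, pts[1]])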

        df = pd.DataFrame(data, columns=['C1', 'C1S', 'C2', 'C2S'])
        return df


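# Entry point
# Create the output folder if it is missing, otherwise to_csv() fails.
os.makedirs('./data', exist_ok=True)

for c in COUNTRIES: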
    url = build_url(c)
    cc = CTY_TO_IOC[c]

    fpath = f'./data/{cc}.csv'
    is_file = os.path.isfile(fpath)

    # Do not crawl twice.
    if is_file:
        continue

    df = get_results(url)

    # Save so that we will not attempt to crawl it again later when there is failure.
    df.to_csv(fpath, index=False)   

The command line is:

python matchup.py

You can rerun the script; countries that already have a saved file are not crawled again. If there is a new result for a country, delete its saved '.csv' file before running the script.

D. Sample output

'ANG.csv'
C1,C1S,C2,C2S
ANG,67,ITA,81
PHI,70,ANG,80
ANG,67,DOM,75
ANG,76,CHN,83
ANG,78,SSD,101

The headers are 'C1' for the country code at the left of the page, 'C1S' for its score, 'C2' for the country code at the right of the page and 'C2S' for its score.

In the first match, Angola lost to Italy with a score of 67-81. In the second game, Angola won against the Philippines with a score of 80-70.
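Reading the winner off a row can also be done in code. The following is a minimal sketch using only the standard library; the 'winner()' helper is hypothetical and the inline sample repeats the first two rows of 'ANG.csv'.

```python
import csv
import io

SAMPLE = """C1,C1S,C2,C2S
ANG,67,ITA,81
PHI,70,ANG,80
"""


def winner(row):
    """Return the code of the team with the higher score."""
    return row['C1'] if int(row['C1S']) > int(row['C2S']) else row['C2']


winners = [winner(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
print(winners)  # ['ITA', 'ANG']
```

Basketball games do not end in a draw, so the sketch does not handle equal scores.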

The csv files can be combined into a single csv file where all results are saved.

'combine.py'
import os
import pandas as pd


all_df = []
directory = './data'

for filename in os.listdir(directory):
    if filename.endswith('.csv'):
        df = pd.read_csv(f'{directory}/{filename}')
        all_df.append(df)

df1 = pd.concat(all_df, ignore_index=True)
df1 = df1.drop_duplicates(keep='last')
df1 = df1.reset_index(drop=True)
df1.to_csv('combined_results.csv', index=False)
print(df1)

Each game appears in the files of both teams, so the combined data contains duplicates. The script removes them with 'drop_duplicates()'.

The printed dataframe looks like this.

     C1  C1S   C2  C2S
0   CAN   65  BRA   69
1   ANG   76  CHN   83
2   CIV   77  BRA   89
3   ANG   67  DOM   75
4   ESP   94  CIV   64
..  ...  ...  ...  ...
75  SLO  100  VEN   85
76  VEN   75  CPV   81
77  GEO   70  VEN   59
78  JPN   86  VEN   77
79  FIN   90  VEN   75
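Note that 'drop_duplicates()' only removes rows that are identical. Judging from 'ANG.csv', where the Philippines game is listed as 'PHI,70,ANG,80', a game appears to keep the same left/right order on both team pages, so this is enough. If that were ever not the case, an order-independent key could be used instead. A minimal sketch, with a hypothetical 'dedupe_games()' helper that works on plain lists rather than a dataframe:

```python
def dedupe_games(rows):
    """Keep one row per game, even if the two teams appear swapped."""
    seen = set()
    unique = []
    for c1, c1s, c2, c2s in rows:
        # Sort the (team, score) pairs so both orientations share one key.
        key = tuple(sorted([(c1, c1s), (c2, c2s)]))
        if key not in seen:
            seen.add(key)
            unique.append([c1, c1s, c2, c2s])
    return unique


rows = [
    ['PHI', 70, 'ANG', 80],   # as it would be read from one team file
    ['PHI', 70, 'ANG', 80],   # exact duplicate from the other team file
    ['ANG', 80, 'PHI', 70],   # hypothetical swapped orientation
    ['ANG', 67, 'ITA', 81],
]
unique_rows = dedupe_games(rows)
print(unique_rows)
```

Like 'drop_duplicates()', this would also merge two different games between the same teams that happen to end with the same score.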

E. References

[1]. FIBA World Cup 2023
[2]. Playwright - a library to automate browsers
[3]. FIBA World Ranking presented by Nike
[4]. How to estimate the live FIBA ranking and rating
[5]. Github Page Crawler repository
