Part 1: Data Collection

Note:

I’m a big fan of use functional programming where appropriate, particularly with data transformations. This notebook serves as a way to guide you, the reader, through the process of the data, but it doesn’t serve as the primary structure for this project. As I am writing and working through the code, I will also be refactoring this into separate .py files for later usage. As a result, you’ll see that much of the code is organized into functions, even though it takes fewer lines to just write it in each cell.

Download Data

Our dataset comes from Julian McAuley and Wang-Cheng Kang at UC San Diego, hosted on Mcauley’s website. They have a handfule of datasets that were collected from Australian users, including reviews, user/item playtime, item metadata, and bundle data. We’re most interested in the V1 User-Items dataset, which describes the number of hours that users played on each of their games.

The file comes as a .json.gzip file, so we’ll need to extract the data before we can start processing it. More on that in the next section.

For now, we can download the raw, compressed data and save it to a local file. Let’s start by defining our input URL and output paths/directories.

from pathlib import Path

root_dir = Path.cwd().resolve().parent
raw_data_dir = root_dir / 'data' / 'raw'
raw_data_dir.mkdir(parents=True, exist_ok=True)

GZIP_PATH = raw_data_dir / 'ucsd_playtime.json.gz'
URL = 'https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_users_items.json.gz'

print(f"Source URL: '{URL}'")
print(f"Outputting data to file: '{GZIP_PATH.relative_to(root_dir)}'")
Source URL: 'https://mcauleylab.ucsd.edu/public_datasets/data/steam/australian_users_items.json.gz'
Outputting data to file: 'data/raw/ucsd_playtime.json.gz'

Now we can make simple use of the requests library to download the data and write the raw bytes to the output file.

import requests


def download(verbose: bool) -> None:
    resp = requests.get(URL)
    resp.raise_for_status()
    with GZIP_PATH.open('wb') as file:
        size = file.write(resp.content)
    
    if verbose:
        size_mb = size / 1024**2
        print(f"Wrote {size_mb:,.1f} MB to '{GZIP_PATH.relative_to(root_dir)}'")
    return


download(verbose=True)
Wrote 70.6 MB to 'data/raw/ucsd_playtime.json.gz'

Extract from GZIP

Now to extract the data from the .json.gzip file into a JSON file:

import gzip

JSON_PATH = raw_data_dir / 'ucsd_playtime.json'


def extract_from_gzip(verbose: bool) -> None:
    chunk_size = 1048576   # 1 MB
    size = 0

    with gzip.open(GZIP_PATH) as gzip_file, JSON_PATH.open('wb') as json_file:
        while True:
            chunk = gzip_file.read(chunk_size)
            if not chunk:
                break
            size += json_file.write(chunk)

    if verbose:
        size_mb = size / 1024**2
        print(f"Wrote {size_mb:,.1f} MB to '{JSON_PATH.relative_to(root_dir)}'")
    return


extract_from_gzip(verbose=True)
Wrote 527.5 MB to 'data/raw/ucsd_playtime.json'

Handling the JSON-by-Line Structure

Let’s take a quick peek at the data, just the first 3 lines. For display purposes, I’m going to use the textwrap.shorten() function from the Python Standard Library so that the display looks cleaner.

import textwrap


with JSON_PATH.open('r', encoding='UTF-8') as file:
    n_lines = sum(1 for _ in file)
    file.seek(0)

    print(f'Total lines in the file: {n_lines:,}')
    print('First 3 lines:')
    print()

    for _ in range(3):
        print(textwrap.shorten(file.readline(), width=80))
Total lines in the file: 88,310
First 3 lines:

{'user_id': '76561197970982479', 'items_count': 277, 'steam_id': [...]
{'user_id': 'js41637', 'items_count': 888, 'steam_id': [...]
{'user_id': 'evcentric', 'items_count': 137, 'steam_id': [...]

If we took a naive approach and tried calling pd.read_json() or json.load() on this file, we’d run into an error. It may seem strange that the data is stored this way, particularly how it uses a single-quote ' instead of the JSON-required double-quote ". However, we must consider the source of this data: The dataset was likely obtained through a REST API, which is unlikely to return all the data at once (certainly not 88,310 records).

Most likely, the authors McAuley & Kang ran a script that repeatedly requested data from the API for all possible users. To save progress along the way, they likely appended each record to the JSON file after it was retrieved, preventing any loss of data if there was an error along the way.

While this helps on the original data collection side with McAuley/Kang, it creates a couple problems for us. If we want to get this data into a Pandas DataFrame, we’ll need to handle 2 sides:

  1. Since this is a JSON-by-Line format, each record must be handled individually, splitting records by a new line separator (either '\n' or '\r\n').
  2. The data is stored as plain-text, so reading a single line will return a string that must be converted into an appropriate Python data structure (in this case a dictionary)

A module that is helpful here is Python’s abstract syntax grammar module, ast. This module has a function ast.literal_eval() that will convert a string into the appropriate Python object. Even better, it also handles nested data well, which we’ll find useful after converting.

Let’s try evaluating a single object to see what this looks like. From some exploration on the side, I found line 51 to be a good example.

import ast

skip_lines = 50

with open(JSON_PATH, 'r', encoding='UTF-8') as file:
    # Skip lines until reached target line
    for _ in range(skip_lines):
        _ = file.readline()
    
    line = file.readline()
    json_object = ast.literal_eval(line)

json_object
{'user_id': 'Leaf_Light_Moscow',
 'items_count': 5,
 'steam_id': '76561198305694024',
 'user_url': 'http://steamcommunity.com/id/Leaf_Light_Moscow',
 'items': [{'item_id': '4000',
   'item_name': "Garry's Mod",
   'playtime_forever': 4548,
   'playtime_2weeks': 1729},
  {'item_id': '221100',
   'item_name': 'DayZ',
   'playtime_forever': 48,
   'playtime_2weeks': 0},
  {'item_id': '304930',
   'item_name': 'Unturned',
   'playtime_forever': 1736,
   'playtime_2weeks': 0},
  {'item_id': '323370',
   'item_name': 'TERA',
   'playtime_forever': 390,
   'playtime_2weeks': 0},
  {'item_id': '346120',
   'item_name': 'Boring Man - Online Tactical Stickman Combat',
   'playtime_forever': 3,
   'playtime_2weeks': 0}]}

We can see the following data structure present:

  • user_id: str
  • items_count: int
  • steam_id: str
  • user_url: str
  • items: list[dict], with nested dictionaries:
    • item_id: str
    • item_name: str
    • playtime_forever: int
    • playtime_2weeks: int

This nested structure will need to be handled in a later notebook. Fortunately, we can still keep the lists of dictionaries when we convert it to a DataFrame, since Pandas allows for nested/complex data types.

For now, let’s process the data without cleaning/expanding any of the data. Since this may take a little while, I’ll add a progress bar using the tqdm library.

import pandas as pd
from tqdm import tqdm


def parse_json_by_line(verbose: bool) -> pd.DataFrame:
    with JSON_PATH.open('r', encoding='utf-8') as file:
        if verbose:
            total = sum(1 for _ in file)
            file.seek(0)
            file = tqdm(file, 'Parsing JSON-by-Line', total, colour='green')
        
        records = []
        for line in file:
            line = line.strip()
            if not line:
                continue
            records.append(ast.literal_eval(line))

    df = pd.DataFrame.from_records(records)
    return df


df = parse_json_by_line(verbose=True)
df.info()
df.head()
Parsing JSON-by-Line: 100%|██████████| 88310/88310 [01:06<00:00, 1322.36it/s]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 88310 entries, 0 to 88309
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   user_id      88310 non-null  object
 1   items_count  88310 non-null  int64 
 2   steam_id     88310 non-null  object
 3   user_url     88310 non-null  object
 4   items        88310 non-null  object
dtypes: int64(1), object(4)
memory usage: 3.4+ MB
user_id items_count steam_id user_url items
0 76561197970982479 277 76561197970982479 http://steamcommunity.com/profiles/76561197970... [{'item_id': '10', 'item_name': 'Counter-Strik...
1 js41637 888 76561198035864385 http://steamcommunity.com/id/js41637 [{'item_id': '10', 'item_name': 'Counter-Strik...
2 evcentric 137 76561198007712555 http://steamcommunity.com/id/evcentric [{'item_id': '1200', 'item_name': 'Red Orchest...
3 Riot-Punch 328 76561197963445855 http://steamcommunity.com/id/Riot-Punch [{'item_id': '10', 'item_name': 'Counter-Strik...
4 doctr 541 76561198002099482 http://steamcommunity.com/id/doctr [{'item_id': '300', 'item_name': 'Day of Defea...

Processing this data took me a little over 1 minute. If we wanted to speed this up, we could look at running this in parallel. However, 1 minute isn’t horrible, and this is really a one-time process, so we can leave that out.

Save to File

Now that we have some data to start cleaning, we should save it so that we don’t have to run this process again. Typically, I’d look for a format like feather for quick reading/writing. However, this can cause problems since we have the items column as a nested list. For simplicity, we can use a pickle format to serialize our data, which has the least amount of requirements.

PICKLE_PATH = raw_data_dir / 'ucsd_playtime.pkl'
df.to_pickle(PICKLE_PATH)

size_mb = PICKLE_PATH.stat().st_size / 1024**2
rel_path = PICKLE_PATH.relative_to(root_dir)
print(f"Wrote {size_mb:,.2f} MB to file: '{rel_path}'")
Wrote 488.93 MB to file: 'data/raw/ucsd_playtime.pkl'

It’s unfortunate that our file size is so massive when using pickle, but this give us much better flexibility to stop here and resume cleaning in the next notebook.