Apr 22, 2024 8 MIN

The OpenBB Platform data pipeline - How we conquer financial data with TET

Navigating the vast ocean of financial data can be overwhelming, even for experienced finance professionals. At OpenBB, we simplify this challenge with the TET (Transform-Extract-Transform) pipeline, which standardizes how data is processed, improves accuracy, and makes complex data actionable.


Introduction

When working in FinTech, dealing with vast amounts of data, in all its shapes and sizes, can be a real pain point.

Data can be wildly different between providers; it often requires cleaning and can be hard to reconcile against other sources of truth.

API conventions differ between providers, the data often requires transformation, and it can be hard to quickly pinpoint the point of failure, to name just a few of the issues.

If you have never encountered the data science acronym ETL, it stands for:

  • Extract

  • Transform

  • Load

These are the general steps for acquiring data, cleaning it in preparation for use, and then loading it into memory or storage for the end user.

At OpenBB, we conquer data with a TET (Transform-Extract-Transform) pipeline.

The OpenBB Platform plays the role of the staging area and divides the process for fetching data into three steps — TET:

  • Transform the query.

  • Extract the data.

  • Transform the data.

The TET Pipeline

The TET pipeline is contained within the Fetcher class, the main executor holding the logic for how each step should behave. Each data source and provider inherits from this class and supplies its own QueryParams and Data definitions as the instructions for execution and validation.
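
To make that structure concrete, here is a minimal sketch of the shape a provider Fetcher takes. The class and method bodies are illustrative stand-ins, not the Platform's actual implementation; real fetchers subclass a generic Fetcher base class and work with Pydantic models rather than plain dicts.

from typing import Any, Dict, List, Optional


class ExampleFetcher:
    """Minimal sketch of the three TET stages a provider Fetcher implements."""

    @staticmethod
    def transform_query(params: Dict[str, Any]) -> Dict[str, Any]:
        # 1. Transform Query: validate the user input and map standardized
        #    parameter names to the provider's own conventions.
        return params

    @staticmethod
    def extract_data(query: Dict[str, Any], credentials: Optional[Dict[str, str]] = None, **kwargs: Any) -> List[dict]:
        # 2. Extract Data: call the provider's API with the transformed query,
        #    handling pagination, and return the raw records.
        return []

    @staticmethod
    def transform_data(query: Dict[str, Any], data: List[dict], **kwargs: Any) -> List[dict]:
        # 3. Transform Data: rename fields to the standard schema and enforce
        #    types before handing the results back to the user.
        return data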

It begins with QueryParams, which are Pydantic models of parameters for querying a specific provider endpoint. The model is validated from the user-supplied input. For example,

obb.equity.price.historical("aapl", start_date="2024-01-01", provider="yfinance")

The function inputs go into the first stage of the pipeline, Transform Query.
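
For reference, a standardized QueryParams model looks roughly like the sketch below. The class name and fields here are simplified stand-ins rather than the Platform's actual standard model.

from datetime import date
from typing import Optional

from pydantic import BaseModel, Field


class ExampleHistoricalQueryParams(BaseModel):
    """Simplified stand-in for a standardized historical-price query model."""

    symbol: str = Field(description="Ticker symbol, e.g. 'AAPL'.")
    start_date: Optional[date] = Field(default=None, description="Start of the requested window.")
    end_date: Optional[date] = Field(default=None, description="End of the requested window.")
    interval: str = Field(default="1d", description="Bar interval, e.g. '1d' or '1h'.")


# Pydantic coerces and validates the user-supplied input: the string
# "2024-01-01" becomes a date object before anything is sent to a provider.
query = ExampleHistoricalQueryParams(symbol="AAPL", start_date="2024-01-01")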

1. Transform Query

Every provider has different words for the same thing, but OpenBB translates them into a standardized interface.

Furthermore, some may require pagination to get all the data requested, and others may not allow filtering the request directly.

There are incorrectly documented parameters and response types to deal with, and worse yet, undocumented ones. It can feel like they don’t even want you to know how to use the service they provide.

The table below compares a few basic parameter names for historical stock prices.

| Parameter | Polygon | Intrinio | FMP |
| --- | --- | --- | --- |
| symbol | stocksTicker | identifier | symbol |
| start_date | from | start_date | from |
| end_date | to | end_date | to |
| interval | multiplier + timespan | interval_size | timeframe |
| Requires Pagination | Yes | Yes | Unclear |
| Max Results Per Page | 50,000 | 10,000 for daily/monthly/weekly, but 500 for intraday | Unclear, maybe limited via date range |
| Additional Considerations | Interval can be any combination of multiplier and minute/day/week/month/year | Different endpoints for intraday and daily/weekly/monthly; deviates from the conventional meaning of intraday | Endpoints are poorly labeled and documented |
| Request Structure | Standard URL query string | Standard URL query string | Same as Intrinio |

| Parameter | Polygon | Intrinio | FMP |
| --- | --- | --- | --- |
| Interval (one hour) | multiplier=60, timespan="minute" | interval_size="1h" | timeframe="1hour" |

Similar differences exist between the accepted values for these parameters; the second table above shows, for example, how each provider expresses a one-hour interval.

The OpenBB Platform smooths over these differences, switching seamlessly between sources.

Comparing multiple sources of data is as simple as changing the “provider” parameter.

import pandas as pd
from openbb import obb

# Compare closing prices for the same symbol and date range across providers.
compare = pd.DataFrame()
for provider in ["yfinance", "alpha_vantage", "tiingo", "polygon", "intrinio", "fmp"]:
    compare[provider] = obb.equity.price.historical(
        "AAPL",
        start_date="2024-03-03",
        end_date="2024-03-15",
        interval="1d",
        provider=provider,
    ).to_df().get("close")
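
Under the hood, the first stage mostly amounts to renaming and reshaping parameters. The helper below is a hypothetical sketch based on the comparison table above, showing how a standardized interval such as "1h" could be split into Polygon-style multiplier and timespan values.

from typing import Any, Dict

# Hypothetical unit mapping derived from the comparison table above.
_TIMESPAN_UNITS = {"m": "minute", "h": "minute", "d": "day", "W": "week", "M": "month"}


def transform_query_for_polygon(params: Dict[str, Any]) -> Dict[str, Any]:
    """Sketch: map standardized query parameters to Polygon-style names."""
    interval = params.get("interval", "1d")
    multiplier, unit = int(interval[:-1]), interval[-1]
    if unit == "h":
        multiplier *= 60  # one hour is expressed as 60 x "minute"
    return {
        "stocksTicker": params["symbol"],
        "from": params.get("start_date"),
        "to": params.get("end_date"),
        "multiplier": multiplier,
        "timespan": _TIMESPAN_UNITS[unit],
    }


# "1h" becomes multiplier=60, timespan="minute", as in the table above.
print(transform_query_for_polygon({"symbol": "AAPL", "interval": "1h", "start_date": "2024-03-03"}))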

2. Extract Data

When you call

obb.equity.price.historical(
    "AAPL",
    interval="1d",
    start_date="2024-03-03",
    end_date="2024-03-15",
    provider="yfinance"
)

a standardized object is returned.
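
That object carries the validated results, so the same access patterns work regardless of which source served the request. For example:

from openbb import obb

data = obb.equity.price.historical("AAPL", interval="1d", provider="yfinance")

records = data.results  # list of validated, standardized Data models
df = data.to_df()       # pandas DataFrame with standardized column names
print(df[["open", "high", "low", "close", "volume"]].head())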

Once the parameters are molded to fit the provider-specific schema, they are passed on to the second stage, Extract Data.

This stage is where the bulk of the work gets done, grabbing the data from the provider. The transformed parameters are used to query the source and return the raw data.

Depending on the type and complexity of the data, multiple API calls may be required. The final result is the concatenation of each response, typically served as a list of dictionaries.

Some light parsing and shaping may be required during this stage, but generally, the data is still “as-is” from the provider. Breaking the process up here lets the developer see whether a point of failure is in the data request or in the data itself.
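
A simplified version of this stage might look like the sketch below. The endpoint URL, parameter names, and pagination scheme are placeholders for illustration, not any specific provider's actual API.

from typing import Any, Dict, List, Optional

import requests

# Placeholder endpoint; every provider has its own URL scheme and authentication.
EXAMPLE_ENDPOINT = "https://api.example-provider.com/v1/historical"


def extract_data(query: Dict[str, Any], api_key: str) -> List[dict]:
    """Sketch: request the raw records, following pagination until exhausted."""
    results: List[dict] = []
    params: Dict[str, Any] = {**query, "apiKey": api_key}
    url: Optional[str] = EXAMPLE_ENDPOINT
    while url:
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()
        payload = response.json()
        results.extend(payload.get("results", []))
        # Many providers return a cursor or "next_url" pointing to the next page.
        url = payload.get("next_url")
        params = {"apiKey": api_key}  # subsequent pages usually only need auth
    return results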

The sample data below is from Polygon. Here, the “results” array is what gets passed through to the next stage.

{
  "adjusted": true,
  "queryCount": 3,
  "results": [
    {
      "T": "KIMpL",
      "c": 25.9102,
      "h": 26.25,
      "l": 25.91,
      "n": 74,
      "o": 26.07,
      "t": 1602705600000,
      "v": 4369,
      "vw": 26.0407
    },
    {
      "T": "TANH",
      "c": 23.4,
      "h": 24.763,
      "l": 22.65,
      "n": 1096,
      "o": 24.5,
      "t": 1602705600000,
      "v": 25933.6,
      "vw": 23.493
    },
    {
      "T": "VSAT",
      "c": 34.24,
      "h": 35.47,
      "l": 34.21,
      "n": 4966,
      "o": 34.9,
      "t": 1602705600000,
      "v": 312583,
      "vw": 34.4736
    }
  ],
  "resultsCount": 3,
  "status": "OK"
}

3. Transform Data

The Transform Data stage is where the finishing touches are made. Much like the parameters, data fields will vary greatly between providers. Standardized names and type enforcement are applied here.

The sample data above is mapped to the standard fields using the __alias_dict__ property of the Pydantic model. Extra, provider-specific fields are then defined in the model.

from typing import Optional

from pydantic import Field, PositiveInt

# Standard model supplied by the OpenBB Platform (import path as of Platform v4).
from openbb_core.provider.standard_models.equity_historical import EquityHistoricalData


class PolygonEquityHistoricalData(EquityHistoricalData):
    """Polygon Equity Historical Price Data."""

    __alias_dict__ = {
        "date": "t",
        "open": "o",
        "high": "h",
        "low": "l",
        "close": "c",
        "volume": "v",
        "vwap": "vw",
    }

    transactions: Optional[PositiveInt] = Field(
        default=None,
        description="Number of transactions for the symbol in the time period.",
        alias="n",
    )
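
With the model defined, the final stage is largely a matter of validating each raw record into it. A minimal sketch, assuming the raw "results" list from the previous stage (the real fetcher also handles details such as converting Polygon's epoch-millisecond timestamps, which are glossed over here):

from typing import Any, Dict, List


def transform_data(data: List[Dict[str, Any]]) -> List[PolygonEquityHistoricalData]:
    """Sketch: validate raw provider records into the standardized model."""
    # The alias mapping ("o" -> "open", "vw" -> "vwap", ...) and type
    # enforcement are applied here, so bad records fail fast instead of
    # surfacing later in the user's analysis.
    return [PolygonEquityHistoricalData.model_validate(record) for record in data]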

As part of a general ETL pipeline, the output of this process fits somewhere near the end of the “Transform” stage. It may need further transformation to conform with any specific requirements, but most of the annoying details have been taken care of, such as:

  • Data is guaranteed to be JSON serializable.

  • Data is delivered as a validated Pydantic model.

  • Types are strictly enforced.

  • Numbers are numbers, dates are dates, and strings are strings.

  • NaN, empty strings, and the various string representations of None are converted to null.

  • Field names are always in “lower_snake_case” and have standardized names across all data sources, wherever possible.

  • OHLC+V will always be: “open”, “high”, “low”, “close”, and “volume”.

This transformation is what makes it possible to compare the same data across several sources. The OpenBB Platform takes care of the details so that, as a user, you don’t have to worry about whether a particular dataset returns numbers as strings, or figure out which format the dates are stored in and how to parse them.

Why TET?

Working with TET has given us a better understanding of its strengths and weaknesses.

Pros of TET:

  • It helps us build and ship faster by having a clear and defined structure to follow.

  • It allows us to test each part of the data retrieval logic.

  • It segregates errors, making them easier to spot and deal with.

  • It keeps the codebase organized and tidy.

  • It makes the codebase modular and easy to extend.

Cons of TET:

  • It can be an overly verbose approach for quick implementations.

  • It heavily relies on the data standardization framework.

  • Some steps can be redundant (e.g., not all data or parameters require transformations).

Conclusion

OpenBB uses the Transform-Extract-Transform pattern because it provides the framework for standardizing queries and data across all sources.

You can change the way you do investment research today, for free.

Check out our Platform or explore the Documentation to start now!

