Dalgo

Introduction

Dalgo is an open-source data platform for the social sector. It is maintained by Project Tech4Dev, which hosts a commercial multi-tenant instance at https://dashboard.dalgo.in/.

Dalgo tenants are called organizations. In our commercial offering, most organizations are client NGOs using our platform. We also have an organization for demos, and an organization for internal dashboarding. The platform does not distinguish between these in terms of functionality.

Every organization has an associated warehouse. We currently support Postgres and BigQuery warehouses, and could potentially support Snowflake and Redshift.

Dalgo allows an organization to configure multiple data sources and set a schedule on which data is ingested from those sources into its warehouse. Under the hood we do this using Airbyte.
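
For illustration, here is a minimal sketch of triggering a sync for one configured connection through Airbyte's HTTP API; the host, port and connection id are placeholders, and in Dalgo these calls are made through Prefect rather than directly from the backend.

# Minimal sketch: trigger an Airbyte sync over its HTTP API.
# The host/port and connection id are placeholders.
import requests

AIRBYTE_API = "http://localhost:8000/api/v1"  # assumption: a local Airbyte server

def trigger_sync(connection_id: str) -> dict:
    """Ask Airbyte to start a sync job for one configured connection."""
    response = requests.post(
        f"{AIRBYTE_API}/connections/sync",
        json={"connectionId": connection_id},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()  # contains the job Airbyte created for this sync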

Once the data is in the warehouse it typically undergoes several transformations to ready it for its various consumers. This is done using SQL, and Dalgo supports it via dbt.

Our support for different warehouse types is restricted only by what Airbyte and dbt support.

Architecture

Dalgo uses Prefect to orchestrate jobs across its tenants. This means that all Airbyte sync jobs and all dbt run jobs are queued into and executed by Prefect. Prefect runs these jobs on Airbyte and dbt and makes the logs available to the Dalgo platform.
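
As a rough illustration of this orchestration, the sketch below chains an Airbyte sync and a dbt run in a single Prefect flow; the Airbyte endpoint, connection id and project directory are placeholders, and Dalgo's actual flows will differ.

# Minimal sketch of a Prefect flow that ingests via Airbyte and then transforms
# via dbt. Endpoints, ids and paths are placeholders.
import subprocess

import requests
from prefect import flow, task

@task
def airbyte_sync(connection_id: str) -> dict:
    """Trigger an Airbyte sync job for the organization's connection."""
    resp = requests.post(
        "http://localhost:8000/api/v1/connections/sync",  # assumed Airbyte host
        json={"connectionId": connection_id},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

@task
def dbt_run(project_dir: str) -> None:
    """Run the organization's dbt project against its warehouse."""
    subprocess.run(["dbt", "run"], cwd=project_dir, check=True)

@flow
def sync_and_transform(connection_id: str, project_dir: str) -> None:
    """Ingest first, then transform; Prefect records logs for both steps."""
    airbyte_sync(connection_id)
    dbt_run(project_dir)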

Dalgo Infrastructure

The Dalgo backend is a Django application which serves requests made from a React frontend. The backend communicates with Prefect via a lightweight proxy since Prefect’s Python SDK is all async and Django seems to support async Python only half-heartedly.
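
As a sketch of the proxy pattern, assume a small FastAPI service in front of Prefect; FastAPI is fully async, so it can await Prefect's client and expose a plain HTTP endpoint. The route and response shape below are illustrative, not Dalgo's actual API.

# Minimal sketch of a lightweight proxy between Django and Prefect.
from uuid import UUID

from fastapi import FastAPI
from prefect.client.orchestration import get_client

app = FastAPI()

@app.get("/proxy/flow_runs/{flow_run_id}")
async def get_flow_run(flow_run_id: UUID):
    """Fetch one flow run from Prefect and return a summary to the backend."""
    async with get_client() as client:  # Prefect's async client
        flow_run = await client.read_flow_run(flow_run_id)
    return {"id": str(flow_run.id), "state": flow_run.state_name}

The Django backend then talks to this service over plain HTTP and never has to manage an event loop itself.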

The backend maintains organizations and users in a database, and handles email communication to these users when required. Most users belong to only one organization, but the Dalgo team and our implementation partners are usually connected to more than one, and the platform lets us switch context under the same login.

Every organization has its own dedicated Airbyte workspace, which acts as a container for its data sources and its data warehouse. Prefect has no corresponding support, so we attach metadata to Prefect configurations to help us track which organizations they belong to. The backend’s database tracks every Prefect deployment against its organization, and the platform retrieves flow runs and associated logs through each deployment.
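
A hypothetical sketch of what this mapping could look like in the backend's database; the model and field names are illustrative, not Dalgo's actual schema.

# Hypothetical Django models: each Org owns an Airbyte workspace, and each
# Prefect deployment is tracked against the Org it belongs to.
from django.db import models

class Org(models.Model):
    name = models.CharField(max_length=100)
    airbyte_workspace_id = models.CharField(max_length=36)  # the org's Airbyte workspace

class OrgDataFlow(models.Model):
    """One Prefect deployment owned by one organization."""
    org = models.ForeignKey(Org, on_delete=models.CASCADE)
    deployment_id = models.CharField(max_length=36)  # Prefect deployment UUID
    name = models.CharField(max_length=100)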

For Prefect to run dbt jobs, the organization's dbt credentials must exist within a dbt CLI profile. The platform stores a reference to this profile for every organization.
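
For illustration, a dbt invocation can be pointed at one organization's profile like this; the directory layout and profile name are assumptions.

# Minimal sketch: run dbt using the credentials stored in an org's CLI profile.
import subprocess

def run_dbt_for_org(project_dir: str, profiles_dir: str, profile_name: str) -> None:
    subprocess.run(
        [
            "dbt", "run",
            "--project-dir", project_dir,
            "--profiles-dir", profiles_dir,  # directory containing profiles.yml
            "--profile", profile_name,       # the profile recorded for this org
        ],
        check=True,
    )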

Dalgo uses Prefect deployments in two ways:

  1. For users to trigger an Airbyte sync or a dbt run from the UI
  2. For Orchestration Pipelines, which can have a series of steps and usually run on a schedule

In general, we create a Prefect deployment for a task if we want to provide a history of logs for that task. We don't create deployments for short-lived commands whose logs we don't need to retain.

Logs for these commands are displayed to the user when triggered from the UI, but are lost when they navigate away from the page.
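
For case 1 in the list above, a deployment can be triggered on demand roughly like this; the deployment name is a placeholder, and timeout=0 asks Prefect not to wait for the run to finish.

# Minimal sketch: trigger an existing Prefect deployment when a user clicks
# "sync" or "run" in the UI, and hand back the flow run id for later polling.
from prefect.deployments import run_deployment

def trigger_manual_run(deployment_name: str):
    flow_run = run_deployment(name=deployment_name, timeout=0)  # don't block
    return flow_run.id  # the backend can fetch status and logs for this run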

Users and Roles

We use Django’s auth system and their standard User model. Authentication is done via django-rest-framework using Token Authentication. Every email address maps to a unique User object.

Membership in an Org is tracked via an OrgUser object, which is essentially a User, an Org and a role. The platform currently has five roles:

  1. Super Admin (for T4D employees who manage our Dalgo installation)
  2. Account Manager
  3. Pipeline Manager
  4. Analyst
  5. Guest
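
A hypothetical sketch of the OrgUser mapping described above; names and role values are illustrative, not Dalgo's exact schema. Because a User can appear in several OrgUser rows, the same login can switch between organizations.

# Hypothetical OrgUser model: a User, an Org and a role.
from django.contrib.auth.models import User
from django.db import models

class OrgUser(models.Model):
    ROLE_CHOICES = [
        ("super-admin", "Super Admin"),
        ("account-manager", "Account Manager"),
        ("pipeline-manager", "Pipeline Manager"),
        ("analyst", "Analyst"),
        ("guest", "Guest"),
    ]

    user = models.ForeignKey(User, on_delete=models.CASCADE)
    org = models.ForeignKey("Org", on_delete=models.CASCADE)  # assumes an Org model
    role = models.CharField(max_length=20, choices=ROLE_CHOICES, default="guest")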

All Django API endpoints are decorated with has_permission and a list of required permissions, for example:

@pipelineapi.post("v1/flows/", auth=auth.CustomAuthMiddleware())
@has_permission(["can_create_pipeline"])
def post_prefect_dataflow_v1(request, payload: PrefectDataFlowCreateSchema4):
    """Create a prefect deployment i.e. a ddp dataflow"""
    ...
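
A hypothetical sketch of what a has_permission decorator like this might look like; Dalgo's real implementation will differ, and the ROLE_PERMISSIONS mapping and request.orguser attribute are assumptions made for the sketch.

# Hypothetical has_permission decorator: compare the permissions required by
# the endpoint against those granted to the caller's role.
from functools import wraps

from django.core.exceptions import PermissionDenied

ROLE_PERMISSIONS = {  # illustrative role -> permission mapping
    "account-manager": {"can_create_pipeline", "can_view_pipeline"},
    "guest": {"can_view_pipeline"},
}

def has_permission(required_permissions):
    def decorator(view_func):
        @wraps(view_func)
        def wrapper(request, *args, **kwargs):
            granted = ROLE_PERMISSIONS.get(request.orguser.role, set())
            if not set(required_permissions).issubset(granted):
                raise PermissionDenied("insufficient permissions")  # -> HTTP 403
            return view_func(request, *args, **kwargs)
        return wrapper
    return decorator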

Celery

Some tasks triggered by HTTP requests take too long to complete within the request/response cycle. We offload these to Celery, which gives us a task id that we can use to poll the status of the task.
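
The pattern looks roughly like this; task and function names are illustrative.

# Minimal sketch: hand a slow job to Celery and poll it by task id.
from celery import shared_task
from celery.result import AsyncResult

@shared_task
def run_long_job(org_id: int) -> dict:
    ...  # the slow work happens here, outside the request/response cycle
    return {"org_id": org_id, "status": "done"}

def start_job(org_id: int) -> str:
    """Called from the HTTP endpoint; returns immediately with a task id."""
    return run_long_job.delay(org_id).id

def poll_job(task_id: str) -> str:
    """Called later by the frontend to check progress."""
    return AsyncResult(task_id).status  # e.g. PENDING, STARTED, SUCCESS, FAILURE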

Celery is also used to schedule regular maintenance. One example is releasing TaskLocks which are not released during the normal cleanup process.
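
A minimal sketch of such a schedule with Celery beat; the task path and timing are placeholders.

# Hypothetical Celery beat schedule for periodic maintenance, e.g. releasing
# stale TaskLocks every night.
from celery import Celery
from celery.schedules import crontab

app = Celery("dalgo")

app.conf.beat_schedule = {
    "release-stale-tasklocks": {
        "task": "dalgo.tasks.release_stale_tasklocks",  # hypothetical task path
        "schedule": crontab(hour=2, minute=0),          # every day at 02:00
    },
}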