Superset on Databricks
We have data on S3, with SQL tables defined over it in Databricks, so I wanted to connect Superset to visualize the data. Thanks to the databricks-dbapi project, this turns out to be as simple as `pip install databricks-dbapi[sqlalchemy]` (the `[sqlalchemy]` extra pulls in the base package as well), then configuring a new Superset > Source > Database with the SQLAlchemy URI `databricks+pyhive://token:<token>@<companyname>.cloud.databricks.com:443/<database>?cluster=<cluster_id>`
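To make the URI's moving parts concrete, here is a small sketch of how it is assembled from its components. All of the values below (token, workspace host, database, cluster id) are made-up placeholders, not real credentials:

```python
# Assemble the Superset/SQLAlchemy connection URI for Databricks.
# Every value here is a hypothetical placeholder for illustration.
token = "dapi0123456789abcdef"             # personal access token (made up)
host = "companyname.cloud.databricks.com"  # workspace hostname (made up)
database = "default"                       # database to connect to
cluster_id = "1009-160350-indue40"         # middle segment of the cluster config URL

uri = (
    f"databricks+pyhive://token:{token}@{host}:443/"
    f"{database}?cluster={cluster_id}"
)
print(uri)
```

Note that the literal username is the word `token`; the actual secret goes in the password position of the URI.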
Just keep in mind that:
- The token value is only shown once, at the moment you create it in Databricks. The “Token ID” listed on the “Access Tokens” page is just an identifier, not the token itself.
- cluster_id is the middle segment of the cluster configuration URL (e.g. /#/setting/clusters/1009-160350-indue40/configuration, where the cluster_id is 1009-160350-indue40)
- You need to restart Superset after you install the packages
- Queries will be slow if they have to scan a lot of data, so consider partitioning on date and then restricting to just a few days.
- You may use any Spark SQL built-in function, such as `parse_url(url_col, 'HOST')` or `approx_count_distinct(userid)`
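The last two points combine naturally: a query that both prunes date partitions and uses Spark SQL built-ins might look like the sketch below. The table (`events`) and columns (`dt`, `url_col`, `userid`) are hypothetical names for illustration:

```python
# A hypothetical Spark SQL query, built as a string, that restricts the
# scan to one week of date partitions and uses two built-in functions.
query = """
SELECT parse_url(url_col, 'HOST')     AS host,
       approx_count_distinct(userid)  AS approx_users
FROM   events
WHERE  dt >= date_sub(current_date(), 7)  -- prune to the last 7 partitions
GROUP  BY parse_url(url_col, 'HOST')
"""
print(query)
```

The `WHERE dt >= ...` filter is what keeps Superset charts responsive: without it, Spark scans every partition of the table on each refresh.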