Superset on Databricks
We have data on S3, with SQL tables defined over it in Databricks, so I wanted to connect Superset to visualize the data. Thanks to the databricks-dbapi project, this turns out to be as simple as `pip install databricks-dbapi[sqlalchemy]` (the `[sqlalchemy]` extra pulls in the base package as well), then configuring a new Superset > Source > Database with the SQLAlchemy URI `databricks+pyhive://token:<token>@<companyname>.cloud.databricks.com:443/<database>?cluster=<cluster_id>`
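To make the URI's moving parts concrete, here is a small sketch of how it is assembled from its components. All of the values below (token, workspace host, database, cluster id) are made-up placeholders, not real credentials:

```python
# Assemble the Superset/SQLAlchemy connection URI for Databricks.
# Every value here is a hypothetical placeholder for illustration.
token = "dapi0123456789abcdef"             # personal access token (made up)
host = "companyname.cloud.databricks.com"  # workspace hostname (made up)
database = "default"                       # database to connect to
cluster_id = "1009-160350-indue40"         # middle segment of the cluster config URL

uri = (
    f"databricks+pyhive://token:{token}@{host}:443/"
    f"{database}?cluster={cluster_id}"
)
print(uri)
```

Note that the literal username is the word `token`; the actual secret goes in the password position of the URI.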
Just keep in mind that:
- The token value is only shown once, at the moment you create it in Databricks. The “Token ID” listed on the “Access Tokens” page is just an identifier, not the token itself.
- cluster_id is the middle segment of the cluster configuration URL (e.g. /#/setting/clusters/1009-160350-indue40/configuration, where the cluster_id is 1009-160350-indue40)
- You need to restart Superset after you install the packages
- Queries will be slow if they have to scan a lot of data, so consider partitioning on date and then restricting to just a few days.
- You may use any Spark SQL built-in function, such as `parse_url(url_col, 'HOST')` or `approx_count_distinct(userid)`
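The last two points combine naturally: a query that both prunes date partitions and uses Spark SQL built-ins might look like the sketch below. The table (`events`) and columns (`dt`, `url_col`, `userid`) are hypothetical names for illustration:

```python
# A hypothetical Spark SQL query, built as a string, that restricts the
# scan to one week of date partitions and uses two built-in functions.
query = """
SELECT parse_url(url_col, 'HOST')     AS host,
       approx_count_distinct(userid)  AS approx_users
FROM   events
WHERE  dt >= date_sub(current_date(), 7)  -- prune to the last 7 partitions
GROUP  BY parse_url(url_col, 'HOST')
"""
print(query)
```

The `WHERE dt >= ...` filter is what keeps Superset charts responsive: without it, Spark scans every partition of the table on each refresh.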