DataBricks and PySpark
by Mark Nielsen
Copyright August 2023
- Links
- Get a DataBricks 14-day trial
- Saving passwords
- Expect and automation
Links
- https://www.databricks.com/try-databricks?#account
- https://spark.apache.org/docs/latest/api/python/index.html
- https://www.databricks.com/glossary/pyspark
- https://medium.com/analytics-vidhya/beginners-guide-on-databricks-spark-using-python-pyspark-de74d92e4885
- https://accounts.cloud.databricks.com
- https://www.databricks.com/product/aws
- https://www.databricks.com/resources/demos?itm_data=navbar_watchdemos
- https://www.databricks.com/resources/learn/training/lakehouse-fundamentals
- https://customer-academy.databricks.com/learn/catalog
- https://us-west-2.console.aws.amazon.com/cloudformation/home
if you choose us-west-2
- https://docs.databricks.com/en/administration-guide/workspace/quick-start.html
- https://docs.databricks.com/en/archive/dev-tools/cli/stack-cli.html
- https://docs.databricks.com/en/notebooks/index.html
- Troubleshooting : https://dbricks.co/AWSQuickStartHelp
DataBricks 14-day trial
- Have an AWS account.
- Sign up for the 14-day trial: https://www.databricks.com/try-databricks?#account
Make sure you select the 14-day trial.
- Get to the DataBricks dashboard.
Click on Workspaces.
- Create a workspace called "ws1" with the quickstart. Choose Oregon (us-west-2).
- Choose the default name for the stack. Select Trial. Put in a password. Click the checkbox. Click Create.
- Now you are in CloudFormation in AWS, under Stacks, at the stack "s1".
NOTE: on the CloudFormation page, if you delete and recreate a stack, refresh the page.
- What is a workspace?
https://docs.databricks.com/en/getting-started/concepts.html
- What is a stack? A stack is an AWS CloudFormation concept.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/stacks.html
- The stack in AWS and the workspace in DataBricks should both exist now.
- It may take 5 or 10 minutes to create the workspace and stack.
- Once done, the workspace will show up on your DataBricks page: https://accounts.cloud.databricks.com/workspaces
- When you check, reload the page. If the workspace is not ready, it should say "provisioning".
- You won't be able to assign your workspace to Unity Catalog or a metastore under "Data" in the trial version.
If you have problems deleting and recreating workspaces...
- Try to delete the workspace in DataBricks.
- If that doesn't work, delete the stack in AWS, delete the workspace in DataBricks, and delete the "Credential configuration" and "Storage configuration" ONLY for that workspace. Don't delete any others --- just the ones for that workspace.
To get credentials for the next part
https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html
Create a token.
https://docs.databricks.com/en/dev-tools/auth.html
- In the main DataBricks page, click on "ws1".
- Click on "Open Workspace". It will open a new webpage for ws1.
- On the top right-hand side, click on your account name or email address.
- Click on "User Settings".
- Click on "Access tokens".
- Click on "Generate new token".
- Name the token (a 90-day lifetime is fine) and click on "Generate".
- Copy the generated token.
- NOTE: this token acts as you (as far as I can tell), so beware if you are using an admin account.
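A quick way to sanity-check the token is a small Python sketch against the Databricks REST API. This is my own illustration, not part of the official steps; the hostname and token below are the masked example values used later on this page, so substitute your own.

import requests

host = "dbc-f051dcab-XXXXX.cloud.databricks.com"  # your workspace hostname
token = "XXXXXXXb653624f279d80142a5b5e86a6b3e"    # the token you just generated

# The token is sent as a Bearer credential; HTTP 200 means it works.
r = requests.get("https://" + host + "/api/2.0/clusters/list",
                 headers={"Authorization": "Bearer " + token})
print(r.status_code)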
Get the other credentials -- the server hostname and HTTP path.
https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html
- In the main DataBricks page, click on "ws1".
- Click on "Open Workspace". It will open a new webpage for ws1.
- Click on "Compute".
- Click on the cluster. You should only have one.
- Click "Advanced options" at the bottom of the page; you should see JDBC/ODBC.
- Record the "Server Hostname" and "HTTP Path".
ODBC and Python
I had to do this on my AWS EC2 server because its version of Ubuntu was older and I had things running on it that I didn't want to upgrade.
https://docs.databricks.com/en/dev-tools/pyodbc.html
- Execute: mkdir databricks; cd databricks
- Download the Databricks ODBC driver.
- Execute: unzip SimbaSparkODBC-2.6.26.1045-Debian-64bit.zip
- Execute: apt install libsasl2-modules-gssapi-mit
- Execute: dpkg -i simbaspark_2.6.26.1045-2_amd64.deb
- Execute: apt-get install unixodbc
- Execute: pip install pyodbc
- Edit the /etc/odbc.ini file and replace HOST, PWD (your token), and HTTPPath with your own values.
# Example values used below (replace with your own):
#   host:      dbc-f051dcab-XXXXX.cloud.databricks.com
#   HTTP path: sql/protocolv1/o/597530533501030/0817-214707-pl6aehzt
#   token:     XXXXXXXb653624f279d80142a5b5e86a6b3e
[ODBC Data Sources]
Databricks_Cluster = Simba Spark ODBC Driver

[Databricks_Cluster]
Driver = /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Description = Simba Spark ODBC Driver DSN
HOST = dbc-f051dcab-XXXXX.cloud.databricks.com
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = XXXXXX53624f279d80142a5b5e86a6b3e
ThriftTransport = 2
SSL = 1
HTTPPath = sql/protocolv1/o/597530533501030/0817-214707-pl6aehzt
- Execute: cp /etc/odbc.ini /usr/local/etc/odbc.ini
- Create the file /etc/odbcinst.ini:
[ODBC Drivers]
Simba Spark ODBC Driver = Installed

[Simba Spark ODBC Driver]
Driver = /opt/simba/spark/lib/64/libsparkodbc_sb64.so
- Execute: cp /etc/odbcinst.ini /usr/local/etc/odbcinst.ini
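To check that the DSN actually works, here is a minimal pyodbc sketch. It assumes the [Databricks_Cluster] entry defined above and that your cluster is running.

import pyodbc

# "Databricks_Cluster" is the DSN defined in /etc/odbc.ini above.
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
conn.close()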
Python driver for Ubuntu
I did this on a laptop running the latest Linux Mint, which is based on the latest Ubuntu. Python worked for this.
https://docs.databricks.com/en/dev-tools/python-sql-connector.html
First set up the environment.
- Install the latest Linux Mint (based on the latest Ubuntu). Python 3.9 or later needs to be installed.
- Execute as root: pip install databricks-sql-connector
- Tested a script (a sketch of it appears after this list) and got the error:
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.0.4) or chardet (4.0.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
- Executed: python3 -m pip install --upgrade requests
and got the message:
Not uninstalling requests at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'requests'. No files were found to uninstall.
Successfully installed charset-normalizer-3.2.0 requests-2.31.0
- Ran the test script again and got the message:
DatabricksRetryPolicy is currently bypassed. The CommandType cannot be set.
- You can ignore that message; I still need to figure out how to get rid of it, but it is harmless.
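For reference, the test script was essentially the following: a minimal sketch based on the databricks-sql-connector docs. The hostname, HTTP path, and token are the same masked example values from the earlier steps; fill in your own.

from databricks import sql

# Use your own Server Hostname, HTTP Path, and access token.
connection = sql.connect(
    server_hostname="dbc-f051dcab-XXXXX.cloud.databricks.com",
    http_path="sql/protocolv1/o/597530533501030/0817-214707-pl6aehzt",
    access_token="XXXXXXXb653624f279d80142a5b5e86a6b3e")

cursor = connection.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
cursor.close()
connection.close()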
ODBC driver for Ubuntu
https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html
Installing Python and other software
https://docs.databricks.com/en/dev-tools/python-sql-connector.html
I am using an Ubuntu EC2 server to connect.
- pip install databricks-sql-connector
- In the trial version, you cannot use Unity Catalog or associate a metastore with your workspace. We will have to upload data and query it there.
Load data into your MySQL server on EC2
You could use another source, like RDS MySQL or RDS Aurora, but in this case we are using an EC2 server running MySQL.
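As a rough sketch of what a load looks like from Python (the database, table, and credentials here are made up for illustration, and it assumes the pymysql module is installed):

import pymysql

# Connect to the MySQL server running on this EC2 instance (hypothetical credentials).
conn = pymysql.connect(host="localhost", user="root", password="XXXXXX", database="test")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS sample (id INT PRIMARY KEY, name VARCHAR(50))")
cur.execute("INSERT INTO sample (id, name) VALUES (%s, %s)", (1, "example"))
conn.commit()
conn.close()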