DataBricks and PySpark
by Mark Nielsen
Copyright August 2023
- Links
- Get a DataBricks 14-day trial
- Saving passwords
- Expect and automation
Links
- https://www.databricks.com/try-databricks?#account
- https://spark.apache.org/docs/latest/api/python/index.html
- https://www.databricks.com/glossary/pyspark
- https://medium.com/analytics-vidhya/beginners-guide-on-databricks-spark-using-python-pyspark-de74d92e4885
- https://accounts.cloud.databricks.com
- https://www.databricks.com/product/aws
- https://www.databricks.com/resources/demos?itm_data=navbar_watchdemos
- https://www.databricks.com/resources/learn/training/lakehouse-fundamentals
- https://customer-academy.databricks.com/learn/catalog
- https://us-west-2.console.aws.amazon.com/cloudformation/home
if you choose us-west-2
- https://docs.databricks.com/en/administration-guide/workspace/quick-start.html
- https://docs.databricks.com/en/archive/dev-tools/cli/stack-cli.html
- https://docs.databricks.com/en/notebooks/index.html
- Troubleshooting : https://dbricks.co/AWSQuickStartHelp
DataBricks 14-day trial
- Have an AWS account.
- Sign up for the 14-day trial: https://www.databricks.com/try-databricks?#account
Make sure you select the 14-day trial.
- Get to the DataBricks dashboard.
Click on Workspaces.
- Create a workspace called "ws1" with the quickstart. Choose Oregon (us-west-2).
- Choose the default name for the stack. Select Trial. Put in a password. Click the checkbox. Click Create.
- Now you are in CloudFormation in AWS, under Stacks, at the stack "s1".
NOTE: on the CloudFormation page, if you delete and recreate a stack, refresh the page.
- What is a workspace?
https://docs.databricks.com/en/getting-started/concepts.html
- What is a stack? A stack is an AWS CloudFormation concept.
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/stacks.html
- The stack in AWS and the workspace in DataBricks should both exist now.
- It may take 5 or 10 minutes to create the workspace and stack.
- Once done, the workspace will show up on your DataBricks page: https://accounts.cloud.databricks.com/workspaces
- When you check, reload the page. If the workspace is not ready, it should say "provisioning".
- You won't be able to assign your workspace to Unity Catalog or a metastore under "Data" in the trial version.
If you have problems deleting and recreating workspaces...
- Try to delete the workspace in DataBricks.
- If that doesn't work, delete the stack in AWS, delete the workspace in DataBricks, and delete the "Credential configuration" and "Storage configuration" ONLY for that workspace. Don't delete any others --- just the ones for that workspace.
To get credentials for the next part
https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html
Create a token.
https://docs.databricks.com/en/dev-tools/auth.html
- In the main DataBricks page, click on "ws1".
- Click on "Open Workspace". It will open a new webpage for ws1.
- On the top right-hand side, click on your account name or email address.
- Click on "User Settings".
- Click on "Access tokens".
- Click on "Generate new token".
- Name the token (a 90-day lifetime is fine) and click on "Generate".
- Copy the generated token.
- NOTE: this token acts as you (as far as I can tell), so beware if you are using an admin account.
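A quick way to sanity-check the token is a small Python sketch against the Databricks REST API. This is my own illustration, not part of the official steps; the hostname and token below are the masked example values used later on this page, so substitute your own.

import requests

host = "dbc-f051dcab-XXXXX.cloud.databricks.com"  # your workspace hostname
token = "XXXXXXXb653624f279d80142a5b5e86a6b3e"    # the token you just generated

# The token is sent as a Bearer credential; HTTP 200 means it works.
r = requests.get("https://" + host + "/api/2.0/clusters/list",
                 headers={"Authorization": "Bearer " + token})
print(r.status_code)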
Get the other credentials -- the server hostname and HTTP path.
https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html
- In the main DataBricks page, click on "ws1".
- Click on "Open Workspace". It will open a new webpage for ws1.
- Click on "Compute".
- Click on the cluster. You should only have one.
- Click "Advanced options" at the bottom of the page; you should see JDBC/ODBC.
- Record the "Server Hostname" and "HTTP Path".
ODBC and Python
I had to do this on my AWS EC2 server because its version of Ubuntu was older and I had things running on it that I didn't want to upgrade.
https://docs.databricks.com/en/dev-tools/pyodbc.html
- Execute: mkdir databricks; cd databricks
- Download the Databricks ODBC driver.
- Execute: unzip SimbaSparkODBC-2.6.26.1045-Debian-64bit.zip
- Execute: apt install libsasl2-modules-gssapi-mit
- Execute: dpkg -i simbaspark_2.6.26.1045-2_amd64.deb
- Execute: apt-get install unixodbc
- Execute: pip install pyodbc
- Edit the /etc/odbc.ini file and replace HOST, PWD (your token), and HTTPPath with your own values.
# Example values used below (replace with your own):
#   host:      dbc-f051dcab-XXXXX.cloud.databricks.com
#   HTTP path: sql/protocolv1/o/597530533501030/0817-214707-pl6aehzt
#   token:     XXXXXXXb653624f279d80142a5b5e86a6b3e
[ODBC Data Sources]
Databricks_Cluster = Simba Spark ODBC Driver

[Databricks_Cluster]
Driver = /opt/simba/spark/lib/64/libsparkodbc_sb64.so
Description = Simba Spark ODBC Driver DSN
HOST = dbc-f051dcab-XXXXX.cloud.databricks.com
PORT = 443
Schema = default
SparkServerType = 3
AuthMech = 3
UID = token
PWD = XXXXXX53624f279d80142a5b5e86a6b3e
ThriftTransport = 2
SSL = 1
HTTPPath = sql/protocolv1/o/597530533501030/0817-214707-pl6aehzt
- Execute: cp /etc/odbc.ini /usr/local/etc/odbc.ini
- Create the file /etc/odbcinst.ini:
[ODBC Drivers]
Simba Spark ODBC Driver = Installed

[Simba Spark ODBC Driver]
Driver = /opt/simba/spark/lib/64/libsparkodbc_sb64.so
- Execute: cp /etc/odbcinst.ini /usr/local/etc/odbcinst.ini
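To check that the DSN actually works, here is a minimal pyodbc sketch. It assumes the [Databricks_Cluster] entry defined above and that your cluster is running.

import pyodbc

# "Databricks_Cluster" is the DSN defined in /etc/odbc.ini above.
conn = pyodbc.connect("DSN=Databricks_Cluster", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
conn.close()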
Python driver for Ubuntu
I did this on a laptop running the latest Linux Mint, which is based on the latest Ubuntu. Python worked for this.
https://docs.databricks.com/en/dev-tools/python-sql-connector.html
First set up the environment.
- Install the latest Linux Mint (based on the latest Ubuntu). Python 3.9 or later needs to be installed.
- Execute as root: pip install databricks-sql-connector
- Tested a script (a sketch of it appears after this list) and got the error:
/usr/lib/python3/dist-packages/requests/__init__.py:87: RequestsDependencyWarning: urllib3 (2.0.4) or chardet (4.0.0) doesn't match a supported version!
warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
- Executed: python3 -m pip install --upgrade requests
and got the message:
Not uninstalling requests at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'requests'. No files were found to uninstall.
Successfully installed charset-normalizer-3.2.0 requests-2.31.0
- Ran the test script again and got the message:
DatabricksRetryPolicy is currently bypassed. The CommandType cannot be set.
- You can ignore that message; I still need to figure out how to get rid of it, but it is harmless.
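For reference, the test script was essentially the following: a minimal sketch based on the databricks-sql-connector docs. The hostname, HTTP path, and token are the same masked example values from the earlier steps; fill in your own.

from databricks import sql

# Use your own Server Hostname, HTTP Path, and access token.
connection = sql.connect(
    server_hostname="dbc-f051dcab-XXXXX.cloud.databricks.com",
    http_path="sql/protocolv1/o/597530533501030/0817-214707-pl6aehzt",
    access_token="XXXXXXXb653624f279d80142a5b5e86a6b3e")

cursor = connection.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchall())
cursor.close()
connection.close()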
ODBC driver for Ubuntu
https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html
Installing Python and other software
https://docs.databricks.com/en/dev-tools/python-sql-connector.html
I am using an Ubuntu EC2 server to connect.
- pip install databricks-sql-connector
- In the trial version, you cannot use Unity Catalog or associate a metastore with your workspace. We will have to upload data and query it there.
Load data into your MySQL server on EC2
You could use another source, like RDS MySQL or RDS Aurora, but in this case we are using an EC2 server running MySQL.
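As a rough sketch of what a load looks like from Python (the database, table, and credentials here are made up for illustration, and it assumes the pymysql module is installed):

import pymysql

# Connect to the MySQL server running on this EC2 instance (hypothetical credentials).
conn = pymysql.connect(host="localhost", user="root", password="XXXXXX", database="test")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS sample (id INT PRIMARY KEY, name VARCHAR(50))")
cur.execute("INSERT INTO sample (id, name) VALUES (%s, %s)", (1, "example"))
conn.commit()
conn.close()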