Don’t let Python slither out of your control – how to load data in and out of our relational database with data science languages
A recent study of job listings revealed which skills are most in demand for data scientists. It was no surprise that over 70% of all listings highlighted Python as a key technical skill. So this got me thinking about what this means for you when you’re working with data science languages and your relational database. In this blog, I’ll show you how easy it is for your team to extract data from, or load data into, the database with their chosen programming language – and why some methods are perhaps not quite as ‘optimal’ as others.
How to connect Python to our relational database
There are several ways to read data from our relational database and write it back with the help of Python. Here’s a quick snapshot of the pros and cons of three common connection methods.
Our Python 2 package
We have a Python 2 package, which acts as a wrapper for the PyODBC package. You’re able to read data from the database directly into a Pandas data frame and also write data from a data frame into the database. This package only works with Python 2, and it’s worth noting that with support for Python 2 ending, this package is becoming more and more obsolete. You can find this package at the Exasol GitHub.
Our websocket API package
For Python 3, we offer a websocket API package. This package uses our Websocket API and doesn’t need any driver manager. Unfortunately, you’re not able to load data directly into a Pandas data frame, and the performance isn’t the best for large data sets – but it’s fine when you only have a limited amount of data. It can also be found at the Exasol GitHub.
Badoo pyexasol package
The second fastest option for Python 3 is the pyexasol package from Badoo. The package is now an official fork in the Exasol GitHub repository. It gives you the possibility to write into or out of a Pandas data frame, so it’s very flexible. And it’s available for Python 3. Additionally, there is a handy manual with many examples of how to use it. As this is what I’d call the go-to package, the rest of this blog will focus on its main functionality.
Install pyexasol package
Installing the Python package alone isn’t enough. There are several required and optional dependencies. The following core dependencies are all explained in detail in the manual – and are well worth a read:
- Exasol >= 6
- Python >= 3.6
- websocket-client >= 0.47
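As a quick sanity check before installing anything, you can confirm that your interpreter meets the Python requirement from the list above:

```python
import sys

# pyexasol requires Python 3.6 or newer (see the dependency list above).
assert sys.version_info >= (3, 6), "pyexasol needs Python >= 3.6"
print("Python version OK:", sys.version.split()[0])
```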
It’s also recommended to use the pyexasol package together with Pandas, as it simplifies the handling of huge data sets. Run the following command to install pyexasol and Pandas in your Python environment:
[code language="bash" gutter="false"] pip install pyexasol[pandas] [/code]
Execute SQL statements
I’ll explain the basic functionality of this package with a little example: I’ll create an empty test table with a timestamp column, select the current timestamp, and write it back into this table.
To do this, you first have to create a connection to our database:
[code language="python" gutter="false"] import pyexasol

Con = pyexasol.connect(dsn='<IP address>:<port>', user='<user>', password='<password>', schema='<schema>', compression=True) [/code]
Executing a single query can be done with the execute() function of the connection object. This function returns an ExaStatement object, which can also be used to iterate over the result set of an SQL statement.
[code language="python" gutter="false"] Con.execute("create table sandbox.tbl_timestamp (cur_time timestamp)") [/code]
Reading data from our relational database
It’s really intuitive and easy to read data from our database into a Pandas data frame. The pyexasol package offers the export_to_pandas() function, which returns a Pandas data frame with the result set of the SQL statement passed as a parameter:
[code language="python" gutter="false"] data_frame = Con.export_to_pandas('select systimestamp')
print(data_frame.head()) [/code]
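Since export_to_pandas() hands you an ordinary Pandas data frame, the full Pandas toolbox applies to the result. Here’s a minimal sketch – it uses a locally constructed data frame to stand in for the query result, since the real call assumes a live database connection, and the SYSTIMESTAMP column name and sample value are assumptions for illustration:

```python
import pandas as pd

# Stand-in for the frame returned by Con.export_to_pandas('select systimestamp'):
# a single SYSTIMESTAMP column holding one timestamp row.
data_frame = pd.DataFrame({"SYSTIMESTAMP": [pd.Timestamp("2024-01-15 10:30:00")]})

# Ordinary Pandas operations work on the result.
print(data_frame.head())
print(data_frame["SYSTIMESTAMP"].dt.year.iloc[0])  # → 2024
```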
Write data into our relational database
Writing data back into our relational database is as easy as reading it. You just have to use the import_from_pandas() function and provide the data frame and the target table. Make sure the target table is located in the schema you passed as a parameter when connecting to our database.
[code language="python" gutter="false"] Con.import_from_pandas(data_frame, 'TBL_TIMESTAMP') [/code] After this, you should close the existing connection. [code language="python" gutter="false"] Con.close() [/code]
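Before calling import_from_pandas(), it can be worth checking that your data frame’s layout matches the target table. A minimal sketch, assuming the single-column tbl_timestamp table created above (the CUR_TIME column name is taken from that create statement):

```python
import pandas as pd

# Build a frame matching the one-column layout of sandbox.tbl_timestamp.
data_frame = pd.DataFrame({"CUR_TIME": [pd.Timestamp.now()]})

# Cheap sanity checks before Con.import_from_pandas(data_frame, 'TBL_TIMESTAMP'):
assert list(data_frame.columns) == ["CUR_TIME"]
assert data_frame["CUR_TIME"].dtype.kind == "M"  # 'M' = datetime64 dtype
print("data frame ready for import:", data_frame.shape)
```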
And the winning method is? Go direct into Pandas
As we’ve seen, there are several ways to access our database through Python. But reading and writing data directly via a Pandas data frame is highly recommended, as data frames are easy to handle – and they can be used as input for a TensorFlow model or other machine learning methods.