A Python Data Science Workflow using Remote Servers

A good toolset and workflow can be a big productivity gain.  Unless you’re working exclusively on your personal computer for the whole pipeline, development workflows can quickly get messy.  I recently found a Python IDE that has great support for work on remote servers along with many other nice features:  PyCharm.  I’ll briefly walk through some of the basics that have provided a great relief to me.

First, it’s worth noting that to get remote server support you’ll need the Professional version, not the free Community one.  Currently, a year-long subscription to the Professional version is $89 for an individual, and $199 per person for a business.  However, you get a 30-day free trial, so give it a shot.  The price also goes down for repeat customers.  In my opinion, that’s a bargain when you consider the tedium you’re going to avoid by using a smart tool.

Old Workflows

At work, our project needs data from HDFS and uses tools like Spark for parallel processing.  This means our code lives on an edge node, which we ssh into to run it.  This leaves a few workflow options, like:

  1. Develop locally, push code to a git server, and then pull changes to the remote server to run and test.  This can be a huge pain since you’ll either need to configure Hadoop/Spark/cluster-things on your computer or give up the ability to run anything locally.  If you decide to set up this heavyweight local testing environment then you’ll need to worry about keeping all the versions current with your cluster’s upgrades.  It’s easy to imagine headaches due to code working in one environment and not the other.  It’s also just really slow to push/pull for every little change you want to test.  Yuck.
  2. Develop pseudo-locally by editing remote files on your computer using something like ssh port forwarding.  I did this for a while using the Atom text editor.  Although it wasn’t terrible, there were a few cases where I lost connection with the server without notification.  This meant I was saving changes to a temporary local file and ultimately lost my work when I came back the next day.  Also, there’s a big difference between a text editor and an IDE, despite them both doing keyword highlighting.  More on that later.
  3. Develop directly on the edge node.  This made sure I wouldn’t lose any work with a lost connection, but it’s not always fun to use something like vim in a terminal.  Yeah, yeah, I know there are all those hardcore developers who swear by vim/emacs and I probably just don’t know all they can do, but modern IDEs have definite advantages and softer learning curves.

New Workflow

The PyCharm approach is a combination of all three.  The files you’re editing are on your local machine.  The changes are automatically uploaded to the remote server through ftp/sftp.  If you were to lose connection, you still have the modified files tracked by your local git repo.  But both the terminal and python interpreter that are running within the IDE are actually running on the edge node via ssh.

So, you git clone your repo locally.  When you create a new PyCharm project, you set this local version as your root.  Instead of creating the necessary testing/virtual environment on your computer, you can configure a remote interpreter to just use the virtual environment installed on the remote server.
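
To make the “running on the edge node via ssh” part concrete, here is a rough Python sketch of the kind of command a remote interpreter boils down to.  This is only an illustration, not how PyCharm works internally.  The user, host name, and paths are made up, and shlex.join needs Python 3.8 or newer.

```python
import shlex


def build_remote_run_command(user, host, interpreter, script, args=()):
    """Build an argv list that runs `script` with a remote interpreter over ssh."""
    # The remote shell re-parses the command string, so quote each piece.
    remote_command = shlex.join([interpreter, script, *args])
    return ["ssh", f"{user}@{host}", remote_command]


if __name__ == "__main__":
    # Hypothetical user, host, and paths, purely for illustration.
    command = build_remote_run_command(
        "me",
        "edge-node",
        "/home/me/projects/myproject/venv/bin/python",
        "/home/me/projects/myproject/etl/job.py",
        ["--date", "2020-01-01"],
    )
    print(command)
```

The returned list could be handed to subprocess.run if you wanted to launch a script this way by hand; the IDE saves you from ever needing to.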

Configure Remote Interpreter

  • Preferences -> Expand your “Project: xyz” -> Project Interpreter
  • Click the gears symbol and then “Add Remote”
  • There are multiple options, but I configured mine via ssh.  Just enter the relevant credentials; if you have configured passwordless ssh login, specify where to find your private key.
  • Provide the “Python Interpreter Path”, which should point to the python executable within your remote project’s virtual environment.
    • This will ensure you have all the packages you need when running python in the IDE without needing to install them locally
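
One quick way to check that the remote interpreter is really the one in use is to run a small script like the sketch below from the IDE and look at what it reports.  The function name is my own; it only uses the standard library.  If the setup worked, the hostname should be your edge node and the executable should be the path you entered above.

```python
import socket
import sys


def describe_interpreter():
    """Return basic facts about the interpreter running this code."""
    return {
        "hostname": socket.gethostname(),
        "executable": sys.executable,
        "prefix": sys.prefix,
        # True inside a venv, where sys.prefix differs from sys.base_prefix.
        "in_virtualenv": sys.prefix != getattr(sys, "base_prefix", sys.prefix),
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```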

Configure Automatic Remote Deployment

Next you’ll want to configure the IDE to automatically upload local changes to the remote server.  This removes the need to push/pull via an intermediary git server just to test a change, since the local and remote copies stay in sync.  You can then push to the git server from either location.  However, due to the great version control support within PyCharm, you’ll likely handle it locally within the IDE as well.

  • Click through Tools -> Deployment -> Configuration
  • On the Connection tab, set up the FTP/SFTP connection.  This is likely very similar to your remote interpreter setup.
    • For Root path I needed to specify the path to my login’s home directory on the server.  This only matters insofar as you’re consistent with the Mappings tab.
  • On the Mappings tab, tell the IDE how to map local to remote files.  Your Local path will point to your project’s local directory and your Deployment path will be relative to the root path you specified in the Connection tab.
  • To automatically deploy changes, click Tools -> Deployment -> Automatic Upload
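
The relationship between Root path, Deployment path, and Local path can be sketched in a few lines of Python.  This is my own illustration of the mapping described above, not PyCharm code.  It assumes POSIX-style paths on both ends, and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def remote_path_for(local_file, local_root, server_root, deployment_path):
    """Map a local project file to the remote path it would be uploaded to.

    Raises ValueError if local_file is not inside local_root.
    """
    relative = PurePosixPath(local_file).relative_to(PurePosixPath(local_root))
    # The deployment path is interpreted relative to the server root path.
    return PurePosixPath(server_root) / deployment_path.lstrip("/") / relative


if __name__ == "__main__":
    # Hypothetical local and remote locations, purely for illustration.
    print(
        remote_path_for(
            "/Users/me/code/myproject/etl/job.py",
            "/Users/me/code/myproject",
            "/home/me",
            "/projects/myproject",
        )
    )
```

With those made-up values, the file lands at /home/me/projects/myproject/etl/job.py, which is why the Root path and the Mappings tab need to agree with each other.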

View/Edit Remote Files Directly

For normal development work you should view and edit your local version of the files, since they will automatically sync with the remote server.  However, sometimes you might need to view remote files directly, e.g. program output.  In that case, simply click Tools -> Deployment -> Browse Remote Host.

General Feature Highlights

Here are a few other features I’ve run into that have been really helpful.

  • Start a terminal on your remote machine within the IDE by clicking Tools -> Start SSH Session
  • Open the remote Python interpreter within the IDE by clicking the Python Console button at the bottom
  • If you’re using Django or another framework, get deep integration with the tool by going to Preferences -> Languages & Frameworks.
    • In the case of Django you can run a manage.py session that autocompletes the available commands (e.g. runserver, makemigrations).
  • Configure a database, enabling built-in support for executing SQL queries or viewing schema, by clicking View -> Tool Windows -> Database
  • Get code quality inspections by right-clicking a file in the project pane on the left and choosing Inspect Code…
  • Automatically navigate to the code that defines a class by pressing Cmd+O (macOS) or Ctrl+N (Windows/Linux, default keymap) and typing in the name of the class
  • Use either the Version Control button at bottom left or Git tools on bottom right to manage your commits and branches
  • Explore the Code/Refactor options to intelligently modify your code (e.g. highlighting a code block and having the IDE automatically refactor it into a class method)
  • Enjoy all the code completion and auto-navigation that a text editor won’t give you

These are just the cool, productive features I’ve uncovered in a few days of playing with PyCharm, but the seamless integration with common frameworks, databases, and remote servers already makes it well worth the modest expense.
