SSH for Data Scientists
June 19, 2023 · 5 min read
SSH is an important, yet commonly ignored, skill for Data Scientists. Most programs/courses won't touch on the topic, so here's a brief article to give you some foundational SSH skills.
Data Science is a vast and cumbersome field with a plethora of skills for practitioners to master. Like many others in this field, I quickly found out that I was missing some important skills when I first started my career, and one of those was SSH. I picked it up gradually early in my career as I worked with Amazon SageMaker instances for training my models, along with other servers that I needed access to for my work.
Now as an MLOps Engineer, I support our Data Science & Machine Learning team, helping our Data Scientists and Machine Learning Engineers wherever I can. Over the last few years I have often found myself helping others learn about SSH (as I once had to) and decided to write an article for my peers in this community. I hope this helps if you are new to the world of SSH!
What is SSH?
SSH stands for "Secure Shell". It's a network protocol that allows secure remote access to a computer over any network (so long as you have the necessary network access and permissions). It offers Data Scientists the ability to essentially log into a remote system (however large or small) and execute commands as if they were physically present at the machine.
For example, you can use an SSH connection to remotely operate a large server in the cloud (see How I Learned AWS) from your computer, regardless of whether that server is right next to you or across the world. This is immensely helpful for Data Scientists who need access to large amounts of computational resources (e.g., RAM, CPU, storage) to perform their work. By using SSH to operate such hardware, you can train large models, store model artifacts, automate systems to regularly retrain models, take advantage of the server's much faster network connection, and so much more! Access to such resources is crucial and can all be facilitated through SSH.
Getting Started with SSH
First things first, you will need an SSH key pair. This is made up of two complementary files, one called a "public key" and another called a "private key". You can generate a key pair using a terminal window with the following UNIX/LINUX command:
ssh-keygen -t ed25519 -f ~/.ssh/<NEW_KEY_NAME>
This will create your new SSH key files using Ed25519, a modern digital signature algorithm, and will add them to your computer's ~/.ssh directory. You will be asked for an optional passphrase, which encrypts the private key on disk. The two files will be named with the <NEW_KEY_NAME> you provided, with the public key ending in a .pub file extension.
After creating your SSH key pair, run the following command. It makes the private key file read-only for you and inaccessible to everyone else; SSH will refuse to use a private key that other users can read.
chmod 400 ~/.ssh/<NEW_KEY_NAME>
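To try the two commands above without touching your real keys, here is a minimal sketch that generates a throwaway key pair in a temporary directory. The file name demo_key and the comment text are placeholders of my own, and -N "" sets an empty passphrase only so the script runs unattended; for a real key you would normally let ssh-keygen prompt you for one.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Work in a scratch directory so nothing in ~/.ssh is touched.
# (demo_key is a hypothetical name used only for this sketch.)
workdir="$(mktemp -d)"
key="$workdir/demo_key"

# -t ed25519: key type, -f: output path, -C: free-text comment,
# -N "": empty passphrase (demo only), -q: suppress progress output.
ssh-keygen -q -t ed25519 -f "$key" -N "" -C "demo key"

# Restrict the private key so that only its owner can read it.
chmod 400 "$key"

# List both files with their permissions.
ls -l "$workdir"

# Print the key's fingerprint, a short identifier for the key.
ssh-keygen -l -f "$key.pub"
```

The listing should show the private key as read-only for its owner and the .pub file next to it.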
How to use SSH?
With an SSH key pair, you can authenticate your local computer with remote servers. For example, you can use SSH to work with code repositories hosted on GitHub. To clone and work with those repositories over SSH, you will need to add your public key to your GitHub profile. Doing this lets the GitHub servers verify that your local computer holds the matching private key, which authorizes your git pull/push commands.
As a data scientist, chances are you'll end up working with a remote server to utilize its hardware. Just like GitHub, you'll need to have your public key added to the server, specifically to the ~/.ssh/authorized_keys file of the account you will log in as. After that is done, you can run the following command to connect, specifying the appropriate user name and IP/DNS address of the server.
ssh -i ~/.ssh/<NEW_KEY_NAME> <USER_NAME>@<SERVER>
This will place you inside the server as the user you specified in the command. From there you can issue any UNIX/LINUX commands within the terminal session like you would on any other computer (except Windows... blah!).
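Adding your public key to the server comes down to appending one line to ~/.ssh/authorized_keys in the remote account (the ssh-copy-id utility, where available, automates this). Below is a rough sketch of what happens on the server side, simulated in a temporary directory so it can run anywhere. The install_pubkey function, the demo_key name, and the server_home directory are hypothetical, made up for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Append a public key to an authorized_keys file, skipping duplicates.
# Usage: install_pubkey <public_key_file> <ssh_dir>
install_pubkey() {
  local pubkey_file="$1" ssh_dir="$2" key_line
  key_line="$(cat "$pubkey_file")"
  mkdir -p "$ssh_dir"
  chmod 700 "$ssh_dir"                     # conventional permissions for ~/.ssh
  touch "$ssh_dir/authorized_keys"
  chmod 600 "$ssh_dir/authorized_keys"
  # -x: match the whole line, -F: treat the key as a fixed string
  if ! grep -qxF -- "$key_line" "$ssh_dir/authorized_keys"; then
    printf '%s\n' "$key_line" >> "$ssh_dir/authorized_keys"
  fi
}

# Throwaway key pair for the demo (empty passphrase only so it runs unattended).
workdir="$(mktemp -d)"
ssh-keygen -q -t ed25519 -f "$workdir/demo_key" -N "" -C "demo key"

# This directory stands in for ~/.ssh on the remote server.
server_ssh_dir="$workdir/server_home/.ssh"
install_pubkey "$workdir/demo_key.pub" "$server_ssh_dir"
install_pubkey "$workdir/demo_key.pub" "$server_ssh_dir"   # second call adds nothing

# Count the authorized keys.
wc -l < "$server_ssh_dir/authorized_keys"
```

On a real server you (or an administrator) would do the same against the actual ~/.ssh directory of the account, using the contents of your .pub file.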
Now I recognize that working with just a command line isn't ideal, particularly for data science. Because of this, there are many ways to use your favorite IDE on your laptop while still utilizing the super powers of a remote server's hardware. For example, VS Code has amazing support for connecting to a remote server over SSH. Using the Remote - SSH extension will give you the ability to develop on any remote server with your VS Code application on your local computer as if you were at the server yourself.
What's in the ~/.ssh Directory?
After having worked with the above SSH commands, you'll notice that your private and public keys might not be the only files that exist in your computer's ~/.ssh directory. You'll most likely come across three supporting files: the known_hosts file, the authorized_keys file, and a config file.
known_hosts: The known_hosts file is used to store the public keys of remote hosts that you have connected to in the past. When you connect to a remote server using SSH, it presents its public key, and your SSH client checks if this key is already present in the known_hosts file. If the key matches, the connection proceeds without any warnings. If the key is not found or doesn't match, your SSH client may display a warning to alert you that the remote host's identity has changed or is unknown. You can manually edit the known_hosts file to remove or update entries.
authorized_keys: The authorized_keys file is found on the remote server and contains a list of public keys that are authorized to connect to a user account on that server. Each line in this file typically represents a single public key. When you attempt to connect, your SSH client offers your public key to the server, and the server checks whether it matches any key in authorized_keys. If it does, your client then proves that it holds the matching private key by producing a cryptographic signature with it; the private key itself is never sent. Once that check passes, you are granted access without being prompted for the account's password. This file is crucial for key-based authentication, which is more secure and convenient than using passwords for SSH connections.
config: The config file allows you to customize and configure SSH client behavior on your local machine. It provides a way to define options and parameters for SSH connections. You can specify various settings such as hostname aliases, port numbers, user configurations, key-based authentication, and more. The config file helps simplify SSH connections by allowing you to define commonly used settings and aliases, eliminating the need to remember and type them every time you connect to a remote host.
These files collectively contribute to the security and convenience of SSH connections. The known_hosts file ensures that you are aware of the remote hosts you have previously connected to, while the authorized_keys file enables secure access to remote servers without passwords. The config file enhances the usability of SSH by providing a way to define custom settings and aliases for remote hosts.
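You do not have to edit known_hosts by hand: ssh-keygen can look up (-F) and remove (-R) entries for you. Here is a small sketch against a throwaway file, assuming a made-up host named server.example.com whose "host key" is generated locally just for the demo.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"
known_hosts="$workdir/known_hosts"

# Generate a stand-in host key; a real one would be presented by the server.
ssh-keygen -q -t ed25519 -f "$workdir/host_key" -N "" -C "demo host key"

# A basic known_hosts entry is: <hostname> <key type> <base64 key>
printf 'server.example.com %s\n' "$(cut -d' ' -f1,2 "$workdir/host_key.pub")" > "$known_hosts"

# -F looks a host up and exits with status 0 when an entry exists.
if ssh-keygen -F server.example.com -f "$known_hosts" > /dev/null; then
  echo "before: server.example.com is known"
fi

# -R removes every entry for the host (the old file is kept as known_hosts.old).
ssh-keygen -R server.example.com -f "$known_hosts" > /dev/null 2>&1

if ! ssh-keygen -F server.example.com -f "$known_hosts" > /dev/null; then
  echo "after: server.example.com is no longer known"
fi
```

Against your real file you would drop the -f argument, since ssh-keygen defaults to ~/.ssh/known_hosts. Removing an entry is the usual remedy when a server has been legitimately rebuilt and its host key has changed.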
Tips & Tricks When Using SSH
Now that you have some foundational knowledge about SSH, here are some final tips that are helpful to know and be aware of when using SSH:
#1 Create SSH command shortcuts
If you want to simplify your SSH workflow, you can utilize the ~/.ssh/config file to set up shortcuts and default configurations for your SSH connections. By doing this, you can establish connections using simplified commands, without specifying every detail each time.
For example, you can add a block of code for each of your servers in your ~/.ssh/config file like so:
Host myserver
    HostName server.example.com
    User myusername
    IdentityFile ~/.ssh/myprivatekey
With this configuration, you can simply type ssh myserver in your terminal to connect to your server.
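If you want to confirm what a Host block resolves to before connecting, ssh -G prints the final settings for an alias without opening a connection. Here is a quick sketch that writes the example above to a temporary file and passes it with -F, so your real ~/.ssh/config is not touched. It assumes an OpenSSH client recent enough to support -G.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write the example Host block to a scratch config file.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
Host myserver
    HostName server.example.com
    User myusername
    IdentityFile ~/.ssh/myprivatekey
EOF
chmod 600 "$cfg"   # ssh rejects a user config file that others can write to

# -F: read this config file instead of ~/.ssh/config
# -G: print the resolved options for the alias and exit without connecting
resolved="$(ssh -F "$cfg" -G myserver)"

# Show only the options that the Host block set.
grep -E '^(hostname|user|identityfile) ' <<< "$resolved"
```

The same check works against your real configuration with a plain ssh -G myserver.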
#2 Utilize SCP for file transfers
SCP (Secure Copy) is a command-line utility that works over SSH and enables you to securely transfer files between your local machine and remote servers. It provides a straightforward and secure method for moving files back and forth with remote servers, eliminating the need for additional tools or manual file transfer methods. For example, say you have a file called my_file.csv in a server's root directory that you want to copy into the data directory under your local home directory (~/data, which should already exist). You can use the following command to do so with SCP!
scp -i <PRIVATE_KEY> <USER_NAME>@<SERVER>:/my_file.csv ~/data
Alternatively, you can do the same thing but in the reverse direction, uploading a file from your computer onto a remote server.
scp -i <PRIVATE_KEY> ~/my_file.csv <USER_NAME>@<SERVER>:/path/to/destination/directory
#3 Leverage port-forwarding for remote UI access
SSH port-forwarding, also known as "SSH tunneling", allows you to securely access and interact with graphical user interfaces (GUIs or UIs) of various tools and applications running on remote servers. For example, if you have a Jupyter Notebook running on a remote server, you can use SSH port-forwarding to access the familiar user interfaces through a local browser window. This allows you to utilize the full functionality of these tools while they run on the remote server, providing a seamless experience.
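The Jupyter service runs on port 8888 by default. With this, you can run the following command to create a port-forward connection. The -L argument reads local_port:destination_host:destination_port, where the destination is resolved from the server's side, so localhost here means the server itself. Once that is done, you can open a browser window and type localhost:8888 in the address bar.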
ssh -i <PRIVATE_KEY> -L 8888:localhost:8888 <USER_NAME>@<SERVER>
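If you forward the same port often, the -L flag has a config-file counterpart, LocalForward, which can live in a Host block. Here is a sketch using a hypothetical alias, myserver-jupyter, with placeholder host and user names. It checks the result with ssh -G, which prints the resolved settings without connecting, and it uses a temporary file rather than your real ~/.ssh/config.

```shell
#!/usr/bin/env bash
set -euo pipefail

# LocalForward 8888 localhost:8888 has the same effect as -L 8888:localhost:8888.
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
Host myserver-jupyter
    HostName server.example.com
    User myusername
    LocalForward 8888 localhost:8888
EOF
chmod 600 "$cfg"   # ssh rejects a user config file that others can write to

# Print the resolved options and keep only the forwarding rule.
forward="$(ssh -F "$cfg" -G myserver-jupyter | grep -i '^localforward ' || true)"
echo "$forward"
```

With a block like this in your real config, ssh -N myserver-jupyter would open the tunnel without starting a remote shell, which is handy when all you want is the browser access.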
#4 Be cautious when handling SSH keys
When transferring or storing SSH keys, avoid using insecure communication channels such as Slack, email, or unencrypted messaging services. Instead, use secure methods like encrypted file transfers or secure file sharing/storage platforms, like 1Password. The private key is the most sensitive part of your SSH key pair. Keep it confidential and, ideally, never share it at all; the public key is the only half that other people and servers need.
Copying and pasting the contents of your SSH keys, particularly the private key, can be finicky. You can use the following commands to copy the contents of your files to your computer's clipboard, avoiding manual error.
Using pbcopy (on macOS):
cat ~/.ssh/<PRIVATE_KEY> | pbcopy
Using xclip (on Linux — requires an install of the xclip package):
cat ~/.ssh/<PRIVATE_KEY> | xclip -selection clipboard
Then paste it!