Interacting with AWS S3 using Python in a Jupyter notebook

It has been a long time since I last posted anything. I must admit that it is only partly because I’m busy trying to finish my PhD in my spare time. Sometimes I’ve also felt a bit too lazy to use up what little time I have left over to write a post. But it has been too long now!

Lately at my job I’ve been working a lot with Amazon Web Services’ (AWS) Simple Storage Service (S3), which provides cloud-based file storage. I have also been meaning to dive more into using Jupyter notebooks, which are very useful in data science. I decided to create the content for this post, which focuses on setting up AWS and using S3, in a Jupyter notebook, which I then converted to HTML and uploaded to my blog. Originally I started to write this post using Colaboratory, which is an online Jupyter extension by Google. However, once I got to the point of accessing S3 via the Python SDK, I realized that I would need to somehow provide my credentials. Given that Colaboratory is still under development, I’m not confident that I can securely connect to S3 there, so I switched back to the original Jupyter notebook. You can find a fairly in-depth description of what Jupyter notebooks are and how to use them here. An important component of making notebooks is writing descriptions in markdown, for which I found this cheatsheet to be quite helpful.

This notebook is available on GitHub.

Setting up AWS

In order to interact with AWS I first of all need my own instance of AWS. AWS has a free tier of services. This means that you can use the services for free, up to certain monthly limits. However, once those limits are surpassed, you will be charged for usage. Because of this, AWS requires that you provide payment information upon signing up.

I must be honest that I am not completely comfortable having to provide my payment information up front. I would much rather that my AWS resources simply become frozen when I’ve reached my monthly limit, at least while I am just learning about AWS. But, I do want to experiment, so I ended up providing my payment information. I set up the billing alarm, which will notify me when I reach the free limit. As far as I know, there is no way to automatically freeze resources that will incur charges, so I guess for now I just need to be extra careful.

Working with the S3 web interface

To start using S3 I used the web interface to set it up and load a sample file, following these directions. I followed the directions exactly, which was straightforward. While working with this interface is nice, what is perhaps more interesting for programmers is the command line interface (CLI) and programmatic access (I will focus on Python).

Working with S3 via the CLI and Python SDK

Before it is possible to work with S3 programmatically, it is necessary to set up an AWS IAM User. This guide shows how to do that, plus other steps necessary to install and configure AWS. To work with the Python SDK, it is also necessary to install boto3 (which I did with the command pip install boto3). Below I will demonstrate the SDK, along with the equivalent commands in the CLI. First, however, we need to import boto3 and initialize an S3 object.

In [26]:
import boto3, os

s3 = boto3.resource('s3')

Note: You can also do s3 = boto3.client('s3'), but some functionality won’t be possible (like s3.Bucket()).

Creating a bucket

Bucket names need to be globally unique, meaning that no two buckets can have the same name, not even when they are owned by different users.

With the CLI

In [27]:
! aws s3 mb s3://demo-bucket-cdl
make_bucket: demo-bucket-cdl

With the SDK

In [28]:
s3.create_bucket(Bucket='demo-bucket-cdl2')
Out[28]:
s3.Bucket(name='demo-bucket-cdl2')
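
One caveat, as far as I know: calling create_bucket with only a bucket name works when the configured region is us-east-1, but for other regions S3 expects a CreateBucketConfiguration with a LocationConstraint. Below is a minimal sketch of how the arguments could be built. The helper name create_bucket_kwargs and the region eu-west-1 are my own choices for illustration, not something from the notebook above.

```python
def create_bucket_kwargs(bucket_name, region=None):
    """Build keyword arguments for create_bucket().

    us-east-1 is the default location and should not be passed as a
    LocationConstraint; other regions have to be named explicitly.
    """
    kwargs = {"Bucket": bucket_name}
    if region and region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# Usage (needs credentials, so it is left commented out here):
# import boto3
# s3 = boto3.resource("s3", region_name="eu-west-1")
# s3.create_bucket(**create_bucket_kwargs("demo-bucket-cdl2", "eu-west-1"))
```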

Upload file

I first made a small test file with this command:
echo test file > test.txt

With the CLI

In [29]:
! aws s3 cp test.txt s3://demo-bucket-cdl/
upload: ./test.txt to s3://demo-bucket-cdl/test.txt              

The command above copies a file, but you can also move files using mv. I haven’t explicitly included the filename on the S3 end, which results in the file having the same name as the original file. You can also explicitly tell S3 what the file name should be, including subfolders, without creating the subfolders first (in fact, subfolders do not exist on S3 in the way that they do in other file systems).

With the SDK

Let’s upload the file twice, once in a subdirectory.

In [30]:
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl2', 'test2.txt')
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl2', 'subdir/test3.txt')

Note that if s3 was a client instead of a resource, the command becomes s3.upload_file('test.txt', 'demo-bucket-cdl2', 'test2.txt')
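
upload_file expects a file on disk. If the content only exists in memory, the Object resource has a put method that takes the bytes directly. Here is a small sketch; the helper name put_text and the key notes/hello.txt are hypothetical, and s3_resource is assumed to be the boto3.resource('s3') object created earlier.

```python
def put_text(s3_resource, bucket_name, key, text):
    """Upload a string as an S3 object without writing a local file first."""
    obj = s3_resource.Object(bucket_name, key)
    # put() sends the bytes as the object body and returns the response dict
    return obj.put(Body=text.encode("utf-8"))


# Usage (needs credentials, so it is left commented out here):
# put_text(s3, "demo-bucket-cdl2", "notes/hello.txt", "test file\n")
```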

Deleting objects

With the CLI

In [31]:
! aws s3 rm s3://demo-bucket-cdl/test.txt
delete: s3://demo-bucket-cdl/test.txt

With the SDK

We can delete the object via the client.

In [32]:
s3.meta.client.delete_object(Bucket="demo-bucket-cdl2", Key="test2.txt")
Out[32]:
{'ResponseMetadata': {'HTTPHeaders': {'date': 'Wed, 22 Nov 2017 19:54:40 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': 'c/OUm5MoDWUZgVmdf3ojOdTfE725yJfQ0Fx4Ye74vTQJ+7fCVKwQIPqweIqgHw6al9Wc9+N77gc=',
   'x-amz-request-id': '0B084EEA55655C6B'},
  'HTTPStatusCode': 204,
  'HostId': 'c/OUm5MoDWUZgVmdf3ojOdTfE725yJfQ0Fx4Ye74vTQJ+7fCVKwQIPqweIqgHw6al9Wc9+N77gc=',
  'RequestId': '0B084EEA55655C6B',
  'RetryAttempts': 0}}

But we can also delete an object through an Object resource that refers to it.

In [33]:
obj = s3.Object("demo-bucket-cdl2", "subdir/test3.txt")
obj.delete()
Out[33]:
{'ResponseMetadata': {'HTTPHeaders': {'date': 'Wed, 22 Nov 2017 19:54:40 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': '7KX1QmMEiwxwCC3D/FlydH3hhF3AM+nCyy8vj6LrlSQN8Fs9GL1kmDCKAtJ45av/l+rKr0UqMJ0=',
   'x-amz-request-id': '188479D5427A17D6'},
  'HTTPStatusCode': 204,
  'HostId': '7KX1QmMEiwxwCC3D/FlydH3hhF3AM+nCyy8vj6LrlSQN8Fs9GL1kmDCKAtJ45av/l+rKr0UqMJ0=',
  'RequestId': '188479D5427A17D6',
  'RetryAttempts': 0}}
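
Deleting objects one at a time costs one request per object. The client also has a delete_objects call that accepts up to 1000 keys per request. The sketch below only builds the request payloads, so it can be tried without touching a bucket; the helper name delete_batches is my own, and the usage lines assume the s3 resource from above.

```python
def delete_batches(keys, batch_size=1000):
    """Yield Delete payloads for client.delete_objects().

    S3 accepts at most 1000 keys per delete_objects request, so the keys
    are split into chunks of batch_size.
    """
    keys = list(keys)
    for start in range(0, len(keys), batch_size):
        chunk = keys[start:start + batch_size]
        # Quiet=True asks S3 to report only the keys that failed
        yield {"Objects": [{"Key": key} for key in chunk], "Quiet": True}


# Usage (needs credentials, so it is left commented out here):
# for payload in delete_batches(["test2.txt", "subdir/test3.txt"]):
#     s3.meta.client.delete_objects(Bucket="demo-bucket-cdl2", Delete=payload)
```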

But now all of the objects have been deleted, so let’s create a few more.

In [34]:
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl', 'test.txt')
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl', 'subdir/test.txt')
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl2', 'test2.txt')
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl2', 'subdir/test2.txt')
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl2', 'test3.txt')
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl2', 'subdir/subir/test3.txt')
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl2', 'subidr2/test3.txt')
s3.meta.client.upload_file('test.txt', 'demo-bucket-cdl2', 'subdir2/subdir/subdir/test2.txt')

Moving and copying objects

With the CLI

In [35]:
! aws s3 mv s3://demo-bucket-cdl/test.txt s3://demo-bucket-cdl/moved.txt
move: s3://demo-bucket-cdl/test.txt to s3://demo-bucket-cdl/moved.txt
In [36]:
! aws s3 cp s3://demo-bucket-cdl/subdir/test.txt s3://demo-bucket-cdl/test.txt
copy: s3://demo-bucket-cdl/subdir/test.txt to s3://demo-bucket-cdl/test.txt

With the SDK

boto3 doesn’t appear to have a move function, but a move can easily be accomplished by first copying the file and then deleting the original.

In [37]:
s3.Object('demo-bucket-cdl2','moved2.txt').copy_from(CopySource='demo-bucket-cdl2/test2.txt')
s3.Object('demo-bucket-cdl2','test2.txt').delete()
Out[37]:
{'ResponseMetadata': {'HTTPHeaders': {'date': 'Wed, 22 Nov 2017 19:54:47 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': 'SSoeJd5OO9oLbglRYfgQjbGOZ+9VpKYt/mH2KZyOsqY8IKG4Q6fMF6KGGi6Qz/w4bXXjY/Oms4s=',
   'x-amz-request-id': '352FE9BEB8F61026'},
  'HTTPStatusCode': 204,
  'HostId': 'SSoeJd5OO9oLbglRYfgQjbGOZ+9VpKYt/mH2KZyOsqY8IKG4Q6fMF6KGGi6Qz/w4bXXjY/Oms4s=',
  'RequestId': '352FE9BEB8F61026',
  'RetryAttempts': 0}}
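
The copy-then-delete pattern can be wrapped in a small helper. This is only a sketch: move_object is a name I made up, it handles a single bucket, and as far as I know a single copy request like this is limited to objects of up to 5 GB. It uses the dictionary form of CopySource, which avoids problems with special characters in keys.

```python
def move_object(s3_resource, bucket_name, src_key, dst_key):
    """Copy src_key to dst_key within one bucket, then delete the original."""
    if src_key == dst_key:
        return  # nothing to move, and deleting would lose the object
    source = {"Bucket": bucket_name, "Key": src_key}
    s3_resource.Object(bucket_name, dst_key).copy_from(CopySource=source)
    # Only reached if the copy did not raise, so the original is never
    # deleted without a copy being in place.
    s3_resource.Object(bucket_name, src_key).delete()


# Usage (needs credentials, so it is left commented out here):
# move_object(s3, "demo-bucket-cdl2", "test2.txt", "moved2.txt")
```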

Listing buckets

With the CLI

The following command can be used to see which buckets you have access to.

In [38]:
! aws s3 ls
2017-11-22 20:54:35 demo-bucket-cdl
2017-11-22 20:54:36 demo-bucket-cdl2

This command recursively shows the files in the specified bucket. Without the recursive option, the file inside subdir would not be listed individually; only the subdir/ prefix would be shown.

In [39]:
! aws s3 ls demo-bucket-cdl/ --recursive
2017-11-22 20:54:44         10 moved.txt
2017-11-22 20:54:40         10 subdir/test.txt
2017-11-22 20:54:46         10 test.txt

I wanted to see if I could recursively see all objects in all buckets with just one command. Apparently not:

In [40]:
! aws s3 ls --recursive
2017-11-22 20:54:35 demo-bucket-cdl
2017-11-22 20:54:36 demo-bucket-cdl2

With the SDK

First, to see which buckets are available to you:

In [41]:
for bucket in s3.buckets.all():
    print(bucket.name)
demo-bucket-cdl
demo-bucket-cdl2

The following is the equivalent of the CLI ‘ls’ command with ‘--recursive’. As you can see, it is iterative rather than recursive. This is because objects in S3 aren’t stored in a directory structure. Each object belongs to a bucket, and has a key which identifies it. When the bucket name and object key are combined you get something that looks like a file path.

In [42]:
for obj in s3.Bucket(name='demo-bucket-cdl2').objects.all():
    print(os.path.join(obj.bucket_name, obj.key))
demo-bucket-cdl2/moved2.txt
demo-bucket-cdl2/subdir/subir/test3.txt
demo-bucket-cdl2/subdir/test2.txt
demo-bucket-cdl2/subdir2/subdir/subdir/test2.txt
demo-bucket-cdl2/subidr2/test3.txt
demo-bucket-cdl2/test3.txt
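
To make the "keys are flat strings" point concrete, here is a small sketch that derives the apparent top-level folders from the keys listed above, without talking to S3 at all. The function name top_level_prefixes is hypothetical. It also uses posixpath.join instead of os.path.join, since os.path.join would insert backslashes on Windows.

```python
import posixpath


def top_level_prefixes(keys):
    """Return the sorted first path segments that look like folders."""
    prefixes = set()
    for key in keys:
        head, sep, _ = key.partition("/")
        if sep:  # the key contains a slash, so `head` acts like a folder
            prefixes.add(head + "/")
    return sorted(prefixes)


keys = [
    "moved2.txt",
    "subdir/subir/test3.txt",
    "subdir/test2.txt",
    "subdir2/subdir/subdir/test2.txt",
    "subidr2/test3.txt",
    "test3.txt",
]
print(top_level_prefixes(keys))
# posixpath.join always uses "/", whatever the local operating system is
print(posixpath.join("demo-bucket-cdl2", keys[0]))
```

The "folders" only exist because some keys happen to contain a slash; nothing had to be created for them.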

To show all objects of all buckets:

In [43]:
for bucket in s3.buckets.all():
    for obj in bucket.objects.all():
        print(os.path.join(obj.bucket_name, obj.key))
demo-bucket-cdl/moved.txt
demo-bucket-cdl/subdir/test.txt
demo-bucket-cdl/test.txt
demo-bucket-cdl2/moved2.txt
demo-bucket-cdl2/subdir/subir/test3.txt
demo-bucket-cdl2/subdir/test2.txt
demo-bucket-cdl2/subdir2/subdir/subdir/test2.txt
demo-bucket-cdl2/subidr2/test3.txt
demo-bucket-cdl2/test3.txt

Recursively and selectively move, copy, and delete files

With the CLI

We can recursively move/copy all files in a given bucket to another folder.

In [44]:
! aws s3 cp s3://demo-bucket-cdl/ s3://demo-bucket-cdl/backup/ --recursive
copy: s3://demo-bucket-cdl/subdir/test.txt to s3://demo-bucket-cdl/backup/subdir/test.txt
copy: s3://demo-bucket-cdl/moved.txt to s3://demo-bucket-cdl/backup/moved.txt
copy: s3://demo-bucket-cdl/test.txt to s3://demo-bucket-cdl/backup/test.txt

We can also move/copy a subset of files, but to do so we need to use include and exclude parameters. Let’s use dryrun to see what will happen without making any changes yet.

(Note in the commands below that the * needs to be escaped in an IPython notebook in order for it to be interpreted properly in the shell command.)

In [45]:
! aws s3 mv s3://demo-bucket-cdl/ s3://demo-bucket-cdl/moved/ --include \*test\*.txt --exclude backup/* --recursive --dryrun
(dryrun) move: s3://demo-bucket-cdl/moved.txt to s3://demo-bucket-cdl/moved/moved.txt
(dryrun) move: s3://demo-bucket-cdl/subdir/test.txt to s3://demo-bucket-cdl/moved/subdir/test.txt
(dryrun) move: s3://demo-bucket-cdl/test.txt to s3://demo-bucket-cdl/moved/test.txt

The previous command did not work as expected (i.e. it should not have moved the moved.txt file). That’s because include and exclude are applied sequentially, and the starting state includes all files in s3://demo-bucket-cdl/. In this case, all six files that are in demo-bucket-cdl were already included, so the include parameter effectively did nothing and the exclude only excluded the backup folder.

Let’s try again, first excluding all files.

In [46]:
! aws s3 mv s3://demo-bucket-cdl/ s3://demo-bucket-cdl/moved/ --exclude \* --include \*test\*.txt --exclude backup/* --recursive --dryrun
(dryrun) move: s3://demo-bucket-cdl/subdir/test.txt to s3://demo-bucket-cdl/moved/subdir/test.txt
(dryrun) move: s3://demo-bucket-cdl/test.txt to s3://demo-bucket-cdl/moved/test.txt

The same principles apply for the delete command as well.
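
The sequential behaviour of the filters can be imitated in a few lines of Python. This is only an approximation of what the CLI does, under the assumption that its patterns behave like fnmatch globs; the function name apply_filters is my own. Every key starts out included, and the last filter that matches a key decides its fate.

```python
from fnmatch import fnmatchcase


def apply_filters(keys, filters):
    """Apply (action, pattern) pairs in order; the last matching filter wins."""
    selected = []
    for key in keys:
        included = True  # everything starts out included
        for action, pattern in filters:
            if fnmatchcase(key, pattern):
                included = (action == "include")
        if included:
            selected.append(key)
    return selected


# The six keys that were in demo-bucket-cdl after the backup copy
keys = ["moved.txt", "subdir/test.txt", "test.txt",
        "backup/moved.txt", "backup/subdir/test.txt", "backup/test.txt"]

first = apply_filters(keys, [("include", "*test*.txt"),
                             ("exclude", "backup/*")])
second = apply_filters(keys, [("exclude", "*"),
                              ("include", "*test*.txt"),
                              ("exclude", "backup/*")])
print(first)   # moved.txt is still selected
print(second)  # only the two test files outside backup/ remain
```

The two results line up with the two dry runs above: the first selection still contains moved.txt, the second does not.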

With the SDK

As far as I know, including and excluding files is a manual process in the SDK. But doing it yourself is easy.

In [47]:
import re
objs = [os.path.join(obj.bucket_name, obj.key) 
        for obj in s3.Bucket(name='demo-bucket-cdl2').objects.all() 
        if re.match(r".*test.*\.txt", obj.key)]
print("\n".join(objs))
demo-bucket-cdl2/subdir/subir/test3.txt
demo-bucket-cdl2/subdir/test2.txt
demo-bucket-cdl2/subdir2/subdir/subdir/test2.txt
demo-bucket-cdl2/subidr2/test3.txt
demo-bucket-cdl2/test3.txt
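
Two refinements might be worth considering. First, re.match only anchors at the start of the key, so a key such as test.txt.bak would also match; re.fullmatch checks the whole key. Second, when only one "folder" is of interest, objects.filter(Prefix=...) lets S3 do part of the filtering before anything is downloaded. A sketch, where matching_keys is a hypothetical helper and bucket is assumed to be an s3.Bucket(...) resource:

```python
import re


def matching_keys(bucket, pattern, prefix=""):
    """Return keys under `prefix` whose whole key matches the regex `pattern`."""
    regex = re.compile(pattern)
    return [obj.key
            for obj in bucket.objects.filter(Prefix=prefix)  # server-side filter
            if regex.fullmatch(obj.key)]                      # client-side filter


# Usage (needs credentials, so it is left commented out here):
# bucket = s3.Bucket("demo-bucket-cdl2")
# print(matching_keys(bucket, r".*test.*\.txt", prefix="subdir/"))
```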

Deleting buckets

With the CLI

Since the bucket isn’t empty, we need to use the --force parameter.

In [48]:
! aws s3 rb s3://demo-bucket-cdl --force
delete: s3://demo-bucket-cdl/subdir/test.txt
delete: s3://demo-bucket-cdl/backup/test.txt
delete: s3://demo-bucket-cdl/backup/moved.txt
delete: s3://demo-bucket-cdl/backup/subdir/test.txt
delete: s3://demo-bucket-cdl/moved.txt
delete: s3://demo-bucket-cdl/test.txt
remove_bucket: demo-bucket-cdl

With the SDK

The bucket needs to be manually emptied before it can be deleted.

In [49]:
bucket = s3.Bucket('demo-bucket-cdl2')

# empty the bucket
for key in bucket.objects.all():
    key.delete()
    
# then delete it
bucket.delete()
Out[49]:
{'ResponseMetadata': {'HTTPHeaders': {'date': 'Wed, 22 Nov 2017 19:55:02 GMT',
   'server': 'AmazonS3',
   'x-amz-id-2': 'soQ/fzy0WW/JED64l3ippaj63ihz+qlU2z/SFgF8QJd42/QXn67tH1XHfO2jno/CeGHygevYpFY=',
   'x-amz-request-id': 'EB58B94E49D8AD23'},
  'HTTPStatusCode': 204,
  'HostId': 'soQ/fzy0WW/JED64l3ippaj63ihz+qlU2z/SFgF8QJd42/QXn67tH1XHfO2jno/CeGHygevYpFY=',
  'RequestId': 'EB58B94E49D8AD23',
  'RetryAttempts': 0}}
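
The loop above issues one delete request per object. Resource collections also offer a batch delete, which removes the objects in larger chunks. The sketch below wraps this in a helper; the name empty_and_delete is made up, and the object_versions line is only relevant if versioning was ever enabled on the bucket, which was not the case in this notebook.

```python
def empty_and_delete(bucket):
    """Delete every object (and object version) in `bucket`, then the bucket."""
    bucket.objects.all().delete()    # batch delete of the current objects
    bucket.object_versions.delete()  # old versions and delete markers, if any
    return bucket.delete()


# Usage (needs credentials, so it is left commented out here):
# empty_and_delete(s3.Bucket("demo-bucket-cdl2"))
```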

Last remarks on S3

I found the CLI and SDK for S3 quite easy to use. Figuring out how to set up my own S3 instance took some time, but the documentation was thorough and accurate. I’m looking forward to using S3 more in the future, but I am still a bit wary about going over the free limits. I checked my usage from writing this post. Here are the results (keep in mind that I ran these commands several times while writing this notebook):

Screen grab of S3 usage

Posting this notebook to WordPress

I followed these instructions to convert this notebook to HTML and post it to WordPress. The main difference is that I didn’t need to complete step 2 (downloading an IPython notebook) since I was already working on my own machine, and in step 3 the parameters were formatted slightly differently. I ran step 3 like so:

In [51]:
! jupyter nbconvert s3-jupyter-blogpost.ipynb --to html --template basic
[NbConvertApp] Converting notebook s3-jupyter-blogpost.ipynb to html
[NbConvertApp] Writing 48203 bytes to s3-jupyter-blogpost.html

I also had to add this to the CSS, since (1) the posted solution added anchor points that I didn’t like, and (2) the default CSS had a dark background that made it hard to read:

`
a.anchor-link{display:none}

.entry-content code {
background-color: rgba(0, 0, 0, 0);
}`

I ignored the advice about suppressing syntax highlighting in the <pre> tag, because I don’t really mind the different syntax coloring that is generated by the plugin that I use.

What’s the verdict on Colaboratory?

My first impressions of Colaboratory are quite positive, even though I needed to switch to standard Jupyter when I realised that I needed to connect to my AWS resources. Nevertheless, I already saw some of the added value of Colaboratory. When I got an exception from running the command import boto3, a button appeared with the label “Search Stack Overflow”. The button did, indeed, search Stack Overflow for this exception. Handy! It is really nice that Google offers this service online, but of course that means that it is limited to the packages that are available in the platform. I also assume that the platform doesn’t support working offline, which would require a local installation of Jupyter.

Colaboratory is easy to use, has a nice interface and seems a bit more intuitive than the original interface for Jupyter notebooks. It also has several features that I haven’t yet seen in the original Jupyter. It is really handy that markdown text sections have a preview which is periodically updated as you type. There is a bit of a delay before new text shows in the preview, but this is already much quicker than in Jupyter, which requires that you run the cell to see how the text looks. One feature that I wish they had (or at least I don’t think is available) is keyboard shortcuts that add markdown formatting, similar to what you have in document editors (for example, Ctrl+I could automatically wrap selected text in asterisks to make it italic).

