Lately I’ve been getting re-acquainted with the boto libraries for interacting with AWS. The newest version is boto3, which is a more stable, service-oriented API – very different from the original boto libraries.
Anyway, one of the first things I looked at was reading an S3 object into a string – a pretty common use case: configuration files in YAML or JSON (or whatever) stored in an S3 bucket need to be pulled into memory before you can do something useful with them.
import boto3

# bucket and key are assumed to be defined elsewhere
s3 = boto3.resource('s3', region_name='ap-southeast-2', use_ssl=True)
obj = s3.Object(bucket, key)
object_string = obj.get()['Body'].read().decode('utf-8')
The problem here is one I’ve seen over and over again – the lower-level implementation details have leaked up through the abstraction layer. In this case we need to “know” to access the ‘Body’ key of the response dict returned by the call to get(). Then we also have to “know” that the object behind that key implements a read() method that returns bytes.
In one line we’ve made a whole bunch of assumptions and bound our code to them. If any of the underlying details change, our code breaks. And being Python, it won’t break nicely either.
The whole point of using an API layer is to abstract away implementation details like this and prevent them polluting the higher-level client codebase. It’s about isolating parts of a software system from each other. This has a whole bunch of benefits that a lot of data scientists aren’t aware of because they haven’t come from a software background. I’ll return to this idea again in the future.
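One way to contain a leak like this is to confine it to a single helper, so the rest of the code never sees ‘Body’ or read(). A minimal sketch – the function name and the injected s3 argument are mine, not part of boto3:

```python
def read_s3_text(s3, bucket, key, encoding='utf-8'):
    """Read an S3 object into a string.

    's3' is a boto3 S3 resource (e.g. boto3.resource('s3')); injecting it
    keeps the helper easy to test with a stub. The 'Body'/read() details
    live here and nowhere else, so a change in the response shape means
    editing one function instead of every call site.
    """
    return s3.Object(bucket, key).get()['Body'].read().decode(encoding)
```

Callers just see a function from (bucket, key) to string; the assumptions are still there, but in exactly one place.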
Anyway, I know Stack Overflow doesn’t always have the best solutions – just ones that work – so I dug around the API a bit and came up with this:
import boto3
import io

s3 = boto3.client('s3', region_name='ap-southeast-2', use_ssl=True)
with io.BytesIO() as mem_buff:
    s3.download_fileobj(bucket, key, mem_buff)
    object_bytes = mem_buff.getvalue()
object_string = object_bytes.decode('utf-8')
Here I’ve stayed high-level all the way – we’re just downloading to an in-memory file object directly.
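Either way, once the object is in memory as a string, the “do something useful” step from earlier is usually just a parse. A minimal sketch for the JSON case – parse_config and the sample config are mine, not from the boto3 API, and a YAML branch would use yaml.safe_load from PyYAML:

```python
import json

def parse_config(object_string, key):
    # Hypothetical helper: choose a parser from the object key's extension.
    # Only JSON is handled here; YAML would need PyYAML's yaml.safe_load.
    if key.endswith('.json'):
        return json.loads(object_string)
    raise ValueError(f'no parser registered for {key!r}')

# object_string as produced by either snippet above
config = parse_config('{"region": "ap-southeast-2", "retries": 3}', 'app.json')
```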
Just a note too – I only read objects from S3 into RAM like this for relatively quick tasks involving small objects (say, less than 1 MB). A more serious implementation would allocate a buffer using the Content-Length response header and read the body in chunks directly into it.
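The chunked approach might look something like this – the function name is mine, but get_object, ContentLength and the streaming Body are all real parts of the boto3 S3 client response:

```python
def read_s3_chunked(s3_client, bucket, key, chunk_size=1 << 20):
    """Sketch of the chunked read described above.

    Pre-allocates a bytearray from the Content-Length header, then fills
    it chunk by chunk from the streaming body instead of one big read().
    's3_client' is a boto3 S3 client (boto3.client('s3')).
    """
    resp = s3_client.get_object(Bucket=bucket, Key=key)
    size = resp['ContentLength']   # from the Content-Length header
    buff = bytearray(size)         # allocate once, up front
    body = resp['Body']            # a botocore StreamingBody
    offset = 0
    while offset < size:
        chunk = body.read(min(chunk_size, size - offset))
        if not chunk:
            break                  # truncated response; stop early
        buff[offset:offset + len(chunk)] = chunk
        offset += len(chunk)
    return bytes(buff)
```

For very large objects you’d go further and avoid holding the whole thing in memory at all, but that’s beyond a quick config read.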
As always – your mileage may vary…