This page provides you with instructions on how to extract data from a PostgreSQL database and load it into Amazon Redshift on an ongoing basis. (If this manual process is a bit more involved than you’d prefer, check out Stitch, which can do all the heavy lifting for you in just a few clicks.)
What is PostgreSQL?
PostgreSQL, often referred to as Postgres, pgSQL, or simply pg, is an object-relational database management system (ORDBMS) with an emphasis on extensibility and on standards-compliance. As a database server, its primary function is to store data securely, supporting best practices, and to allow for retrieval at the request of other software applications. It can handle workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.
Pulling Data Out of PostgreSQL
There are several methods for extracting data from PostgreSQL, detailed below:
Most people retrieve data from databases by simply writing queries. In Postgres, SELECT queries allow you to pull exactly the data you want by specifying filters, ordering, and limiting results. If you have a specific subset of data in mind or are looking to continuously monitor a subset of a specific table, SELECT queries may be a good fit.
If you’re just looking to export data in bulk, however, there is a much easier and faster way to do it. You can use the command-line tool called pg_dump that allows you to export data from a PostgreSQL database as a flat file (i.e. CSV) or a script that can be run to restore the database on any Postgres server.
Preparing PostgreSQL Data for Redshift
Here’s the tricky part: for every table in PostgreSQL, you need a matching table in Redshift to receive the data. Thankfully, Redshift is the perfect match for this since its syntax is actually based on the syntax of Postgres.
If you’ve used pg_dump to export a restore script, most if not all of the code it generates can be simply executed against a Redshift instance to restore the data (or just the structure if that’s what you exported– this may be preferable if you are going to load data using the COPY command as detailed below).
If you haven’t used pg_dump, you can look to the information_schema database, which contains all of the metadata information you’ll need to recreate your tables in another environment.
Inserting PostgreSQL Data into Redshift
With yours table built, and if you haven’t restored via pg_dump, it may seem like the easiest way to add your exported data (especially if there isn’t much of it), is to build INSERT statements to add data to your Redshift table row-by-row. If you have any experience with SQL, this will be your gut reaction. But beware! Redshift isn’t optimized for inserting data one row at a time, and if you have any kind of high-volume data being inserted, you would be much better off loading the data into Amazon S3 and then using the COPY command to load it into Redshift.
Keeping Data Up-To-Date
So, now what? You’ve built a script that pulls data from Postgres and loads it into Redshift, but what happens tomorrow when you have new and updated records in your Postgres database?
Depending on how you’ve built your script, you may be forced to load your entire database into Redshift again. This might be slow and painful depending on your setup, and it can introduce costly delays if latency is of value to you.
The key is to build your script in such a way that it can also identify incremental updates to your data. If your PostgreSQL tables have fields like modified_at or auto-incrementing primary keys, you can build a script that can quickly identify records that are new or changed since your last update (or since the newest record you’ve copied into Redshift). You can set your script up as a cron job or continuous loop to keep pulling down new data as it appears.
Easier and Faster Alternatives
If you made it this far without your brain getting fried, chances are building and maintaining a script like this isn’t a very high-leverage use of your time.
Thankfully, products like Stitch were built to solve this exact problem automatically. With just a few clicks, Stitch starts extracting your PostgreSQL data, structuring it in a way that is optimized for analysis, and inserting that data into your Amazon Redshift data warehouse.