Getting Started With Amazon Redshift
I recently worked on a project in which I was asked to migrate someones analytics database into a Redshift cluster. The backend of this website was using Django, so all of my development was done in Python.
A had a couple of goals before starting the project:
- Keep everything in Python
- Utilize Django’s ORM when possible
- Make migrations easy
The first step is to launch a Red Shift cluster on EC2 and your local ip address has to be added CIDR/IP security group of in the redshift dashboard. For testing I like to set it to 0.0.0.0\0
.
Once you have your cluster up and running, you can try to connect to it using the following command
psql "host=XXXXXXXX.cilmdlzscpk4.us-east-1.redshift.amazonaws.com user=XXXXXXX dbname=XXXXX port=5439"
If you can successfully connected, you’re all set, you need to setup your tables. I used the psycopg2
to setup these tables, but you could of course manually enter them into the REPL. If you want to take my approach first, assuming you are running on ubuntu, install the postgres
requirements
sudo apt-get install -y postgresql-client-9.3 libpq-dev
Now install psycopg2
using pip:
pip install psycopg2
Since we are going to utilize django's
orm we need to update settings.py
:
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.mysql',
'NAME': 'home',
'USER': 'home',
'PASSWORD': 'home',
'HOST': 'localhost',
},
'analytics': {
'NAME': 'redshift',
'ENGINE': 'django.db.backends.postgresql_psycopg2',
'USER': 'redshift',
'PASSWORD': 'xxxxxxxxx',
'HOST': 'XXXXXXXXXXX.cilmdlzscpk4.us-east-1.redshift.amazonaws.com',
'PORT': 5439,
}
}
We also need to use a database router, to make sure that reads and writes go to redshift on a per app basis (this is something I did per the design of my application, but you might not need to do this):
class Router(object):
def db_for_read(self, model, **hints):
if model._meta.app_label == 'analytics':
return 'analytics'
return None
def db_for_write(self, model, **hints):
if model._meta.app_label == 'analytics':
return 'analytics'
return None
def allow_relation(self, obj1, obj2, **hints):
if obj1._meta.app_label == 'analytics' or \
obj2._meta.app_label == 'analytics':
return True
return None
def allow_migrate(self, db, app_label, model=None, **hints):
if app_label == 'analytics':
return False
return None
And finally, add the router to settings.py
DATABASE_ROUTERS = ['path.to.router.Router',]
OK. everything is setup and ready to go! Adding tables is as simple as calling SQLs create table
command:
CREATE_TABLE = """
create table mytable(
id INT IDENTITY(1,1),
event_id integer not null,
job_id integer not null,
count integer not null,
primary key(id));
"""
Now using psycopg2
:
conn = psycopg2.connect(
host=settings.DATABASES['redshift']['HOST'],
user=settings.DATABASES['redshift']['USER'],
port=settings.DATABASES['redshift']['PORT'],
password=settings.DATABASES['redshift']['PASSWORD'],
dbname=settings.DATABASES['redshift']['NAME'])
cur = self.conn.cursor()
cur.execute(CREATE_TABLE)
conn.commit()