Often in bioinformatics I receive a dataset that is entirely
non-relational. For instance, every row is a gene, every column is a
biological sample, and the cell values are the expression levels of each
gene measured by microarray. To join such datasets to others (e.g.
metadata on the samples and on the genes), I need to relationalize the
table. In the above example, that would mean building a new table with
three columns (gene, sample, expression_level). Trouble is, then later
if I want to do any fancy matrix math on the data I need to
de-relationalize it, getting back to the rows=genes, columns=samples
layout. This is called a pivot table and can be achieved in
PostgreSQL using the
To avoid spending my life typing out column names, I wrote a function in Postgres procedural language (PL/pgSQL) that will generate a
Here’s the code. I run PostgreSQL 9.2 but I believe this should work at least as far back as 8.4.
Executing that last select statement will return the following:
And executing that query will in turn give you:
The underscore in front of the column names is so that it will still work if the columns start with numbers.
if your rows and columns in the final pivoted table won’t change, then you can just add a
crosstab()
function, but there’s a catch: you need to type out all of the column names.To avoid spending my life typing out column names, I wrote a function in Postgres procedural language (PL/pgSQL) that will generate a
crosstab
query automatically. You can’t execute the generated crosstab
query automatically because PL/pgSQL functions that return tables (setof record
functions) cannot have the number and type of columns determined on the fly. Instead this function returns the query in
a varchar, and then you can execute the query to get the pivot table
you want. That’s an extra round-trip to the database if you’re working
programmatically, or an extra copy-and-paste if you’re working in the
query editor. But it’s better than typing out column names.Here’s the code. I run PostgreSQL 9.2 but I believe this should work at least as far back as 8.4.
-- PL/pgSQL code to create pivot tables with automatic column names -- Eric Minikel, CureFFI.org - 2013-03-19 -- prerequisite: install the tablefunc module create extension tablefunc; -- tablename: name of source table you want to pivot -- rowc: the name of the column in source table you want to be the rows -- colc: the name of the column in source table you want to be the columns -- cellc: an aggregate expression determining how the cell values will be created -- celldatatype: desired data type for the cells create or replace function pivotcode (tablename varchar, rowc varchar, colc varchar, cellc varchar, celldatatype varchar) returns varchar language plpgsql as $$ declare dynsql1 varchar; dynsql2 varchar; columnlist varchar; begin -- 1. retrieve list of column names. dynsql1 = 'select string_agg(distinct ''_''||'||colc||'||'' '||celldatatype||''','','' order by ''_''||'||colc||'||'' '||celldatatype||''') from '||tablename||';'; execute dynsql1 into columnlist; -- 2. set up the crosstab query dynsql2 = 'select * from crosstab ( ''select '||rowc||','||colc||','||cellc||' from '||tablename||' group by 1,2 order by 1,2'', ''select distinct '||colc||' from '||tablename||' order by 1'' ) as newtable ( '||rowc||' varchar,'||columnlist||' );'; return dynsql2; end $$ -- toy example to show how it works create table table_to_pivot ( rowname varchar, colname varchar, cellval numeric ); insert into table_to_pivot values ('row1','col1',11); insert into table_to_pivot values ('row1','col2',12); insert into table_to_pivot values ('row1','col3',13); insert into table_to_pivot values ('row2','col1',21); insert into table_to_pivot values ('row2','col2',22); insert into table_to_pivot values ('row2','col3',23); insert into table_to_pivot values ('row3','col1',31); insert into table_to_pivot values ('row3','col2',32); insert into table_to_pivot values ('row3','col3',33); select pivotcode('table_to_pivot','rowname','colname','max(cellval)','integer');
Executing that last select statement will return the following:
select * from crosstab ( 'select rowname,colname,max(cellval) from table_to_pivot group by 1,2 order by 1,2', 'select distinct colname from table_to_pivot order by 1' ) as newtable ( rowname varchar,_col1 integer,_col2 integer,_col3 integer );
And executing that query will in turn give you:
The underscore in front of the column names is so that it will still work if the columns start with numbers.
if your rows and columns in the final pivoted table won’t change, then you can just add a
create view
statement to the front of it, like so:
create view view_of_pivoted_table as select * from crosstab ( 'select rowname,colname,max(cellval) from table_to_pivot group by 1,2 order by 1,2', 'select distinct colname from table_to_pivot order by 1' ) as newtable ( rowname varchar,_col1 integer,_col2 integer,_col3 integer ); select * from view_of_pivoted_table;
No comments:
Post a Comment