How to identify tables referenced or not referenced in the last 90 days in Pivotal Greenplum
search cancel

How to identify tables referenced or not referenced in the last 90 days in Pivotal Greenplum

book

Article ID: 296080

calendar_today

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

You may be interested in answering one of the following questions: 
  • How do I locate who accessed a specific table in the last 90 days in Pivotal Greenplum?
  • How do I locate tables that have not been accessed in the last 90 days in Pivotal Greenplum?


Environment


Resolution

Note: This procedure assumes you are looking for this information from the last 90 days.
This time frame can be adjusted by modifying the number in queries where you see, current_date -90.

These queries are run against the Postgres log files in $MASTER_DATA_DIRECTORY/pg_log. If you archive and remove Postgres log files on a regular basis, this may limit how far back you can actually go using the logic presented in this article.

Additionally, the logic presented here requires log connections to be turned on (log_connections=on) in the postgresql.conf file:  

In this case, the table audit_table_full contains the raw data regarding table access.

Note: If you want the audit table to retain historical data, comment out the "drop table" statement. If this is done, it may cause an error message indicating the table already exists. This behavior is expected, and can be ignored:
drop table if exists audit_table_full;

create table audit_full
( loguser varchar(16),             -- username
  logdatabase varchar(128),    -- database name
  logsession varchar(20),         -- session id
  tname varchar(128),              -- table name
  updated timestamp)              -- when touched)
distributed randomly;
create temp table auditor
  (tname varchar(128))
  distributed randomly;

insert into auditor
 select schemaname || '.' || tablename from pg_catalog.pg_tables
where schemaname not in ('information_schema','gp_toolkit','pg_catalog') and   
          schemaname not like 'pg_temp%'   -- exclude temp tables
union
  select ' '   || tablename from pg_catalog.pg_table
  where schemaname = 'public' ;
The Temp table holds a record for each database connection record found.
create temp table audit_connect
(logtime timestamp,           -- timestamp on pg_log record 
loguser varchar(64),          -- userid
logdatabase varchar(64),  -- database name
logsession varchar(20),     -- session id
tname varchar(128))          -- table name
distributed randomly;
Read from Postgres log file(s) and insert the appropriate records into audit_connect table 1 for each connection made by any user connecting to this database.
insert into audit_connect
select min(logtime), loguser, logdatabase, logsession, tname
from gp_toolkit.__gp_log_master_ext, auditor
where logmessage like 'connection authorized:%'
and logtime > current_date - 90     -- check for activity in last 90 days
and logsession >  ' '
and logdatabase = current_database()
group by loguser, logdatabase, logsession, tname;
Get all records referencing any table in the database from the Postgres log file. The session must match session id in audit_connect and the table referenced must be in the auditor table. Since you know the log time of the connection record, use this in the "where" clause to limit how far back Postgres log file is searched.
create temp table audit_work
( loguser varchar(64),          -- userid
  logdatabase varchar(64),   -- database name
  logsession varchar(20),      -- session id
  tname varchar(128),           -- table name
  updated timestamp)            -- timestamp from postgres log file
distributed randomly;
Note: The cartesian product on the insert statement below is intended. Any type of activity (select, insert, update, or delete) is considered 'valid'. Depending on your needs, you may want to exclude some of these actions. 
insert into audit_work
      select b.loguser,
                b.logdatabase,
                b.logsession,
                b.tname,
                a.logtime
       from gp_toolkit.__gp_log_master_ext a,
               audit_connect b
       where a.logtime > b.logtime and
                  a.loguser = b.loguser and
                  a.logdatabase = b.logdatabase and
                  a.logsession = b.logsession and
                  a.logmessage like 'statement:%' and
                  (    a.logmessage like '% ' || ltrim(b.tname) || ' %'  or
                       a.logmessage like '% ' || ltrim(b.tname) || ';%')  and
                  (   a.logmessage like '%insert%' or
                      a.logmessage like '%update%' or
                      a.logmessage like '%select%' or
                      a.logmessage like '%delete)   
             ;
You need to address the public tables. A user may have entered a table name or qualified with a public schema. The following command will find all tables without a public schema and updated them to reflect the public schema.
update audit_work 
   set tname = 'public.' || substr(tname,2,127)
       where substr(tname,1,1) = ' ';
If you are recreating the audit_table_full each execution time, the next SQL statement will not remove any records. However if you are keeping the audit_table_full for history, then it is important to remove potential duplicate records.
delete from audit_work a
where exists
   (select 1 from audit_full b
     where a.loguser = b.loguser and
                a.logdatabase  = b.logdatabase and
                a.logsession = b.logsession and
                a.tname = b.tname and
                a.updated = b.updated);
Now insert the data into the catalogued audit_full table.
insert into audit_full select * from audit_work;
Create a table with all tables not referenced in last 90 days.

Note: Only pick up tables with a schema name, do not pick up tables with a blank schema name, as these have been updated in audit_full to have <public>.<tablename>.
drop table if exists audit_not_referenced;
create table audit_not_referenced
   as select a.tname from auditor a
  where a.tname not in
      (select distinct(tname) from audit_full)
       and substr(a.tname,1,1) != ' ';

The table, audit_not_referenced, has a list of all tables currently defined in the database but were not found in the audit_full table. Below are examples of other queries that can be run against the audit_full table.

If you are not cleaning up audit_full but still want a list of tables not referenced in last 90 days, run the following query:
select schemaname || '.' || tablename as unreferenced_table
from pg_catalog.pg_tables a
   where schemaname not in ('information_schema','gp_toolkit','pg_catalog')
       and schemaname not like 'pg_temp%'
       and not exists
            (select 1 from audit_full b
             where tname = a.schemaname || '.' || a.tablename
                 and updated > current_date - 90);  -- this line defines states how far to go back
Run the below command to find out when each table was last referenced:
select tname as table_name,
          max(updated)  as last_referenced
   from audit_full
  group by 1  order by 1;
Run the following query to find out which tables a particular user referenced, along with a count of how many times they referenced the table. The example below uses user 'testuser'.
select loguser  as userid, 
           tname as tablename ,
           max(updated) as last_update,
           count(*) as count
    from audit_full b
      where loguser = 'testuser'
      group by 1,2 order by 1,2;