Can someone please share any insights into how you have implemented data masking and obfuscation of sensitive data before it lands in cloud storage such as S3 or Redshift? I have tried Glue DataBrew, but that solution does not scale when a large number of files arrive in the bucket concurrently.
no title · 5 months ago
Thanks for sharing.
Data Architect in Government · 8 months ago
Does it have to be masked in-flight? Have you considered landing it raw and then masking it?
no title · 5 months ago
Both masking in flight and masking after landing are in scope. We may already have sensitive fields in our historical data that we would like to clean up, and we would also like to prevent any new sensitive data from coming in.
For auditability and troubleshooting, we bring the file into an S3 landing bucket in raw form and then apply tokenization to the sensitive columns. The next step deletes the raw data immediately after use, or retains it for a couple of days to support troubleshooting. We also ensure that no one except the service account can read the raw file.
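In case a concrete starting point helps, here is a minimal sketch of that pattern: an S3-triggered Lambda that reads a CSV from the raw landing bucket, replaces sensitive column values with deterministic HMAC tokens, and writes the result to a clean bucket. The bucket name, column names, and secret source are assumptions, not anything prescribed above.

```python
import csv
import hashlib
import hmac
import io
import os
import urllib.parse

import boto3

CLEAN_BUCKET = "my-clean-bucket"              # assumption: destination for tokenized files
SENSITIVE_COLUMNS = {"ssn", "email"}          # assumption: columns to tokenize
SECRET = os.environ["TOKEN_SECRET"].encode()  # keyed secret, e.g. loaded from Secrets Manager

s3 = boto3.client("s3")


def tokenize(value: str) -> str:
    """Deterministic, non-reversible token via HMAC-SHA256."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()


def handler(event, context):
    """Triggered by s3:ObjectCreated on the raw landing bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys are URL-encoded, so decode before use.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        reader = csv.DictReader(io.StringIO(body))

        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Tokenize only the sensitive columns that actually exist in this file.
            for col in SENSITIVE_COLUMNS & set(row):
                row[col] = tokenize(row[col])
            writer.writerow(row)

        # Write the tokenized copy; the raw object stays behind for the short
        # troubleshooting window and is then expired by a lifecycle rule.
        s3.put_object(Bucket=CLEAN_BUCKET, Key=key, Body=out.getvalue().encode("utf-8"))
```

The retention and access parts of the approach map to S3 features rather than code: a lifecycle rule with a short expiration (e.g. two days) on the raw prefix handles the cleanup window, and a bucket policy that denies GetObject to everyone except the service-account role enforces the read restriction. Because Lambda fans out one invocation per object, this pattern also tends to handle many concurrent files better than a single DataBrew job.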