The Personal Data Store Pattern
With the recent trend towards data protection and privacy, as well as the requirements of data protection regulations like GDPR and CCPA, some organizations are trying to reorganize their personal data so that it has a higher level of protection.
One path that I’ve seen organizations take is to apply what I call the “personal data store” pattern: extract all personal data from existing systems and store it in a single place, where it’s accessible via APIs (or, in some cases, directly through the database). The personal data store is well guarded and audited, keeps a proper audit trail, performs anomaly detection, and offers privacy-preserving features.
It makes sense to focus one’s data protection efforts predominantly in one place rather than scatter them across dozens of systems. Of course, it’s far from trivial to migrate so much data from legacy systems to a new module and then upgrade those systems so they can still request and use the data when needed. That’s why in some cases the pattern is applied only to sensitive data – medical records, biometrics, credit card numbers, etc.
For the sake of completeness: there’s something else called “personal data stores”, which refers to an architecture where users store their own data themselves in order to stay in control of it. While this is nice in theory, in practice very few users have the capacity to do so, and while I admire the Solid project, for example, I don’t think it is a viable pattern for many organizations, as in many cases users don’t interact with the company directly, yet the company still processes large amounts of their personal data.
Also for the sake of completeness, there’s “master data management”, which is closely related to what I’m describing here. It is not concerned with personal data per se – the focus there is on ensuring data is correct through proper procedures – but a personal data store can serve as one way to implement master data management for personal data.
So, the personal data store pattern is an architectural approach to personal data protection. It can be implemented as a “personal data microservice” with CRUD operations on predefined data entities, as an external service (e.g. SentinelDB, a project of mine), or simply as a centralized database with a proxy in front of it to control the access patterns. You can think of it as externalizing your application’s “users” table and its related tables.
It sounds a little bit like a data warehouse for personal data, but the major difference is that it’s used for operational data, rather than (just) analysis and reporting. All (or most) of your other applications/microservices interact constantly with the personal data store whenever they need to access or update (or “forget”) personal data.
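To make the “externalized users table” idea concrete, here is a minimal in-memory sketch (in Python, with hypothetical names – a real personal data store would sit behind a REST API and a real database):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional
import uuid

@dataclass
class PersonalDataStore:
    """Toy stand-in for a personal data microservice exposing
    CRUD operations on a predefined 'user' entity."""
    _users: Dict[str, dict] = field(default_factory=dict)

    def create_user(self, attributes: dict) -> str:
        user_id = str(uuid.uuid4())
        self._users[user_id] = dict(attributes)
        return user_id

    def read_user(self, user_id: str) -> Optional[dict]:
        return self._users.get(user_id)

    def update_user(self, user_id: str, changes: dict) -> None:
        self._users[user_id].update(changes)

    def delete_user(self, user_id: str) -> None:
        # The operational "forget" primitive other systems call into.
        self._users.pop(user_id, None)

store = PersonalDataStore()
uid = store.create_user({"name": "Alice", "email": "alice@example.com"})
store.update_user(uid, {"email": "alice@new.example.com"})
print(store.read_user(uid)["email"])  # alice@new.example.com
```

All other applications would call these operations over the network instead of joining against a local `users` table.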
In my view, the main features of such a personal data store – the combination of which protects against data breaches – include:
- Easy-to-use interface (e.g. RESTful web services or plain SQL) – systems that integrate with the personal data store should be built in such a way that a simple DAO layer implementation can be swapped out, so that data previously accessed from a local database is now obtained from the personal data store. This is not always easy, as ORM technologies add a layer of complexity.
- High level of general security – servers protected with 2FA, access control, segregated networks, restricted physical access, firewalls, intrusion prevention systems, etc. The good thing is that it’s easier to apply all the best practices to a single system (and keep them applied) than to do so for every system.
- Encryption – but not just “data at rest” encryption; especially sensitive data can and should be encrypted with well-protected and regularly rotated keys. That way the “honest but curious” admin won’t be able to extract anything from the underlying database.
- Audit trail – all infosec and data protection standards and regulations focus on accountability and traceability. There should be no way to extract or modify personal data without leaving a trace (and ideally, that trace should be protected as well).
- Anomaly detection – checking whether there is something strange or anomalous in the data access patterns. Such access patterns can indicate a data breach in progress, and the personal data store can actively block it. There is a lot of software out there that does anomaly detection on network traffic, but it’s much better if the rules (or machine learning) are domain-specific. “Monitor for increased traffic to those servers” is one thing, but it’s much better to be able to say “monitor for out-of-the-ordinary access to personal data of such and such kind”.
- Pseudonymization – many systems that need the personal data don’t actually need to know who it is about. That includes marketing (including outsourcing to third parties), reporting functionality, etc. So the personal data store can return data that does not allow a person to be identified, together with a pseudonymous ID instead. That way, when updates are sent back to the personal data store, they can still refer to a particular person via the pseudonymous ID, but the application that extracted the data never learns who the data was about. This is useful in scenarios where data has to be stored (temporarily or not) in a database outside the personal data store.
- Authentication – if the company offers user authentication, this can be done via the personal data store. Passwords, two-factor authentication secrets and other means of authentication are personal data too – and important personal data at that. An organization may use single sign-on internally (e.g. Active Directory), but it doesn’t make sense to put customers there as well, so they are usually stored in a database. During authentication, the personal data store accepts all necessary credentials (username, password, 2FA code) and returns a token to be used for subsequent calls or as a session cookie.
- GDPR (or CCPA or similar) functionality – e.g. exporting all data about a person, or forgetting a person. That’s an often overlooked problem, but “give me all the data you have about me” is an enormous issue for large companies with dozens of systems – it’s next to impossible to extract the data in a sensible way from all of them. Tracking all such requests is itself a requirement, so the personal data store can keep track of them to present to auditors if needed.
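The DAO-swap point from the first bullet can be sketched as follows – the rest of the application codes against an interface, and only the implementation behind it changes (names here are hypothetical):

```python
from abc import ABC, abstractmethod

class UserDao(ABC):
    """The interface business code depends on; it stays stable
    regardless of where the data physically lives."""
    @abstractmethod
    def find_email(self, user_id: str) -> str: ...

class LocalDbUserDao(UserDao):
    """Original implementation, reading from a local table (stubbed here
    as a dict; in reality an ORM or SQL query)."""
    def __init__(self, local_table: dict):
        self.local_table = local_table

    def find_email(self, user_id: str) -> str:
        return self.local_table[user_id]["email"]

class PersonalDataStoreUserDao(UserDao):
    """Drop-in replacement that calls the personal data store's API
    instead; `pds_client` is a hypothetical HTTP client wrapper."""
    def __init__(self, pds_client):
        self.pds_client = pds_client

    def find_email(self, user_id: str) -> str:
        return self.pds_client.get_user(user_id)["email"]
```

Swapping `LocalDbUserDao` for `PersonalDataStoreUserDao` at wiring time is the migration step; nothing above the DAO layer needs to change.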
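The key-rotation bookkeeping behind the encryption bullet can be illustrated like this. Note the cipher here is a deliberately simple toy (SHA-256 keystream XOR, no authentication) just to keep the example stdlib-only – a real system would use an authenticated cipher such as AES-GCM; the point is that each ciphertext records the key version it was encrypted under, so rotation is possible without losing old data:

```python
import hashlib
import secrets

def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher for illustration only (NOT production crypto)."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

class FieldEncryptor:
    """Keeps versioned keys so individual records can be re-encrypted
    under a fresh key, while old records remain decryptable."""
    def __init__(self):
        self.keys = {1: secrets.token_bytes(32)}
        self.current_version = 1

    def encrypt(self, plaintext: bytes):
        nonce = secrets.token_bytes(16)
        ct = _keystream_xor(self.keys[self.current_version], nonce, plaintext)
        # Store the key version alongside the ciphertext.
        return (self.current_version, nonce, ct)

    def decrypt(self, record) -> bytes:
        version, nonce, ct = record
        return _keystream_xor(self.keys[version], nonce, ct)

    def rotate(self, record):
        """Generate a new key version and re-encrypt the record under it."""
        plaintext = self.decrypt(record)
        self.current_version += 1
        self.keys[self.current_version] = secrets.token_bytes(32)
        return self.encrypt(plaintext)
```

In production the keys themselves would live in an HSM or a key management service, not in application memory.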
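A tamper-evident audit trail (the “that trace should be protected as well” part) is often built by hash-chaining entries, so that modifying any past entry breaks the chain. A minimal sketch:

```python
import hashlib
import json
import time

class AuditTrail:
    """Append-only log where each entry embeds the hash of the previous
    one; later tampering with any entry makes verification fail."""
    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64  # genesis value

    def record(self, actor: str, action: str, subject: str) -> None:
        entry = {
            "actor": actor, "action": action, "subject": subject,
            "ts": time.time(), "prev": self._last_hash,
        }
        self._last_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.entries.append(entry)

    def verify(self) -> bool:
        """Recompute the chain; any edited entry breaks a 'prev' link."""
        prev = "0" * 64
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            prev = hashlib.sha256(
                json.dumps(entry, sort_keys=True).encode()
            ).hexdigest()
        return True
```

In practice the chain head (or periodic checkpoints) would be written somewhere the database admin cannot reach, e.g. a separate append-only service.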
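A domain-specific anomaly rule of the kind described above – “this actor is reading far more people’s records than normal work requires” – can be as simple as a sliding-window counter of distinct data subjects per actor. The thresholds here are made-up illustrative values:

```python
import time
from collections import defaultdict
from typing import Optional

class AccessAnomalyDetector:
    """Flags (and lets the caller block) an actor that reads personal data
    of more than `max_subjects` distinct people within `window` seconds -
    a pattern typical of bulk exfiltration, not per-customer work."""
    def __init__(self, max_subjects: int = 50, window: float = 60.0):
        self.max_subjects = max_subjects
        self.window = window
        self.accesses = defaultdict(list)  # actor -> [(timestamp, subject_id)]

    def allow(self, actor: str, subject_id: str,
              now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Keep only accesses inside the sliding window, then add this one.
        log = [(t, s) for (t, s) in self.accesses[actor]
               if now - t < self.window]
        log.append((now, subject_id))
        self.accesses[actor] = log
        distinct_subjects = {s for _, s in log}
        return len(distinct_subjects) <= self.max_subjects
```

Real deployments would layer several such rules (per role, per data category, per time of day) and route violations to both blocking and alerting.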
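The pseudonymization bullet can be sketched with a keyed HMAC: the pseudo-ID is stable per person (so updates can be linked back) but cannot be reversed without the key, which stays inside the personal data store. Field names here are illustrative:

```python
import hashlib
import hmac

class Pseudonymizer:
    """Derives a stable pseudonymous ID per person with a keyed HMAC and
    strips directly identifying fields before data leaves the store."""
    def __init__(self, secret_key: bytes):
        self.secret_key = secret_key  # never leaves the personal data store

    def pseudo_id(self, real_user_id: str) -> str:
        return hmac.new(self.secret_key, real_user_id.encode(),
                        hashlib.sha256).hexdigest()

    def strip_identifiers(self, record: dict,
                          identifying_fields=("name", "email")) -> dict:
        """Return the record without identifying fields, keyed by pseudo-ID."""
        cleaned = {k: v for k, v in record.items()
                   if k not in identifying_fields}
        cleaned["pseudo_id"] = self.pseudo_id(record["user_id"])
        cleaned.pop("user_id", None)
        return cleaned
```

An external marketing or reporting system receives only the cleaned record; when it sends updates back keyed by `pseudo_id`, the store resolves them to the real person internally.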
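The authentication flow – credentials in, opaque token out – might look like the sketch below, with passwords stored as salted PBKDF2 hashes rather than plaintext (2FA verification is omitted for brevity):

```python
import hashlib
import hmac
import secrets

class AuthService:
    """Sketch of the personal data store's authentication endpoint:
    it accepts credentials and returns a token for subsequent calls."""
    def __init__(self):
        self._users = {}     # username -> (salt, password_hash)
        self._sessions = {}  # token -> username

    def register(self, username: str, password: str) -> None:
        salt = secrets.token_bytes(16)
        pw_hash = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                      salt, 100_000)
        self._users[username] = (salt, pw_hash)

    def authenticate(self, username: str, password: str):
        """Return an opaque session token, or None on failure."""
        if username not in self._users:
            return None
        salt, stored = self._users[username]
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(),
                                        salt, 100_000)
        # Constant-time comparison to avoid timing side channels.
        if not hmac.compare_digest(candidate, stored):
            return None
        token = secrets.token_urlsafe(32)
        self._sessions[token] = username
        return token
```

Downstream services then validate the token with the store instead of ever seeing the credentials themselves.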
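Finally, the export/forget bullet becomes almost trivial once the data is centralized – which is much of the pattern’s payoff. A sketch, including the request log for auditors:

```python
class GdprRegistry:
    """Sketch of data-subject request handling on a central store:
    export and erasure, with every request logged for auditors."""
    def __init__(self):
        self.records = {}      # user_id -> dict of personal data
        self.request_log = []  # (request_type, user_id) tuples

    def export_all(self, user_id: str) -> dict:
        """'Give me all data about me' - one lookup, not a trawl
        through dozens of systems."""
        self.request_log.append(("export", user_id))
        return dict(self.records.get(user_id, {}))

    def forget(self, user_id: str) -> None:
        """Right-to-erasure request; the log entry itself is kept."""
        self.request_log.append(("erasure", user_id))
        self.records.pop(user_id, None)
```

Retention rules (e.g. data that must legally be kept despite an erasure request) would hook into `forget`, but that policy logic is out of scope here.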
That’s all easier said than done. In organizations that already have many systems running side by side and processing personal data, migration can be costly. So it’s a good idea to introduce the pattern as early as possible, and to have a plan (even one that takes years to execute) to move at least the sensitive personal data into the well-protected silo. Building this silo is a data engineering effort, a system refactoring effort and an organizational effort. The benefits, though, are reduced long-term cost and reduced risk of data breaches and non-compliance.