Dismantling Data De-Identification: Towards a Relational Data Governance Framework

Abstract

Data governance across Canada is undergoing major reforms. In this context, data de-identification is being proposed as a mechanism for preserving individuals’ privacy while still allowing the data economy to grow. Through an analysis of standard de-identification protocols and by applying the lenses of data justice and data democracy, we consider de-identification’s technical and conceptual limits. We argue that data governance must recognize data’s specificity, relationality, and structural components to effectively navigate the realities of an increasingly concentrated data economy. As we demonstrate, data de-identification’s individualised model of data harm cannot account for these dimensions.

Introduction

In 2018 and 2019, Sidewalk Labs (Roth, 2018, 2019) and Facebook’s use of data (Hemmadi, 2019) occupied national headlines, and debates over the collection and use of personal data in Canada gained traction in public discourse. In this context, the Canadian federal government was preparing to overhaul Canada’s privacy and data protection regimes with updates to PIPEDA and the Privacy Act. And in the private sector, several of Canada's largest data holders came together to form CANON—the Canadian Anonymization Network. CANON’s membership includes all three national telecommunications companies (Telus, Bell, and Rogers), credit reporting agency TransUnion, payment processor Moneris, CIBC, and TD Bank (Canadian Anonymization Network, 2019a).

CANON’s goal is to advocate for and develop standards around data de-identification—the process of manipulating data in such a way that it can no longer be easily associated with an individual it represents. While specific techniques vary depending on the kind of data in question (Ghinita et al., 2011), the general goal of data de-identification is to reduce the risk that any individual person can be identified from a data set. De-identification is typically used to enable data sharing—particularly in the health sector. Ultimately, CANON’s goal is to advocate for legislation and policies that broaden the uses of de-identification and allow for expanded big data processing and increased data flows between business sectors (Canadian Anonymization Network, 2019a).

CANON’s members have already demonstrated the ways they hope to use this expanded opportunity. In particular, Canadian telecommunications companies have signalled a desire both to sell network data as a new revenue stream and to increase internal collection and analysis of data to optimise their operations and expand into new markets. For example, Bell acquired data aggregator Environics Analytics—signalling a move into data brokerage (Environics Analytics, 2020)—and Telus has become increasingly public about leveraging the data produced by its telecommunications network through its Data for Good initiative (Telus, n.d.). Bell has also recently announced plans to sell advertisers access to de-identified network and usage data through its “DSP” program (Bell Media, 2021). These practices currently exist in a legal grey area; CANON’s work aims to legitimise them and cement their place in Canadian policy.

Increased allowances for the collection, processing, and movement of de-identified data amount to a major change in the Canadian communications and telecommunications landscape. The debate over the use of data provided by Telus to the Public Health Agency of Canada (PHAC) underscores that Canadian telecommunication providers are shifting from being solely service and infrastructure providers towards being data brokers (Boutilier, 2022; Parsons, 2022). But despite these broad implications, data and de-identification are largely being addressed in the context of privacy legislation. The federal Bill C-11—which was introduced in 2020 and ultimately died on the order paper when Parliament was dissolved for the 2021 federal election—would have put de-identification at the heart of the Canadian data regulation regime. And the re-elected Liberal government has indicated that it intends to re-introduce substantially similar legislation in the new parliament (Trudeau, 2021). Provincially, Quebec’s Bill 64 was adopted unanimously by the National Assembly in September 2021, making de-identification a major part of how the province regulates the use of personal information.

As it stands, data de-identification is being positioned as a broad solution to perceived harms resulting from intersectoral data sharing (such as between PHAC and Telus). However, these broad applications do not consider data’s specificities; different kinds of data (Ghinita et al., 2011) and data from different sources each introduce their own complexities. Moreover, narrowly addressing data harms through privacy legislation focuses only on the risk of personal re-identification. This is an important issue, but it is not the entire problem.

By considering these issues through a data justice lens, we argue that there is a range of harms which data de-identification not only fails to consider but actively exacerbates. Drawing on the academic literature on de-identification and re-identification, as well as an analysis of standard de-identification protocols, we argue that de-identification as mobilised in Canadian legislation fails to address the risk of re-identification and the general leakiness of de-identified data. We then draw on data justice and data democracy perspectives to argue that, at a more fundamental level, the treatment of data as an individual good fails to recognize the relationality of data. Ultimately, we demonstrate the unaddressed technical issues with data de-identification and the conceptual limits of approaching data harms from a privacy perspective. We argue that these issues make de-identification an insufficient framework for general data governance. Instead, data governance must recognize data’s specificity, relationality, and structural components to effectively navigate the realities of an increasingly concentrated data economy.

History and Context of De-Identification

De-identification of data as a means of privacy protection long predates the current wave of regulatory reforms. In particular, data de-identification is commonplace in the sharing of clinical and health data, and has been an important tool for enabling health research (Health Canada, 2019; Huser & Shmueli-Blumberg, 2018).

As more sectors of the economy use data in their business practices, de-identification has increasingly expanded beyond healthcare. Within Canada, the de-identification of data plays a significant role in Ontario’s Freedom of Information and Protection of Privacy Act (2021). And several of Canada’s largest data holders have come together to advocate for and develop standards around data de-identification through CANON (Canadian Anonymization Network, 2019a). In all these cases, de-identification is positioned as allowing for the continued large-scale exploitation of data while preserving individual privacy.

De-identification also features prominently in two of the major data governance reforms recently proposed in Canadian legislatures—the federal Bill C-11 and Quebec’s Bill 64.

Bill C-11—whose full title was “An Act to enact the Consumer Privacy Protection Act and the Personal Information and Data Protection Tribunal Act and to make consequential and related amendments to other Acts”—was the Canadian federal government’s most recent proposal to regulate the use of data in Canada. Perhaps the most significant change proposed by C-11 was an explicit requirement that informed consent be obtained for the “collection, use or disclosure of the individual’s personal information” (C-11, 2020, §15). But this requirement came with caveats, as in many cases C-11 did not require consent so long as data was de-identified. For example, Bill C-11 would have allowed de-identified information to be used for internal research and development, for “socially beneficial purposes,” or for prospective business transactions without consent.

These are the explicit ways C-11 mobilised de-identification, but the bill also used the concept more subtly in its definition of “personal information” itself. Carrying over a definition from C-11’s predecessor—the Personal Information Protection and Electronic Documents Act (PIPEDA)—C-11 would have defined personal information as “information about an identifiable individual” (C-11, 2020; PIPEDA, 2019). As such, if de-identified information was no longer considered to be about an identifiable individual, it would have fallen entirely outside the scope of the bill’s protections.

Quebec’s Bill 64—“An Act to modernize legislative provisions as regards the protection of personal information”—takes an approach to data and de-identification similar to that of the federal Bill C-11. Like C-11, Bill 64 requires that consent be given for the use of personal information. And like C-11, Bill 64 creates exceptions to that requirement, including not requiring consent “if [personal information’s] use is necessary for study or research purposes or for the production of statistics and if the information is de-identified” (Bill 64, 2021, §110).

The major difference in how Bill 64 treats de-identified data, however, is in its definition of “personal information.” Unlike C-11, which defined de-identified information as no longer being personal information at all, Bill 64 creates a separate category for de-identified personal information. It defines personal information as de-identified if “it no longer allows the person concerned to be directly identified” (Bill 64, 2021, §110). As such, while Bill 64 still places significant weight on de-identification, it also recognizes the provenance of de-identified data in a more substantive way.

Internationally, perhaps the most significant invocation of data de-identification has been in the European Union’s “General Data Protection Regulation” (GDPR). When it came into force in 2018, the GDPR placed significant new regulations on those handling data in the EU. As Hintze (2018) has documented, the de-identification provisions in the GDPR are numerous and varied. In some cases, de-identification of data reduces the required level of scrutiny under the GDPR, while if data is considered completely anonymized it removes the data holder’s obligations under the GDPR altogether (Hintze, 2018).

Within Canada, at the federal level at least, de-identification did not play a significant role in the legislative framework around data and privacy prior to the proposed Bill C-11. In the wake of the GDPR and in the lead-up to the publication of Bill C-11, the Office of the Privacy Commissioner (OPC) commissioned a report on the uses of data de-identification in public policy (Rosner, 2019). The report recommended that de-identification be integrated into Canada’s privacy framework for several reasons, including that it would help Canada fulfil the GDPR’s adequacy requirements. It is notable, however, that while Gilad Rosner’s report to the OPC recommended the use of de-identification, it more specifically recommended a risk-assessment-based model that discourages “release and forget” publication of data; Bill C-11 proposed the use of de-identification without either of these conditions.

We should note that GDPR adequacy is undoubtedly a major issue for Canada. Article 45 of the GDPR gives the European Commission the power to determine whether a non-European country’s privacy and data protection laws are adequate as per EU standards. If Canada achieved GDPR adequacy status, it would allow for the free flow of data between Canada and EU member states (European Commission, 2021; GDPR, 2016). Adopting de-identification would be a straightforward path toward achieving GDPR adequacy, but as we argue, its technical and conceptual limitations make it an undesirable approach from a democratic perspective. GDPR adequacy remains a major issue for Canada’s relationship with the EU and may signal the need for international cooperation on this issue should Canada choose a different approach to data governance.

Technical Analysis and Limitations of De-Identification

Data de-identification is a set of technical procedures as much as it is a concept. Considering de-identification from a technical point of view, much of the literature on de-identification points to the flaws inherent in this process. In these cases, the focus has been on the risk of re-identification in de-identified datasets. Culnane, Rubinstein, and Teague (2017), for example, demonstrated that it was possible to re-identify patients from a de-identified dataset of Australian medical billing records. In another high-profile case, it was demonstrated that an individual's browsing habits could be identified from a publicly released dataset of de-identified German web browsing records (Hern, 2017). This debate has been especially heated in Canada where, in 2014, Ontario’s Information and Privacy Commissioner released a report defending de-identification and casting aspersions on re-identification studies (Cavoukian & Castro, 2014). As Cory Doctorow (2014) has noted however, this report was methodologically flawed and has been thoroughly rebutted (Narayanan & Felten, 2014).

As a possible saving grace for de-identification, there have been some questions raised about the accuracy of re-identification in partial datasets (Barth-Jones, 2012). In effect, if one is not certain that an individual's record is in a dataset to begin with, the accuracy of the re-identification may be called into question. But as Rocher, Hendrickx, and de Montjoye (2019) demonstrated, it is possible to train a model that can assess whether an individual has been correctly re-identified to a high degree of certainty.

Given the strong potential for re-identification from a de-identified dataset, there have been calls for more nuanced policy approaches to privacy protection. Narayanan, Huey, and Felten (2016), for example, call for a “precautionary approach.” This would involve, among other things, a recognition that all datasets can be re-identified, and thus a nuanced approach to data release models depending on the context and sensitivity of the data in question. It bears noting that even this proposed precautionary approach still focuses on taking precautions against individual re-identification. Community and structural risks enabled by aggregated data are not considered in this literature, or in the mobilisation of de-identification in Canadian policy.

Considering this literature, what follows in the remainder of this section is an analysis of the Information and Privacy Commissioner of Ontario (IPCO)’s data de-identification guidelines (Information and Privacy Commissioner of Ontario, 2016). Due to their broad applicability and technical specificity, the IPCO guidelines form the basis for several data de-identification requirements in Canada—including Health Canada’s guidelines for the public release of clinical information. And given that neither Bill C-11 nor Bill 64 specifies a particular standard for de-identification, we have turned to the primary Canadian standard for data de-identification as an example of a standard that would likely satisfy the requirements of both bills.

In many ways, the data de-identification guidelines from the IPCO echo Narayanan, Huey, and Felten’s proposed precautionary approach. Initially designed as a means of practically implementing the de-identification requirements set out in Ontario’s Freedom of Information and Protection of Privacy Act, the IPCO guidelines take a risk-assessment-based approach to data de-identification. This is partially due to their scope, as the guidelines are meant to enable the disclosure of a diverse array of data sets in a diverse set of contexts. As such, they require different levels of diligence depending on the data’s sensitivity and the proposed release model.

Taking re-identification as the primary risk to be mitigated, the IPCO guidelines provide a framework for assigning risk levels based on the kind of data (data risk) and the release model of the data in question (context risk). In both cases, risk levels are quantified, and the document provides equations for calculating risk. Data risk, for example, is calculated based on the risk of re-identification of a given row. A row in this case refers to a specific individual’s data in a dataset, while a column represents a variable. The risk score for a given row is equal to 1 over the size of its equivalence class—the set of rows that share the same identifiers or characteristics in a dataset. The final data risk is then calculated according to the release model: for public and semi-public releases it is the maximum of the row-level risks, while for non-public releases it is the strict average of the row-level risks. As a result, the IPCO guidelines encourage more restricted data release as a primary means of risk reduction. This is largely in line with a precautionary approach, and it acknowledges that de-identification is inherently imperfect.
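
To make the guidelines’ arithmetic concrete, the following sketch (in Python, with entirely hypothetical records and column names) computes the row-level risk described above, 1 divided by the size of a row’s equivalence class, and aggregates it according to the release model: the maximum row risk for public and semi-public releases, and the strict average for non-public releases. It is an illustration of the logic only, not a reproduction of the IPCO’s own tooling.

```python
from collections import Counter

# Hypothetical records: each row is one individual's data, and the
# quasi-identifiers are the fields an attacker could plausibly know.
QUASI_IDENTIFIERS = ("age_range", "postal_prefix", "sex")

rows = [
    {"age_range": "30-39", "postal_prefix": "M5V", "sex": "F", "diagnosis": "A"},
    {"age_range": "30-39", "postal_prefix": "M5V", "sex": "F", "diagnosis": "B"},
    {"age_range": "40-49", "postal_prefix": "K1A", "sex": "M", "diagnosis": "A"},
]

def row_risks(rows, quasi_identifiers=QUASI_IDENTIFIERS):
    """Risk of re-identifying each row: 1 / size of its equivalence class."""
    key = lambda row: tuple(row[q] for q in quasi_identifiers)
    class_sizes = Counter(key(row) for row in rows)
    return [1.0 / class_sizes[key(row)] for row in rows]

def data_risk(rows, release_model):
    """Aggregate per-row risks according to the release model."""
    risks = row_risks(rows)
    if release_model in ("public", "semi-public"):
        return max(risks)           # the worst-case row drives the risk
    return sum(risks) / len(risks)  # non-public: strict average

print(data_risk(rows, "public"))      # 1.0 (the third row is unique)
print(data_risk(rows, "non-public"))  # 0.666...
```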

The IPCO’s calculation for context risk is more nebulous. Context risk for non-public and semi-public data releases is based on the risk of three different kinds of re-identification: deliberate insider attacks, inadvertent recognition of an individual by an acquaintance, and data breach. The document describes each of these in detail and provides formulas for quantifying the risk of each, and the overall context risk is then calculated from the risk of these attacks. Again, any public data release is assumed to have the highest level of risk. As such, this metric prioritises safer release models over stronger de-identification. In short, the easiest way to reduce the risk of re-identification is to give fewer people access, under more controlled conditions.

Finally, data is de-identified through a series of procedures depending on the data. Direct identifiers such as names and addresses are either removed or pseudonymized. Quasi-identifiers, such as ages, are either generalised or suppressed. In both cases, the purpose is to increase the size of the equivalence class. Generalisation removes specificity from data and groups rows together (for example, by creating age ranges rather than including specific ages). Suppression removes entirely those rows that cannot be generalised.

Once data is passed through this process, the overall risk is reassessed using the process above. If, with the larger equivalence classes, the new overall risk is below a set of provided re-identification risk thresholds, the data can then be considered de-identified as per the IPCO guidelines.
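
The transformation step can be sketched in the same spirit: exact ages are generalised into ten-year bands, rows whose quasi-identifier combination remains unique are suppressed, and the resulting maximum row risk is compared against a threshold. The threshold value, field names, and records below are illustrative assumptions; in practice the IPCO guidelines derive thresholds from the data’s sensitivity and the release model.

```python
from collections import Counter

# Illustrative threshold only; the IPCO guidelines derive actual thresholds
# from the data's sensitivity and the release model.
THRESHOLD = 0.2

def generalise(row):
    """Replace an exact age with a ten-year band to enlarge equivalence classes."""
    out = dict(row)
    decade = (row["age"] // 10) * 10
    out["age"] = f"{decade}-{decade + 9}"
    return out

def de_identify(rows, quasi_identifiers=("age", "postal_prefix")):
    """Generalise, suppress still-unique rows, and report the residual maximum row risk."""
    generalised = [generalise(r) for r in rows]
    key = lambda r: tuple(r[q] for q in quasi_identifiers)
    sizes = Counter(key(r) for r in generalised)
    # Suppression: drop any row whose quasi-identifier combination is still unique.
    kept = [r for r in generalised if sizes[key(r)] > 1]
    sizes = Counter(key(r) for r in kept)
    max_risk = max((1.0 / sizes[key(r)] for r in kept), default=0.0)
    return kept, max_risk

rows = [
    {"age": 34, "postal_prefix": "M5V"},
    {"age": 37, "postal_prefix": "M5V"},
    {"age": 62, "postal_prefix": "K1A"},
]

kept, risk = de_identify(rows)
print(len(kept), risk)                  # 2 0.5
print("releasable:", risk <= THRESHOLD) # False: further generalisation would be needed
```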

As we see from this approach, the IPCO guidelines on de-identification acknowledge the high likelihood that data can be re-identified. The guidelines’ focus on risk mitigation rather than absolute guarantees of anonymity reflects the flaws of de-identification demonstrated in the literature on re-identification.

The issue, however, is that by the time this has been translated into policy, as in the cases of bills C-11 and 64, the inherent riskiness of de-identified data is papered over. Both bills treat data in more or less absolute terms; it is either personal data or it is not, and the line between the two is de-identification. There is thus a disconnect between what de-identification procedures claim to do and what the legislation which mobilises them claims of these procedures. De-identification is posited in legislation as a means of legitimising more invasive uses of data without necessitating informed consent. This legitimation is achieved by classifying the de-identified data as non-personal, but doing so obfuscates the necessarily imperfect process involved in getting there.

In some cases, as with both bills C-11 and 64, there is some acknowledgment of the risk of re-identification. C-11 addressed this by prohibiting organisations from re-identifying individuals in a de-identified dataset (C-11, 2020, §75) and by requiring that de-identification procedures be “proportionate” to the sensitivity of the data in question (C-11, 2020, §74). Bill 64 takes largely the same approach, but also specifies fines for attempting to identify an individual from a de-identified dataset. These fines would be between $5,000 and $25,000,000, or up to 4% of a corporation’s worldwide turnover—whichever is greater (Bill 64, 2021, §160). Again, though, this only demonstrates the disconnect between de-identification processes and the legislation which mobilises them. The IPCO guidelines, for example, clearly lay out that not all risk of re-identification is malicious. Rather, one of the IPCO’s three key re-identification vectors is “inadvertent recognition of an individual by an acquaintance” (Information and Privacy Commissioner of Ontario, 2016, p. 16). By taking a punitive approach to preventing re-identification—assuming that the only threat is malicious re-identification—Bill C-11 and Bill 64 fail to grapple with the inherent leakiness of de-identified data. They problematize people rather than the data itself, and they rely on the good faith of data holders to properly manage data in the context of a regulatory carveout designed to enable non-consensual repurposing of data.

This binary approach to the regulation of data has also been identified as an issue by data holders. CANON, the Canadian industry lobby group advocating for data de-identification, has recommended that governments adopt legislation which takes a more nuanced approach to de-identification and re-identification risk. In their recommendations to Innovation, Science and Economic Development Canada (ISED), CANON recommends “that ISED consider the adoption of a spectrum of identifiability rather than the existing black or white approach in which information is either identifiable or non-identifiable—completely in or out of PIPEDA’s ambit—respectively” (Canadian Anonymization Network, 2019b). However, for CANON, this would mean few restrictions on the use of data once it is considered sufficiently de-identified. They write that:

For example, information that poses no serious risk of re-identification could remain outside of PIPEDA, while information with a low risk of re-identification could be covered by PIPEDA, potentially exempted from consent … , but subject to other fair information principles as appropriate, including accountability, safeguarding and transparency (Canadian Anonymization Network, 2019b).

CANON’s proposed approach again presumes the infallibility of some forms of de-identification, so much so that it proposes that some uses of data should be entirely unregulated. CANON’s approach poses clear benefits to the large-scale data holders that comprise its membership but fails to engage with the inherent flaws in all forms of de-identification.

Overall, while de-identification has its uses in certain areas, it has clear limitations as a general data governance framework. Its appeal lies in promising blanket regulatory guidelines for when and how data can be used. But in practice this is not the case: de-identification techniques involve a high degree of nuance and context specificity, which legislation largely needs to ignore in order to remain useful and enforceable from a governance perspective.

Conceptual Analysis and Limitations of De-Identification

These issues with de-identification as a governance tool are compounded by de-identification’s narrow conception of data harms as occurring solely on an individual scale. In this section, we use a data justice and a data democracy lens to demonstrate the need for a governance framework that considers data’s relationality, and which can conceptualise data harm at a structural level.

On Privacy as an Individual Good

The concept of de-identification problematizes data at an individual scale. This is the case both in its understanding of what needs to be protected—personal information—and in what is the highest form of infringement or risk—being re-identified.

When considering propositions to de-identify personal data, one may be tempted to feel reassured that their own personal information will not be made public. But this rests on two assumptions: that individuals are the sole owners and masters of their data, and that what must be avoided is the circulation of one’s name or photo alongside ‘sensitive’ information. This is clearly demonstrated in the models of data and data harm outlined in bills C-11 and 64, as they position de-identification as an alternative to consent. This model assumes that harm is primarily caused through the use of identifying data without explicit consent. But as Wendy Chun outlines in “Big Data as Drama” (2016), this reasoning is flawed. Chun emphasises that computation works in networks and aggregates. To illustrate this, she brings forth the idea of neighbourhoods. As she explains, entities such as Netflix or Amazon “mine our data not simply to identify who we are (this, given our cookies and our tendency to customise our machines is very easy), but to identify us in relation to others ‘like us’” (2016, p. 370). It is the profile we represent online—and particularly its relation to other profiles (neighbourhoods)—that produces data which can be used to predict, speculate, and feed large-scale models. It is only through these relations that data derives value in the context of a big data economy.
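
Chun’s point can be made concrete with a toy example. In the sketch below, viewing profiles carry no names, only pseudonymous identifiers, yet placing a profile in a ‘neighbourhood’ of similar profiles is enough to drive prediction. The profiles, identifiers, and similarity measure (cosine similarity) are our own illustrative assumptions, not drawn from any platform’s actual system.

```python
from math import sqrt

# Invented, pseudonymous viewing profiles: item -> 1 if watched, 0 otherwise.
profiles = {
    "user_a": {"doc_1": 1, "doc_2": 1, "film_9": 0},
    "user_b": {"doc_1": 1, "doc_2": 1, "film_9": 1},
    "user_c": {"doc_1": 0, "doc_2": 0, "film_9": 1},
}

def cosine(p, q):
    """Cosine similarity between two sparse profiles."""
    items = set(p) | set(q)
    dot = sum(p.get(i, 0) * q.get(i, 0) for i in items)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def recommend(target, profiles):
    """Suggest items the target's most similar 'neighbour' has seen but they have not."""
    others = {k: v for k, v in profiles.items() if k != target}
    neighbour = max(others, key=lambda k: cosine(profiles[target], others[k]))
    unseen = [i for i, seen in profiles[neighbour].items()
              if seen and not profiles[target].get(i)]
    return neighbour, unseen

# No name or address is needed to predict what "user_a" is likely to watch next.
print(recommend("user_a", profiles))  # ('user_b', ['film_9'])
```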

Under this neighbourhood model, there is a relational and temporal component to privacy; one is impacted as part of a neighbourhood and as a future subject of algorithmic products. Following Chun’s concept of neighbourhoods, the networks which constitute big data are public, collective, and relational by nature; privacy protection needs to follow this model and scale. In line with Chun, Salomé Viljoen also argues in “Data as Property?” that “Data production places individuals in population-based relations with one another; the social effects that result cannot be adequately reduced to individualistic concerns nor can they be addressed via individual-centric institutions” (Viljoen, 2020a, 2020b).

Accordingly, Linnet Taylor prompts us to think of privacy at a group level, as these collectives or neighbourhoods map onto existing structures of social power. She proposes that we think of privacy not in terms of owning our personal information, but as balancing the tension between visibility and invisibility (2017). Taylor develops this notion to describe the relationship between citizens and the State. By proposing to balance what data from a group should be visible to the State, Taylor reiterates that visibility or transparency is not, on its own, the locus through which power is enforced. As Taylor outlines, holding this tension in data and privacy policy requires “A more detailed framing of the needs for both visibility and informational privacy [that] should take into account the work being done on privacy at the social margins, the risks to group privacy through collective profiling and the extent to which data may be considered a public good” (2017, p. 9). As such, the shift which Taylor proposes in our understandings of data and privacy is in accordance with the principles of data justice and data democracy discussed in the following sections.

Furthermore, Taylor’s (in)visibility approach—balancing both negative and positive potential abuses—engages with the use of data and the ways it continues to circulate and produce value once collected. If de-identified data can be used without people’s consent or knowledge, not only are certain groups placed at greater risk of being hyper-visible (Gangadharan, 2012), but other groups may be erased from datasets altogether (D’Ignazio & Klein, 2020). Taylor’s proposed (in)visibility approach to data attends to privacy concerns and addresses well-documented limitations of data regulations. Whereas de-identification conceptualises visibility as identification, Taylor contends that what is visible is what can be ordered or placed in a rhythmic pattern (see Carmi, 2020). What is invisible is that which is out of reach of, or illegible to, a given power structure. Taylor reminds us that when considering data at population scale, there is value and risk at both poles. For data governance frameworks to attend to collective harm, justice, and privacy, they must attend to both of these poles—even when they are in tension with each other.

Viljoen (2020a, 2020b) groups data regulation into two categories: propertarian and dignitarian. These categories are distinguished by the assumptions they make about the nature of data, as well as the type of risk or harm calling for redress. She argues that the ‘propertarian’ approach understands data as labour or property that is unfairly distributed, while the ‘dignitarian’ approach understands data as a matter of individual rights that must be claimed and protected (2020a, 2020b). Viljoen demonstrates that both fail to account for the relational, collective, and structural scale of data, as both are rooted in individual rights. In light of Chun’s concept of neighbourhoods and Taylor’s proposed (in)visibility approach, de-identification’s assumptions about data—that it is property and that it is an individual right—are incorrect. These assumptions need to be revisited to address the harms and risks produced by the relational and structural nature of data.

On Structural Harm: What Does De-identified Data Build and for Whom?

At a higher level, de-identification and privacy policies at large aim to protect the Canadian public. There is, however, a disconnect between the harms that have been documented and what policies relying on de-identification attempt to resolve.

The ways in which data harms and algorithmic discrimination are imbricated in broader socio-political systems of power are now well documented. Safiya Noble’s (2018), Ruha Benjamin’s (2019), and Virginia Eubanks’ (2018) work focuses on making visible and documenting the ways in which data-driven technologies re-inscribe systems of oppression such as race and class while purporting to be neutral. Scholars such as Joanna Redden, Jessica Brand, and Vanesa Terzieva (2020), and Lina Dencik et al. (2019), have compiled records of these harms, documenting how they exist at the structural, collective, and systemic scale rather than at the individual scale. We can think here of facial recognition, search engine optimization, and migrant surveillance as examples of structural harms which our current approaches to data governance—including de-identification—cannot address.

One way to prevent such harms is to ask the question: What does de-identified data serve to build and for whom? The answer points to the nuances of both which subject positions benefit from these technologies and who has the power to shape and build them. Catherine D’Ignazio and Lauren Klein (2020), as well as Sasha Costanza-Chock (2020), have extensively documented the ways in which data-driven technologies can reinforce relations of domination.

As such, structural harm can be the result both of an individual conception of data and of the perceived neutrality of de-identified data. Centring the question of “What does de-identified data serve to build and for whom?” in data governance would allow us to engage in deeper conversations about who is differentially harmed by data, and to critically engage with the fact that these technologies are not neutral and therefore require scrutiny. As Orla Lynskey (2019) argues, thinking of data harm structurally is also a reminder of how data power is mobilised to nudge public policy and can overlap with market power. Importantly, data governance and privacy policies ought to consider both what data—de-identified or not—is being used to create and how that structure reproduces historically created relations of oppression, discrimination, and domination.

This question of who and what benefits from the de-identification of data is especially pertinent in light of lobbying efforts by CANON and Canada’s major telecommunications providers in favour of de-identified data. In this instance, there is a clear overlap of data and market power. Moreover, while the benefits to data holders have been well established, the benefits to consumers and citizens are far less clear. It is imperative that these structural implications, which directly result from allowing the widespread use of de-identified data, be thoroughly considered in the policy-making process.

Alternative Lenses: Data Governance Beyond Privacy

Considering these issues, we contend that adopting a data justice and data democracy lens is a necessary first step towards an equitable and holistic data governance framework for Canada.

With the concept of “data democracy,” Salomé Viljoen (2020a, 2020b) invites us to think of data as a democratic resource for democratic ends. Such a framework accounts for the relational and collective nature of data and the way it maps onto existing structures and relations of power, thereby producing structural harms, and it establishes the legitimate limits within which a government can collect, permit, and make use of data—de-identified or not.

With regards to “data justice,” we draw on the work of Linnet Taylor (2017), who proposes three “pillars” oriented toward human capability as the end goal of data governance. These pillars are (in)visibility, (dis)engagement, and anti-discrimination. (In)visibility, as we have described above, is concerned with maintaining a balance between what is and is not made legible to those with state or market power. (Dis)engagement refers to the right to be part of, and to withdraw oneself from, data collection and usage. Finally, anti-discrimination attends to questions of structural harm and power imbalance. All three pillars account for both the positive and negative potentials of data use.

Overall, data justice and data democracy are two different approaches that share similar commitments. Data democracy is concerned with how we mobilise data as a democratic resource for democratic ends. Data justice, on the other hand, emphasises that many tensions must be held in balance to enable justice through a human capability framework. Taken together, these frameworks enable an approach to data governance which contends with the collective, relational, and structural nature of data, while also working towards social justice.

Data Governance, Power, and the Fuzzy Public-Private Interface

Data’s relational and structural facets, which both data democracy and data justice draw attention to, are particularly important when considering data governance within the current data economy. Citizens provide personal data to their government in exchange for access to protections or services. We can think here of the collection of census data, or of the sharing of clinical data to enable COVID-19 vaccine research. In such instances, citizens willingly and knowingly trust and share personal information with their governments. At the same time, given the current data economy, governments are also in charge of protecting citizens’ privacy and data rights. In many cases, these two duties are at odds with each other.

Salomé Viljoen outlines the tension between these roles by emphasising that governing with a commitment to public welfare “will always require balancing the necessity of collecting important, at times highly personal and consequential, information from citizenry, and the risk of oppression and undue coercion that accompanies any such collection” (2020b, p. 59). Similarly, Taylor highlights the way this exchange is formative and performative of the relationship between citizens and states, providing the example of census-taking. Yet this traditional relationship of governance is complicated by the nature of the data economy. Taylor asks a difficult yet crucial question: “If state population data is soon to be at least partly composed of commercially collected data and updated in real time, and those data can tell the government not only conventional facts about the population but instead almost everything, where does legitimate observation end and illegitimate surveillance begin?” (Taylor, 2017, p. 10).

The distinction between democratic participation and the democratic exercise of power must be given consideration in data governance. As Taylor (2021) highlights, one way to parse these interests is to consider the question of the legitimate collection and use of data. As she elaborates in her paper “Public Actors Without Public Values: Legitimacy, Domination and the Regulation of the Technology Sector,” legitimacy needs to be determined in relation to citizens and the public interest: democratically determined uses of data are non-arbitrary because they engage the people who are impacted.

The complexities of the citizen-government relationship are compounded by the increasingly blurry line between public and private institutions. As Linnet Taylor has demonstrated, governments’ increasing reliance on the private sector for products, consultant expertise, and processing power blurs the line between public and private (Taylor, 2017, 2021). This can compromise a government’s legitimacy in its use of data, especially in cases where decisions are made that are perceived as favouring private interests. In such cases, the relationship of trust between a government and its citizens is eroded.

To be clear, the issue we are referring to here is not the existence of a public-private interface. Public-private data flows and the public use of private expertise are not inherently problematic. Rather, we are cautioning that governments must be accountable first and foremost to their population. Even in cases where no such conflict exists in practice, increasing government reliance on private-sector resources can create the appearance of a conflict—further eroding governments’ legitimacy in relation to their use of data. As Taylor argues, the relationship between states, the corporations they hire, and the recipients of public funding needs to be transparent so that citizens can be part of a conversation that is currently taking place behind closed doors (Taylor, 2021). From this perspective, initiatives such as CANON are problematic insofar as they create the appearance of private-sector actors working against the public interest by pushing an agenda that favours increasingly opaque data governance regimes.

Towards a Relational Data Governance Framework

This changing governance ecosystem necessitates a data governance framework that accounts for the limits of de-identification as well as the relationality of data. The case of network data held by telecommunications service providers (TSPs) and internet service providers (ISPs) illustrates the ramifications of the relational nature of data and how it ought to inform privacy legislation and data governance.

One form of network data that has gained traction in recent years for monitoring network security is NetFlow data. NetFlow data consists of two unidirectional sets of metadata about network traffic (server to client and client to server), which is collected, used, and sold by TSPs for a myriad of purposes (Laman, 2019; Cox, 2021). Importantly, this type of network information can now be automatically collected and structured, and can also inform network automations such as monitoring ‘normal’ network activity, flagging malware and botnet use, and generally minimising risk to the network (Pérez et al., 2017).

For many cybersecurity experts, NetFlow data presents crucial advantages over Deep Packet Inspection (DPI), a technique traditionally used to ensure network security. DPI focuses on content analysis to monitor network activity, whereas NetFlow data represents relationships between nodes on the network. Ultimately, the relationships among actors are often more telling to both authorities and network security experts than the content of their communications. For instance, a client referencing ‘terrorism’ in their content is less trustworthy evidence of terrorist activity than a client computer communicating with a known terrorist. While DPI is still used (or used in conjunction with NetFlow data), NetFlow data is gaining traction as it is seen to be both more efficient at scale and less likely to run afoul of privacy legislation.

NetFlow data consists of five distinct kinds of information: source IP, destination IP, source port, destination port, and protocol (Laman, 2019). Taken together, these pieces of data roughly indicate ‘who is talking to whom, and for how long’. As per Yurcik et al., NetFlow data also contains private information such as “user-identifiable information (user content such as email messages and URLs) and user behaviour (access patterns, application usage) as well as machine/interface addresses such as IP and MAC addresses” (2014, p. 2). As such, ISPs/TSPs have a strong interest in advocating for the codification of de-identification, especially given that it would maximise the utility and profit that can be extracted from the NetFlow data they already possess.
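
The relational character of this data is visible in its very structure. The sketch below models the five-field flow record listed above and performs a trivial aggregation showing which hosts a pool of subscribers reaches; the records and addresses are invented for illustration, and real NetFlow exports typically carry additional fields (timestamps, byte and packet counts) beyond this five-tuple.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowRecord:
    """The five fields discussed above; real exports typically add timestamps and byte counts."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str

# Invented flows: no payload content, only who talked to whom.
flows = [
    FlowRecord("203.0.113.10", "198.51.100.7", 52344, 443, "TCP"),
    FlowRecord("203.0.113.11", "198.51.100.7", 49152, 443, "TCP"),
    FlowRecord("203.0.113.10", "192.0.2.53", 53311, 53, "UDP"),
]

# Even with source addresses dropped ("de-identified"), aggregation still
# reveals population-level patterns: which services a pool of subscribers reaches.
reachability = Counter((f.dst_ip, f.dst_port) for f in flows)
print(reachability.most_common())
# [(('198.51.100.7', 443), 2), (('192.0.2.53', 53), 1)]
```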

The use and processing of NetFlow data exemplifies the tensions between network security, privacy and the broader socio-political concerns surrounding data governance. As we have demonstrated, current privacy and data regimes understand and aim to mitigate harm on an individual level. And at an individual level, NetFlow data poses very little risk. When NetFlow data is aggregated, however, and when this aggregated data is applied to cases outside of network security, this information tells a lot about a given population. Telus, for example, advocates for the use of aggregated NetFlow data as a way of establishing “population patterns” (Telus, 2019). In such cases, an individual’s choice to opt-in or opt-out of having their data collected is inconsequential when the effects of the data collection are enacted on a community or population level.

In the NetFlow data TSPs/ISPs produce and use, we identify three levels of relationality: infrastructural, human, and algorithmic. At the infrastructural level, NetFlow data is the product of clients and servers communicating with each other; it is the information surrounding the packets (where they come from, where they are going). At the human scale, this indicates information about where devices, and the people using them, are physically located. Finally, at the algorithmic scale, compiled and processed sets of NetFlow data are often aggregated and analysed as neighbourhoods. All data functions with these various levels of relationality, which is why individual-scale protection of citizens is flawed: we are all analysed and nudged as groups, and as such, harms and risks also occur at the group level.

Conclusion

As we have demonstrated, de-identification in Canada’s data governance frameworks—as exemplified through bills C-11 and 64—is used to enable the processing, transit, and portability of data without individuals’ consent. This approach draws on models from the healthcare sector and the European Union’s GDPR, as well as on recommendations from industry groups such as CANON in the lead-up to the tabling of these bills. Approaching data governance in this way poses both technical and conceptual challenges. Technically, this approach fails to account for the leakiness of de-identified data; re-identification is a persistent risk, and while steps can be taken to reduce its likelihood, it is impossible to remove the risk altogether. By treating data as a binary—either personal or not—and making de-identification the standard for moving between those two poles, this approach provides insufficient protections because it mis-conceptualises both data and data risk.

Subsequently, we applied both a data justice and a data democracy lens to consider the conceptual limits of de-identification policy. In particular, these frameworks demonstrate that de-identification’s framing of the problem it attempts to solve fails to engage with the relational nature of data. As we have outlined, many risks arise from the current, inadequate problematization of privacy as an individual good, from the failure to recognise structural data harms (what is data building, and for whom), and from the power imbalance in who owns and governs with data. Additionally, we have emphasised the importance of accounting for the fuzziness of the public-private interface, as well as the relational nature of data.

The appeal of data de-identification as a data governance mechanism is that it promises a clear and actionable means of enabling a privacy-preserving data economy. In practice, however, de-identification is neither as clear-cut nor as privacy-preserving as it would appear. Moreover, data de-identification’s narrow focus on risk to individuals means that even in the ideal case, it fails to address structural data harms. As it has been employed in policy and regulation, de-identification carves out space for opaque data practices, which raises significant concerns in the context of a concentrated data economy. As the data justice and data democracy frameworks demonstrate, data’s relational and structural dimensions must be considered when assessing potential data harms. Allowing for opaque practices without considering these dimensions not only means that those harms will not be prevented; it makes them more likely to occur.

Bibliography

Bill C-11: An Act to enact the Consumer Privacy Protection Act and the Personal Information and Data Protection Tribunal Act and to make consequential and related amendments to other Acts, C–11, House of Commons of Canada, Second Session, Forty-third Parliament, 69 Elizabeth II, 2020 (2020).

Barth-Jones, D. (2012). The “Re-Identification” of Governor William Weld’s Medical Information: A Critical Re-Examination of Health Data Identification Risks and Privacy Protections, Then and Now (SSRN Scholarly Paper ID 2076397). Social Science Research Network.

Bell Media. (2021, November 1). Bell Media Launches Bell DSP, a New Ad-Tech Platform for Advertisers. Bell Media.

Benjamin, R. (2019). Race After Technology: Abolitionist Tools for the New Jim Code. Polity.

Boutilier, A. (2022, January 13). Canada’s Privacy Watchdog Probing Health Officials’ Use of Cellphone Location Data. Global News.

Canadian Anonymization Network. (2019a). CANON | Canadian Anonymization Network.

Canadian Anonymization Network. (2019b, October 15). Submission re: ISED’s “Strengthening Privacy for the Digital Age.”

Carmi, E. (2020). Rhythmedia: A Study of Facebook Immune System. Theory, Culture & Society, 37(5), 119–138.

Cavoukian, A., & Castro, D. (2014). Big Data and Innovation, Setting the Record Straight: De-identification Does Work. Information and Privacy Commissioner Ontario, Canada.

Chun, W. H. K. (2016). Big Data as Drama. ELH, 83(2), 363–382.

Costanza-Chock, S. (2020). Design Justice: Community-led Practices to Build the Worlds We Need. The MIT Press.

Cox, J. (2021, August 24). How Data Brokers Sell Access to the Backbone of the Internet. Vice.

Culnane, C., Rubinstein, B. I. P., & Teague, V. (2017). Health Data in an Open World. ArXiv:1712.05627 [Cs].

Dencik, L., Hintz, A., Redden, J., & Treré, E. (2019). Exploring Data Justice: Conceptions, Applications and Directions. Information, Communication & Society, 22(7), 873–881.

D’Ignazio, C., & Klein, L. F. (2020). Data Feminism. The MIT Press.

Doctorow, C. (2014, July 9). Big Data Should Not Be a Faith-based Initiative. Boing Boing.

Environics Analytics. (2020, December 1). EA Partners with Bell. Environics Analytics.

Eubanks, V. (2018). Automating Inequality: How High-tech Tools Profile, Police, and Punish the Poor. St. Martin’s Press.

European Commission. (2021, June 28). Adequacy Decisions. European Commission - European Commission.

Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Pub. L. No. 2016/679, 1 (2016).

Freedom of Information and Protection of Privacy Act, Pub. L. No. R.S.O. 1990, CHAPTER F.31 (2021).

Gangadharan, S. P. (2012). Digital Inclusion and Data Profiling. First Monday, 17(5).

Ghinita, G., Kalnis, P., & Tao, Y. (2011). Anonymous Publication of Sensitive Transactional Data. IEEE Transactions on Knowledge and Data Engineering, 23(2), 161–174.

Health Canada. (2019). Public Release of Clinical Information: Guidance Document.

Hemmadi, M. (2019, April 25). Privacy Commissioner Taking Facebook to Court to Try and Force Privacy Changes. The Logic.

Hern, A. (2017, August 1). “Anonymous” Browsing Data Can Be Easily Exposed, Researchers Reveal. The Guardian.

Hintze, M. (2018). Viewing the GDPR Through a De-identification Lens: A Tool for Compliance, Clarification, and Consistency. International Data Privacy Law, 8(1), 86–101.

Huser, V., & Shmueli-Blumberg, D. (2018). Data Sharing Platforms for De-identified Data from Human Clinical Trials. Clinical Trials, 15(4), 413–423.

Information and Privacy Commissioner of Ontario. (2016). De-identification Guidelines for Structured Data.

Laman, A. (2019). Network Flow Data: A Cornucopia of Value. The Blue Team Summit, SANS Institute.

Bill 64: An Act to Modernize Legislative Provisions as Regards the Protection of Personal Information, 64, Assemblée Nationale du Québec, 42nd Legislature, 1st Session (2021).

Lynskey, O. (2019). Grappling with “Data Power”: Normative Nudges from Data Protection and Privacy. Theoretical Inquiries in Law, 20(1), 189–220.

Narayanan, A., & Felten, E. (2014). No Silver Bullet: De-identification Still Doesn’t Work. Princeton Center for Information Technology.

Narayanan, A., Huey, J., & Felten, E. W. (2016). A Precautionary Approach to Big Data Privacy. In S. Gutwirth, R. Leenes, & P. De Hert (Eds.), Data Protection on the Move (Vol. 24, pp. 357–385). Springer Netherlands.

Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. New York University Press.

Parsons, C. (2022). Standing Committee on Access to Information, Privacy and Ethics: Study on Collection and Use of Mobility Data by the Government of Canada. Citizen Lab.

Pérez, M. G., Celdrán, A. H., Ippoliti, F., Giardina, P. G., Bernini, G., Alaez, R. M., Chirivella-Perez, E., Clemente, F. J. G., Pérez, G. M., Kraja, E., Carrozzo, G., Calero, J. M. A., & Wang, Q. (2017). Dynamic Reconfiguration in 5G Mobile Networks to Proactively Detect and Mitigate Botnets. IEEE Internet Computing, 21(5), 28–36.

Redden, J., Brand, J., & Terzieva, V. (2020, August). Data Harm Record. Data Justice Lab.

Rocher, L., Hendrickx, J. M., & de Montjoye, Y.-A. (2019). Estimating the Success of Re-identifications in Incomplete Datasets Using Generative Models. Nature Communications, 10(1), 3069.

Rosner, G. (2019). De-identification as Public Policy. Office of the Privacy Commissioner of Canada.

Roth, A. (2018, October 19). Privacy Expert Ann Cavoukian Resigns as Adviser to Sidewalk Labs. The Logic.

Roth, A. (2019, June 5). Several Big Tech Critics Urge City of Toronto to Abandon Sidewalk Labs Smart-city Project. The Logic.

Taylor, L. (2017). What Is Data Justice? The Case for Connecting Digital Rights and Freedoms Globally. Big Data & Society, 4(2), 2053951717736335.

Taylor, L. (2021). Public Actors Without Public Values: Legitimacy, Domination and the Regulation of the Technology Sector. Philosophy & Technology.

Telus. (n.d.). Data Insights for Social Good—Data for Good. Telus. Retrieved November 16, 2021

Telus. (2019, November 21). Data Analytics & Opt-Out—Privacy. TELUS.

Trudeau, J. (2021, December 16). Minister of Innovation, Science and Industry Mandate Letter.

Viljoen, S. (2020a, October 16). Data as Property? Phenomenal World.

Viljoen, S. (2020b). Democratic Data: A Relational Theory For Data Governance (SSRN Scholarly Paper ID 3727562). Social Science Research Network.

Yurcik, W., Woolam, C., Khan, L., & Thuraisingham, B. (2014, June). A Software Tool for Multi-Field MultiLevel NetFlows Anonymization.
