Featured Product
This Week in Quality Digest Live
Risk Management Features
Gleb Tsipursky
Remodeling the work setup will help teams shift successfully across post-pandemic workplaces
Emily Newton
Consumers want to know that EVs are safe and reliable
Amrou Awaysheh
The world’s pledges so far aren’t enough to stop climate change, U.N. data show
Leo Simonovich
Siemens Energy’s AI-based monitoring and detection platform provides intelligent illumination for IoT cyber defense
Knowledge at Wharton
The global supply chain is affected by the pandemic but also China’s increased consumption of its own products

More Features

Risk Management News
Partnership embeds quality assurance at every stage of the product life cycle, enables agile product introduction
Both quality professionals and their business leaders agree that openness and communication is essential to moving forward
Greenlight Guru and MedCrypt partner to secure medical devices
Address equipment issues before a catastrophic failure occurs
Higher quality contributes to higher efficiency and less downtime
Design, develop, implement, continually improve risk management in systems and software engineering
ISO/IEC/IEEE 16085 has just been updated
Galileo’s Telescope describes how to measure success at the top of the organization, translate down to every level of supervision
NSF-funded project is developing a model to help manufacturers pivot and produce personal protective equipment

More News

Risk Management

Summation and Average Queries: Detecting Trends in Your Data

The importance of upper limits and clipping

Published: Tuesday, May 11, 2021 - 12:03

In our last article, we discussed how to determine how many people drink pumpkin spice lattes in a given time period without learning their identifying information. But say, for example, you would like to know the total amount spent on pumpkin spice lattes this year, or the average price of a pumpkin spice latte since 2010. You’d like to detect these trends in data without being able to learn identifying information about specific customers to protect their privacy. To do this, you can use summation and average queries answered with differential privacy.

In this article, we will move beyond counting queries and dive into answering summation and average queries with differential privacy. Starting with the basics: In SQL, summation and average queries are specified using the SUM and AVG aggregation functions:

SELECT SUM(price) FROM PumpkinSpiceLatteSales WHERE year = 2020
SELECT AVG(price) FROM PumpkinSpiceLatteSales WHERE year > 2010

In Pandas, these queries can be expressed using the sum() and mean() functions, respectively. But how would we run these queries while also guaranteeing differential privacy?

Differentially private summation

In our previous article we introduced the concept of sensitivity—how much a query’s output changes when a row is added or removed in the database—and analyzed the sensitivity of counting queries. When a row is added to the database, the result of a counting query can increase by at most one. With this simple bound on sensitivity, it’s easier to answer counting queries with differential privacy. Computing sums and averages is slightly more involved.

So, how much does a summation query change when a row is added to a database? It depends on the row! If someone spends $1 on a pumpkin spice latte, then the increase in the sum will be $1. If someone spends $10,000, the sum will increase much more.

To achieve differential privacy, we need an upper limit on the largest possible increase we could see when a row is added. For our latte query, that means we need an upper limit on the price of a pumpkin spice latte. This is a big challenge; no matter what limit we set, hypothetically, there may be a cafe somewhere that charges more than the limit.

The solution to this problem is called clipping. The idea is to enforce an upper limit, instead of assuming one. Lattes that cost more than the limit are clipped so that their price is equal to the limit. After clipping, we are guaranteed that all values in our database lie between the lower and upper limits we set. Clipping is a common operation in data analysis; Pandas even has a built-in function for it.

Because clipping enforces guaranteed lower and upper bounds on the data, we can use those bounds to determine sensitivity. If we clip the data so that lattes cost at most $10, then the largest increase we will see in the output of our summation query is $10 when we add a single latte sale to the database. Having bounded the sensitivity, we can add noise to the query’s output using the Laplace mechanism and achieve differential privacy.

The clipping parameters (the upper and lower limits) are extremely important for accuracy. If the upper limit is too high, we’ll add a lot of unnecessary noise. If it’s too low, then we’ll lose information that was present in the data by modifying too many of the data points.

Differentially private averages

Unfortunately, bounding the sensitivity of average queries is even more difficult than it is for summation queries. How much does an average change if we add a row to the database? In addition to the upper limit on the data values themselves, the answer depends on how many things we are averaging. If we’re averaging five numbers, adding one more number might change the average by quite a bit. If we’re averaging 5 million numbers, adding one more probably won’t change the average very much.

As a general rule, however, the sensitivity of a query shouldn’t depend on the data. Otherwise, the sensitivity might itself be sensitive—meaning that it might reveal something about the data. This adds another level of complexity to bounding the sensitivity of averages.

A simple and effective solution for answering an average query using differential privacy is by taking an indirect approach: Split the query into two separate queries, a summation query and a counting query. We will use the approach described above for the summation query, and the approach described in our last article for the counting query. To split our example query, we compute instead the two following queries:

SELECT SUM(price) FROM PumpkinSpiceLatteSales WHERE year > 2010
SELECT COUNT(*) FROM PumpkinSpiceLatteSales WHERE year > 2010

The average is then computed by dividing the result of the first query by the result of the second. Even if we compute each query using differential privacy, how do we know that the final average is differentially private? We don’t actually need to take any extra differential privacy steps; the result will automatically satisfy privacy. This follows from the post-processing property of differential privacy—to extend the list of properties we began in our first article—which promises that combining differentially private results in any way will also be differentially private—and with no extra effort!

Software tools

The practical systems for differential privacy referenced in the last article also provide support for sums and averages. Both Google’s differential privacy library and IBM’s Diffprivlib provide algorithms that compute differentially private bounded sums, leveraging clipping as described above. Both libraries also provide utilities for helping analysts to determine the bounds automatically.

The libraries additionally provide solutions for calculating averages. Google’s library also provides a more sophisticated algorithm, relative to the “sum divided by count” approach we’ve described above, which can sometimes give better accuracy.

Both Google’s differential privacy library and IBM’s Diffprivlib are indexed in NIST’s Privacy Engineering Collaboration Space, where you are invited to share feedback or use cases related to these tools.

Coming up next

So far, we’ve looked at the basics of answering counting, summation, and average queries with differential privacy. In these scenarios, we’ve assumed that you have a single question about a dataset. What if you’re interested in a more sophisticated analysis of your data, with a set of different questions, to learn about various trends from different perspectives? For example, you may want to know the average number of lattes purchased per age group, and also separately per geographic region. In the next article, we’ll welcome a guest author from academia to walk us through considerations and tools available to conduct this type of analysis with differential privacy.

First published Dec. 17, 2020, on NIST’s Cybersecurity Insights blog.

Discuss

About The Authors

David Darais’s picture

David Darais

David Darais is a principal scientist at Galois and supports NIST as a moderator for the Privacy Engineering Collaboration Space.

Joseph Near’s picture

Joseph Near

Joseph Near is an assistant professor of computer science at the University of Vermont who supports NIST as a moderator for the Privacy Engineering Collaboration Space.