Yes, websites track your behavior online. But some go much further than what you’d reasonably expect, using so-called session replays to create a detailed log of everything you do and type on a site. And new research shows that in some cases these movie-like recordings are even storing your passwords.
Bulk data collection is always a privacy red flag. But the Princeton research group that first published findings about session replay scripts has uncovered a troubling series of situations where seemingly well-intentioned safeguards fail, leading to an unacceptable level of exposure.
The investigation started with Mixpanel, a product analytics company that offers a comprehensive user data collection service known as Autotrack. The company admitted in an email to its customers at the beginning of February that the feature had been unintentionally collecting password data, even though Autotrack includes heuristics meant to prevent that very thing. Autotrack isn’t a session replay script, but it collects whole-hog user interaction data so that Mixpanel’s clients can later query it for any information about their users. Mixpanel corrected the password flaw and issued an SDK update, but the Princeton researchers (Steven Englehardt, Günes Acar, and Arvind Narayanan) say they realized that these types of password redaction failures were probably a larger problem.
“It kind of snowballed and I think it’s likely that there are other design patterns out there that are also weakened,” says Englehardt, a web privacy PhD candidate. “We’ve highlighted some, but we could continue to go down this road and find other things again and again just because of the way that these scripts are designed.”
You Shall Not Password
Even after Mixpanel issued fixes for the password retention issue, the Princeton researchers still found situations in which Autotrack recorded passwords. The feature tries to avoid retaining passwords by automatically redacting input fields that have a name or ID that includes the term “pass.” The limitations are obvious: A password field might, say, be named “pwd,” or a site might use a language other than English.
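To make those limitations concrete, here is a minimal Python sketch of a name-based redaction check of the kind described above. It is not Mixpanel's code; the function name `looks_like_password` and the sample field names are hypothetical, chosen only to illustrate how a substring match on "pass" can miss real password fields.

```python
def looks_like_password(field: dict) -> bool:
    """Hypothetical heuristic: treat a field as sensitive only if its
    name or id contains the substring 'pass'."""
    name = (field.get("name") or "").lower()
    field_id = (field.get("id") or "").lower()
    return "pass" in name or "pass" in field_id


# Illustrative form fields (all names are assumptions, not from any real site).
fields = [
    {"name": "password", "id": "login-password"},  # caught by the heuristic
    {"name": "pwd", "id": "pwd"},                  # abbreviated name
    {"name": "kennwort", "id": "kennwort"},        # German for "password"
]

# Fields the heuristic fails to flag, and which would therefore be collected.
missed = [f["name"] for f in fields if not looks_like_password(f)]
print(missed)
```

In this sketch, both the abbreviated field and the non-English field slip past the check, which is the failure mode the researchers describe.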
One prevalent example the group found centers on “Show Password” features—tools offered by many sites and browser extensions that allow users to see the password they’re entering in plaintext so they can catch typos. The researchers discovered that on certain Mixpanel client sites, like testbook.com, the feature confused the password redaction protections. If a user clicked Show Password and then took any other action, like re-obscuring the password or editing it in the text field, Autotrack recorded the password, even if the user decided not to log in and didn’t submit it. This happens when the Show Password feature stores the password in a second invisible field, so Autotrack is collecting it from that second field, which it doesn’t know to classify as sensitive. The researchers found that this problem also came up when users added Show Password browser extensions to the mix, altering website behavior in ways neither the site nor its third-party services control.
“The structure of the rendered webpage is being modified, changing the type of input field from a password field to a regular text field. When this happens, Autotrack loses the ability to identify whether or not a field is being used to enter a password,” Mixpanel said in a statement. “Per our documentation, if a customer is collecting sensitive information in non-password fields, they should explicitly blacklist it for collection.”
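The failure Mixpanel describes can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Autotrack's actual logic: it assumes the collector decides what to redact from the field's type alone, and the `should_redact` function, the field name `secret`, and the `blacklist` parameter are all hypothetical.

```python
def should_redact(field: dict, blacklist: frozenset = frozenset()) -> bool:
    """Hypothetical collector rule: redact password-type fields and any
    field the site owner has explicitly blacklisted by name."""
    return field["type"] == "password" or field["name"] in blacklist


# A password field as it is first rendered (illustrative values).
field = {"name": "secret", "type": "password", "value": "hunter2"}
before = should_redact(field)

# A Show Password toggle or extension flips the field to plain text.
field["type"] = "text"
after = should_redact(field)

# Only an explicit blacklist entry keeps the value out of the collected data.
with_blacklist = should_redact(field, blacklist=frozenset({"secret"}))

print(before, after, with_blacklist)
```

The sketch shows why the protection depends on something outside the collector's control: once the type changes, only a rule the site owner wrote in advance still applies, and a site owner cannot anticipate every extension a visitor might install.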
Mixpanel has also put its entire Autotrack feature “on hold” in recent weeks, making the tool inaccessible to new users while the company “evaluate[s] how to provide seamless, easy integration of Mixpanel in a way that’s transparent and predictable to our customers.” A spokesperson said that the company has realized that some of its customers didn’t understand how much data Autotrack collected, and wanted more control over what information the tool retained. Mixpanel also says it is developing mechanisms to make it easier for customers to review the totality of the data the feature collects, so they can more quickly spot things that don’t belong.
The researchers crawled publicly available data for the Alexa top 50,000 sites, looking at samples of thousands of sites at different tiers of popularity, and found examples of misbehaving session replay scripts at all levels. Not every site that uses session replay scripts will retain sensitive data: some may not record pages where users enter personal data, and others may correctly implement protective blacklists. But the relative ease of finding examples indicates that the issue is widespread.
The researchers don’t believe that any of the analytics firms they’ve studied intend to collect the sensitive data or, unlike some hackers, act with malicious intent. And working with the companies and impacted websites has prompted a number of improvements. But they note that the privacy concerns are diverse and exist on an increasingly massive scale. “We’ve had responses from the vendors, they promise to do more on detecting these kinds of leaks,” says Günes Acar, a postdoctoral researcher at Princeton who studies online tracking. “But these leaks will happen no matter what unless they stop collecting all inputs from fields. I’m not really very optimistic.”
Gotta Catch ‘Em All
In addition to Mixpanel, the researchers looked at examples of accidental password collection involving three other firms that specifically offer session replay services—UserReplay, FullStory, and SessionCam. The researchers continued looking at Show Password features and browser extensions and found a number of situations in which privacy protections break down.
Often this occurs even when the user doesn’t actually use the Show Password feature, simply because entering a password creates that additional invisible field that holds the password in plaintext in case the user wants to Show Password. Many session replay scripts fail to exclude this second password field, as was the case with FullStory and a service called PropellerAds.
“Session replay is a uniquely effective technology that helps businesses fix bugs, offer excellent support, and make their websites easier to use,” a FullStory spokesperson told WIRED. “We believe there is an opportunity to ensure that session replay and privacy concerns are not at odds, and we have a team internally dedicated to this effort. The work that Gunes [Acar] and Steve [Englehardt] are doing at Princeton can only help us get better.”
One particularly odd example the researchers found comes from Capella University’s admission login page, which produced a password leak through the interaction of two third-party tools. When a user typed a password in the password field, an Adobe Analytics ActivityMap script stored the password in a cookie. At the same time, the session replay service UserReplay was set up on the page to record all generated cookies. As a result, UserReplay was inadvertently collecting passwords.
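A toy Python model shows how two tools that each look harmless can combine into a leak like this one. It is only a sketch of the interaction described above, not the behavior of the real Adobe or UserReplay scripts: the cookie name `last_input` and both function names are hypothetical.

```python
# Shared browser state that every script on the page can read and write.
cookie_jar: dict = {}


def analytics_on_input(value: str) -> None:
    """Hypothetical analytics script: remembers the last typed value in a cookie."""
    cookie_jar["last_input"] = value


def replay_capture(field_type: str, value: str) -> dict:
    """Hypothetical replay script: masks password fields, but also
    records a copy of every cookie set on the page."""
    shown = "*" * len(value) if field_type == "password" else value
    return {"field": shown, "cookies": dict(cookie_jar)}


password = "hunter2"
analytics_on_input(password)
recording = replay_capture("password", password)

# The field itself is masked, yet the plaintext rides along in the cookie copy.
leaked = password in recording["cookies"].values()
print(recording["field"], leaked)
```

In this model the replay script handles the password field correctly, and the leak exists only because of what the other script left in shared state.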
“Out of an abundance of caution, we are removing the Adobe Analytics code that sets the cookie and we’re suspending our use of UserReplay,” Capella public affairs vice president Mike Buttry told WIRED. “We take student data protection very seriously.” UserReplay did not immediately respond to a request for comment, but the researchers say that the situation is extremely unusual. Rather than reflecting a common problem, it shows how unpredictable third-party interactions on web pages can be, whether they come from analytics tools or from complete dark-horse services like Show Password extensions.
The password leaks the researchers found aren’t direct exposures, because they leak from one service to another rather than out into the public sphere. But unintentional data exposures do increase the overall risk that data will someday publicly leak or be breached. The more copies of sensitive information that exist, the broader the attack surface, and when data is being collected accidentally it may not be stored properly or have standard protections.
“To me the worst case scenario is nothing happens, nothing changes and this is just the new normal,” Englehardt says. “When we first started talking about session replays people were surprised, but over time that surprise all goes away and this becomes an acceptable risk. I hope that our findings encourage companies to change their practices, not just patching the specific things we point out, but really change the design of the product.”
As long as bulk analytics data and session replays help companies improve their user experiences, optimize their products, and do better marketing, though, radically altering these services will be a tough sell.
Updated 2/26/2018 2:45pm EST to include a response from Capella University.