At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its 'Mood' discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.
5月7日凌晨1点,网站PatientsLikeMe.com注意到其"心情"讨论板上有可疑活动。在这个讨论板上,人们交流非常私密的情绪失调经历,内容从躁郁症到自残欲望,可谓五花八门。
It was a break-in. A new member of the site, using sophisticated software, was 'scraping,' or copying, every single message off PatientsLikeMe's private online forums.
这时出现了一位不速之客。一位新成员用复杂的计算机软件"搜集"--也就是复制PatientsLikeMe私人在线论坛上的每一条信息。
PatientsLikeMe managed to block and identify the intruder: Nielsen Co., the privately held New York media-research firm. Nielsen monitors online 'buzz' for clients, including major drug makers, which buy data gleaned from the Web to get insight from consumers about their products, Nielsen says.
PatientsLikeMe设法阻拦了这名入侵者并确认了其身份:尼尔森公司(Nielsen Co.),一家私人持股的纽约媒体搜索公司。尼尔森公司说,他们为包括大型制药企业在内的客户监听在线交流内容,这些客户购买从网上搜集的数据,以了解消费者对其产品的看法。
'I felt totally violated,' says Bilal Ahmed, a 33-year-old resident of Sydney, Australia, who used PatientsLikeMe to connect with other people suffering from depression. He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.
家住澳大利亚悉尼市、今年33岁的艾哈迈德(Bilal Ahmed)说,我感觉自己的隐私受到了严重侵犯。艾哈迈德通过PatientsLikeMe同其他受抑郁症困扰的人进行联络。他在留言板上用的是网名,但是他的PatientsLikeMe账户数据与他的博客相连,那上面有他的真名。
After PatientsLikeMe told users about the break-in, Mr. Ahmed deleted all his posts, plus a list of drugs he uses. 'It was very disturbing to know that your information is being sold,' he says. Nielsen says it no longer scrapes data from private message boards.
PatientsLikeMe将网站被入侵的事告知用户后,艾哈迈德删掉了他所有的贴子以及他使用的药品的名单。他说,知道自己的信息被出售是件很烦心的事。尼尔森公司说,该公司将不再从私人留言板上搜集信息。
The market for personal data about Internet users is booming, and in the vanguard is the practice of 'scraping.' Firms offer to harvest online conversations and collect personal details from social-networking sites, resume sites and online forums where people might discuss their lives.
互联网用户个人信息市场正在蓬勃发展,而发展最快的领域之一便是"信息搜集"。许多企业都提供获取在线聊天内容的服务,以及从人们可能谈论其生活的社交网站、简历网站和在线论坛上搜集个人详细信息的服务。
The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.
网络信息搜集这种新兴业务为迅速扩张的数据经济提供了良好的支撑。根据纽约管理咨询公司Winterberry Group公布的数据,2009年,营销人员购买在线数据和离线数据花费的金额为78亿美元。2012年购买在线数据花费的金额将是2009年的两倍多,从4.10亿美元增长到8.40亿美元。
The Wall Street Journal's examination of scraping -- a trade that involves personal information as well as many other types of data -- is part of the newspaper's investigation into the business of tracking people's activities online and selling details about their behavior and personal interests.
《华尔街日报》(The Wall Street Journal)调查了跟踪人们在线活动并出售有关其行为和个人兴趣详细信息的业务,信息搜集──该业务涉及个人信息,也涉及其它多类数据--是该项调查的一个部分。
Some companies collect personal information for detailed background reports on individuals, such as email addresses, cell-phone numbers, photographs and posts on social-network sites.
有些公司收集个人信息以做出关于个人详细背景的报告,这些信息包括社交网站上的电邮地址、手机号码、照片和贴子等。
Others offer what are known as listening services, which monitor in real time hundreds or thousands of news sources, blogs and websites to see what people are saying about specific products or topics.
还有些公司提供的是侦听服务,这种服务可以实时监听成千上万的新闻来源、博客和网站,以了解人们对特定产品或问题的意见。
One such service is offered by Dow Jones & Co., publisher of the Journal. Dow Jones collects data from the Web, which may include personal information, that help determine how corporate clients are covered in news articles, blogs and discussion boards. It says it doesn't gather information from password-protected parts of websites.
《华尔街日报》的出版商道琼斯公司(Dow Jones & Co.)就提供此类服务。道琼斯公司从网络上收集信息--其中可能也包括个人信息,用来帮助确定新闻报导、博客和讨论板上企业客户的覆盖情况。道琼斯公司称,该公司不会从网站上受密码保护的部分搜集信息。
The competition for data is fierce. PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.
数据搜集领域的竞争非常激烈。PatientsLikeMe也出售用户数据。PatientsLikeMe说,该网站出售的数据是匿名的,其中不含姓名信息。
Nielsen spokesman Matt Anchin says the company's reports to its clients include publicly available information gleaned from the Internet, 'so if someone decides to share personally identifiable information, it could be included.'
尼尔森公司的发言人安尚(Matt Anchin)说,该公司向客户提供的报告包括可以从互联网上公开搜集到的信息,因此如果有人决定共享那些可识别其身份的信息,那么这种信息就可能包括在其中。
Internet users often have little recourse if personally identifiable data is scraped: There is no national law requiring data companies to let people remove or change information about themselves, though some firms let users remove their profiles under certain circumstances.
如果可识别身份的个人数据被搜集,互联网用户往往求助无门:美国国内还没有法律要求数据公司让人们删除或变更个人信息,尽管有些公司允许用户在特定情况下删除其账户数据。
California has a special protection for public officials, including politicians, sheriffs and district attorneys. It makes it easier for them to remove their home address and phone numbers from these databases, by filling out a special form stating they fear for their safety.
加利福尼亚州制订了公务员,包括政治家、地方治安官和地方检察官的个人信息保护措施。对他们来说,填写一张特殊表格说明其对自身安全问题的担心,便可以很方便地从这些数据库中删除其家庭住址和电话号码。
Data brokers long have scoured public records, such as real-estate transactions and courthouse documents, for information on individuals. Now, some are adding online information to people's profiles.
一直以来,数据经纪人都在搜寻公共记录以获取个人信息,例如房地产交易记录和法院文件。现在,有些数据经纪人在所搜集的资料中加入了网络信息。
Many scrapers and data brokers argue that if information is available online, it is fair game, no matter how personal.
许多数据搜集公司和数据经纪人认为,如果信息在网络上可以获取到,那么搜集信息就是合理的,不管这种信息有多私密。
'Social networks are becoming the new public records,' says Jim Adler, chief privacy officer of Intelius Inc., a leading paid people-search website. It offers services that include criminal background checks and 'Date Check,' which promises details about a prospective date for $14.95.
一家名为Intelius Inc.的著名付费个人信息搜索网站的首席隐私长阿德勒(Jim Adler)说,社交网站成为了新的公共记录。该网站提供的服务包括犯罪背景调查和"约会对象侦测器",关于一位潜在约会对象的详细信息售价14.95美元。
'This data is out there,' Mr. Adler says. 'If we don't bring it to the consumer's attention, someone else will.'
阿德勒说,这些数据就摆在那里,就算我们不把这些数据提供给客户,别人也会这么做的。
New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people's real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou's people-search website offers records of about 250 million people, primarily in the U.S. and Canada.
位于纽约的PeekYou公司为一种方法申请了专利,这种方法可以将人们的真名与他们在博客、推特(Twitter)和其它社交网站上用的网名相匹配。PeekYou的个人信息搜索网站提供2.5亿人的信息记录,其中主要是美国和加拿大的网民。
PeekYou says it also is starting to work with listening services to help them learn more about the people whose conversations they are monitoring. It says it hands over only demographic information, not names.
PeekYou称,该网站也开始与侦听服务机构合作,以帮助他们更好地了解他们所监听的人。该网站称,它交给客户的只是统计信息,而不是姓名。
Employers, too, are trying to figure out how to use such data to screen job candidates. It's tricky: Employers legally can't discriminate based on gender, race and other factors they may glean from social-media profiles.
雇主们也试图弄清楚如何用这种数据筛选求职者。这很难办:按照法律,雇主们不能根据性别、种族和他们从社交媒体账户数据中搜出的其它因素区别对待求职者。
One company that screens job applicants for employers, InfoCheckUSA LLC in Florida, began offering limited social-networking data -- some of it scraped -- to employers about a year ago. 'It's slowly starting to grow,' says Chris Dugger, national account manager. He says he's particularly interested in things like whether people are 'talking about how they just ripped off their last employer.'
佛罗里达州一家名为InfoCheckUSA的公司为雇主提供筛选求职者的服务,他们在大约一年前开始向雇主提供有限制的社交网站数据--其中一些是该公司搜集得来的。该公司全国客户经理达格尔(Chris Dugger)说,这种业务如今开始缓慢增长。他说,他对人们是否会谈论他们如何欺骗上一任老板之类的话题尤其感兴趣。
Scrapers operate in a legal gray area. Internationally, anti-scraping laws vary. In the U.S., court rulings have been contradictory. 'Scraping is ubiquitous, but questionable,' says Eric Goldman, a law professor at Santa Clara University. 'Everyone does it, but it's not totally clear that anyone is allowed to do it without permission.'
信息搜集机构游走在法律的灰色地带。各个国家的反信息搜集法律五花八门,无一定之规。在美国,不同法院对此类案例的判决互相矛盾。圣克拉拉大学(Santa Clara University)的法学教授戈德曼(Eric Goldman)说,信息搜集普遍存在,但其正当性值得商榷。大家都已经在做了,但是否任何人都可以在不经允许的情况下这么做尚不明确。
Scrapers and listening companies say what they're doing is no different from what any person does when gathering information online -- they just do it on a much larger scale.
信息搜集公司和侦听公司说,他们所做的与搜集在线信息的人所做的没什么不同--只是比后者规模大得多而已。
'We take an incomprehensible amount of information and make it intelligent,' says Chase McMichael, chief executive of Infinigraph, a Palo Alto, Calif., 'listening service' that helps companies understand the likes and dislikes of online customers.
加利福尼亚州帕洛阿尔托(Palo Alto)Infinigraph公司的执行总裁麦克迈克尔(Chase McMichael)说,我们搜集大量信息,并将其加以巧妙运用。这家公司是帮助企业了解在线客户好恶的侦听服务机构。
Scraping services range from dirt cheap to custom-built. Some outfits, such as 80Legs.com in Texas, will scrape a million Web pages for $101. One Utah company, screen-scraper.com, offers do-it-yourself scraping software for free. The top listening services can charge hundreds of thousands of dollars to monitor and analyze Web discussions.
信息搜集服务机构提供各种档次的服务,既有非常便宜的,也有为客户量身定制的。有些公司,例如得克萨斯州的80Legs.com,搜集100万个网页的价格为101美元。犹他州普罗沃(Provo)一家名为screen-scraper.com的公司免费提供自助搜集软件。顶级侦听服务机构监听和分析网络讨论的费用则可能高达几十万美元。
Some scrapers-for-hire don't ask clients many questions.
有些信息搜集公司不会向客户提出许多问题。
'If we don't think they're going to use it for illegal purposes -- they often don't tell us what they're going to use it for -- generally, we'll err on the side of doing it,' says Todd Wilson, owner of screen-scraper.com, a 10-person company in Provo, Utah, that operates out of a two-room office. It is one of at least three firms in a scenic area known locally as 'Happy Valley' that specialize in scraping.
screen-scraper.com公司的老板托德•威尔逊(Todd Wilson)说,如果我们认为他们不会将信息用于非法用途--他们通常是不会告诉我们这些信息的用途的--那么,我们通常会倾向于接下这单业务。这家公司只有10名员工,在一间有两个房间的办公室里办公。在这个当地称为"快乐谷"的风景区中,至少有三家专门从事信息搜集的公司,screen-scraper.com就是其中之一。
Screen-scraper charges between $1,500 and $10,000 for most jobs. The company says it's often hired to conduct 'business intelligence,' working for companies who want to scrape competitors' websites.
Screen-scraper的多数业务收费在1,500美元至10,000美元之间。该公司称,他们经常受雇进行商业情报工作,为那些希望搜集竞争者网站信息的公司服务。
One recent assignment: A major insurance company wanted to scrape the names of agents working for competitors. Why? 'We don't know,' says Scott Wilson, the owner's brother and vice president of sales. Another job: attempting to scrape Facebook for a multi-level marketing company that wanted email addresses of users who 'like' the firm's page -- as well as their friends -- so they all could be pitched products.
最近该公司接下的一笔业务是:一家大型保险公司希望搜集为其竞争者工作的代理商的名录。为什么?老板托德•威尔逊的弟弟兼销售副总裁斯科特•威尔逊(Scott Wilson)说,我们不知道。他们的另一单业务是:为一家多层次营销公司搜集Facebook上的信息,这家公司想要"喜欢"该公司网页的用户的电邮地址--以及他们朋友的电邮地址--这样他们就能有针对性地进行产品推销。
Scraping often is a cat-and-mouse game between websites, which try to protect their data, and the scrapers, who try to outfox their defenses. Scraping itself isn't difficult: Nearly any talented computer programmer can do it. But penetrating a site's defenses can be tough.
信息搜集往往是网站和信息搜集公司之间猫捉耗子的游戏,前者努力保护数据,而后者则努力击破他们的防火墙。信息搜集本身并不难:几乎任何有能力的计算机程序员都做得到。但穿透网站的防火墙却可能很难。
One defense familiar to most Internet users involves 'captchas,' the squiggly letters that many websites require people to type to prove they're human and not a scraping robot. Scrapers sometimes fight back with software that deciphers captchas.
多数网络用户都很熟悉的一个防火墙就是"验证码"(captchas),许多网站都要求人们键入这种歪歪扭扭的字符,以证明他们是人类,而不是信息搜集机器。信息搜集公司有时用能破译验证码的软件予以还击。
Some professional scrapers stage blitzkrieg raids, mounting around a dozen simultaneous attacks on a website to grab as much data as quickly as possible without being detected or crashing the site they're targeting.
有些职业信息搜集者还会上演闪电突袭战,他们对一个网站同时发起十几次攻击,以尽快攫取尽可能多的数据,同时又不致被他们攻击的网站查到或使这家网站崩溃。
Raids like these are on the rise. 'Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping,' says Marino Zini, managing director of Sentor Anti Scraping System. The company's Stockholm team blocks scrapers on behalf of website clients.
这种袭击有愈演愈烈之势。斯德哥尔摩的Sentor Anti Scraping System公司为网站客户提供拦截信息搜集行为的服务,该公司总经理齐尼(Marino Zini)说,我们以前通常每月为客户拦截1,000至2,000次信息搜集,而现在这一数字是原来的3倍,有些情况下甚至是10倍。
Julia Angwin / Steve Stecklow