Last weekend, I turned to Google Search for help figuring out how many stamps I needed to put on an 8-ounce piece of mail. (Naturally, I was sending a copy of the latest issue of WIRED!) It’s the exact sort of question I hoped Google Search’s new generative AI feature, which I’ve been testing for the past month, would answer much faster than I could through my own browsing.
Google’s clunkily named Search Generative Experience, SGE for short, infuses its search box with ChatGPT-like conversational functionality. (You can sign up to try it at Google’s Search Labs.) The company says it wants users to converse with its search chatbot, which launched to testers in May, to dive deeper into topics and ask more challenging and intuitive questions than they would type into a boring old query box. And AI-generated answers are meant to organize information more clearly than a traditional search results page, for example by pulling together information from multiple websites. Most of the world’s web searches run through Google, and the company has been developing AI technology longer than most, so it’s fair to expect a top-notch experience.
So goes the theory. In practice, the new feature has so far proved more nuisance than aid: slow, ineffective, verbose, and cluttered, more artificial interference than artificial intelligence.
Once you gain access to Google’s test, the search box looks unchanged. But in response to a query like “How many stamps to mail 8 ounce letter,” a new section takes up a good chunk of the screen, pushing down the conventional list of links. Within that area, Google’s large language models generate a couple of paragraphs similar to what you might find from ChatGPT or Microsoft’s Bing Chat. Buttons at the bottom lead to a chatbot interface where you can ask follow-up questions.
The first thing I noticed about Google's vision for the future of search was its sluggishness. In tests where I controlled a stopwatch app with one hand and submitted a query with the other, it sometimes took nearly six seconds for Google’s text generator to spit out its answer. The norm was more than three seconds, compared to no more than one second for Google’s conventional results to appear. Things could have been worse: I ran my tests after Google rolled out an update last month that it claims doubled the search bot’s speed. Yet I still often find myself deep into reading the regular results by the time the generative AI finishes up, meaning I end up ignoring its tardily submitted dissertations. Cathy Edwards, a Google Search vice president, tells me speed optimizations of the AI software underpinning the tool are ongoing.
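(For the curious, my stopwatch method translates roughly to the sketch below. It’s my illustration, not Google’s benchmark, and it only times the conventional results: SGE streams its answer into the page, so measuring that faithfully would need a browser-automation tool rather than a bare HTTP request.)

```python
# Time one search query end to end, stopwatch-style.
import time

import requests

query = "How many stamps to mail 8 ounce letter"
start = time.perf_counter()
requests.get(
    "https://www.google.com/search",
    params={"q": query},
    timeout=10,
)
elapsed = time.perf_counter() - start
print(f"conventional results took {elapsed:.2f}s")
```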
One could excuse the slowness of this new form of search if the results were worthwhile. But accuracy is spotty. Google’s five-sentence generative AI response to my stamps question included apparent errors of both multiplication and subtraction, cited stamp prices that were two years out of date, and suggested follow-up questions that ignored crucial variables for shipping costs, such as shape, size, and destination. The disclaimer Google displays at the top of each AI-generated answer rang resoundingly true: “Generative AI is experimental. Info quality may vary.”
In the same response, Google’s new search feature suggested that I would need either $2.47 or $4 worth of stamps. Navigating to the US Postal Service’s online calculator provided the official answer: I needed $3.03, or five stamps at 66 cents each with a 27-cent overpayment. Google’s Edwards says my humble query pushed the technology’s current boundaries. “It's definitely on the frontier,” she says.
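For what it’s worth, the arithmetic the AI fumbled fits in a few lines. A quick sketch using the two figures above, the USPS calculator’s $3.03 and the 66-cent stamp price:

```python
import math

postage_due = 3.03   # the USPS calculator's price for my 8-ounce letter
stamp_price = 0.66   # price of one First-Class Forever stamp

stamps = math.ceil(postage_due / stamp_price)   # round up: 5 stamps
overpay = stamps * stamp_price - postage_due    # 5 * $0.66 - $3.03 = $0.27
print(f"{stamps} stamps, overpaying ${overpay:.2f}")
```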
Unfortunately, dumbing the question down didn’t go well either. When asked for just the price of a stamp, Google responded with an outdated figure. Only when I specified that I wanted the price as of this month did the system correctly reflect this month’s 3-cent price hike. To be fair, ChatGPT would flunk this query too, because its training data cuts off in 2021, but it is not positioned as a replacement for a search engine.
Google's new search experience feels unreliable enough that I’m better off just clicking through standard results to conduct my own research. A query about Star Wars video games developed by gamemaker Electronic Arts generated an accurate list except for the inclusion of one title from EA rival Ubisoft. Ironically, the generative-AI description of the game in the result mentioned it was made by Ubisoft, demonstrating how large language models can contradict themselves.
When asked about players whom the San Diego Padres, who surely will beat Steven’s Phillies to a wild card spot, might try to acquire through a swap with another baseball team, Google’s AI response started with two players already on the Padres, mistaking trade chips for trade targets.
Google has put some protective measures in place. The new search experience does not display for certain health or financial queries, for which Google sets a higher bar for accuracy. And the experience almost always prominently features links to related resources on the web to help users corroborate the AI’s outputs. Results for queries like “Write a poem” carry the disclaimer “You may see inaccurate creative content.” And the AI system generally won’t try to sound too cute or adopt a persona. “We don't think people actually want to talk to Google,” Edwards says, drawing a contrast with Bing Chat, which is known to slip into first-person speech and sprinkle in emoji.
At times, Google’s new vision for search can feel more like a step back than a leap into the future. The generated answers can duplicate other features on the results page, such as featured snippets, which draw a clear and digestible answer from a website, or knowledge boxes, which provide a paragraph-length overview of a topic from Wikipedia. When it belatedly chimes in alongside results like those, the generative AI version tends to be the wordiest and the trickiest to make sense of.
In our 30-minute discussion about my experiences with the new feature, Edwards mentioned at least eight times that it is still early in its development, with plenty of kinks to iron out. “I don't think you're gonna hear me say that we have nailed this,” she says. “We're at the beginning of a 10-year-long arc of transformation.” She also says the feedback to date has been “super positive,” but, perhaps most importantly, that what Google eventually launches to all users “might look quite different to where we are today.”
An experience that is speedier, less crammed with content, and able to help ship WIRED issues to readers without the risk of their being returned for underpaid postage would be nice.
Google’s quest to pithily respond to users’ questions with direct answers began years ago. Back in 2016, then-WIRED writer Cade Metz wrote about how Google assembled about 100 linguistics PhDs fluent in about two dozen languages to condense writing and annotate sentences to help train AI systems to understand how human language works. Google expected the team and the technology to grow for years to come.
These "sentence compression algorithms" just went live on the desktop incarnation of the search engine. They handle a task that's pretty simple for humans but has traditionally been quite difficult for machines. They show how deep learning is advancing the art of natural language understanding, the ability to understand and respond to natural human speech. "You need to use neural networks—or at least that is the only way we have found to do it," Google research product manager David Orr says of the company's sentence compression work.
Google trains these neural networks using data handcrafted by a massive team of PhD linguists it calls Pygmalion. In effect, Google's machines learn how to extract relevant answers from long strings of text by watching humans do it—over and over again. These painstaking efforts show both the power and the limitations of deep learning. To train artificially intelligent systems like this, you need lots and lots of data that's been sifted by human intelligence. That kind of data doesn't come easy—or cheap. And the need for it isn't going away anytime soon.
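To make that recipe concrete, here is a toy sketch of sentence compression framed as supervised keep/drop tagging over human-labeled examples, the general approach described above. It is my illustration, not Google’s system: the architecture, the single hand-labeled sentence, and every name in it are stand-ins.

```python
# Toy sentence compression: learn, from a human-labeled example,
# which words to keep (1) and which to drop (0).
import torch
import torch.nn as nn

sentence = "the film , which opened friday , earned rave reviews".split()
labels = torch.tensor([1., 1., 0., 0., 0., 0., 0., 1., 1., 1.])  # keep/drop

vocab = {w: i for i, w in enumerate(sorted(set(sentence)))}
ids = torch.tensor([vocab[w] for w in sentence])

class Compressor(nn.Module):
    def __init__(self, vocab_size, dim=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * dim, 1)  # per-token keep/drop score

    def forward(self, x):
        h, _ = self.lstm(self.emb(x).unsqueeze(0))  # (1, seq, 2*dim)
        return self.head(h).squeeze()               # per-token logits

model = Compressor(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()
for _ in range(300):  # overfit the single example to show the mechanics
    opt.zero_grad()
    loss_fn(model(ids), labels).backward()
    opt.step()

keep = model(ids).sigmoid() > 0.5
print(" ".join(w for w, k in zip(sentence, keep) if k))
# -> "the film earned rave reviews"
```

A real system would train on many thousands of such labeled pairs, which is exactly why Pygmalion’s hand-sifted data was so expensive to produce.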
But just a year later, Google researchers devised a new approach to training AI that made much of that prep unnecessary and led to the large language models that underlie services such as ChatGPT and the new Google Search. Looking back, I wouldn’t mind the crisp Google Search answer snippets of years past.
Jennifer Phoenix, via Facebook, asks why AI image generators continue to get hands and fingers wrong. “I read it’s because of complexity,” she says, “but I’d think the remedy is more training on those features.”
I’m with you, Jennifer. After reading your question, I tried generating images of “hand with a ring tattoo of setting sun” in a demo version of the AI tool Stable Diffusion. The batch of four results I got back featured disjointed, wobbly fingers and hands with missing digits, unnaturally slender wrists, or giant knuckles. In contrast, the query “face with cheek tattoo of setting sun” did result in some wild images, but at least the faces looked realistic.
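If you want to replicate the experiment outside a web demo, the open source diffusers library will do it. A minimal sketch follows; note my assumptions: the demo I used doesn’t expose its exact model, so the Stable Diffusion v1.5 checkpoint here is a stand-in, and a CUDA GPU is assumed.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

images = pipe(
    "hand with a ring tattoo of setting sun",
    num_images_per_prompt=4,  # a batch of four, as in my test
).images
for i, image in enumerate(images):
    image.save(f"hand_{i}.png")
```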
Pranav Dixit did a deep dive for BuzzFeed News (RIP) earlier this year into the history of hands in art, and wrote that because people’s hands are so often busy, holding cups, for instance, AI systems struggle to recreate them realistically. The New Yorker’s Kyle Chayka also looked at the issue, pointing out that giving AI image generators more precise commands about what hands should be doing can help.
As you say, Jennifer, throwing better or more diverse data at AI systems often results in more accurate outputs. Some users spotted modest improvements in hands in “v5” of Midjourney’s AI generator earlier this year. But Midjourney CEO David Holz tells me by email that the company “didn't do anything specific for hands. Our stuff just works better in v5.”
On the other hand, Stable Diffusion’s developer Stability AI did work specifically on the hands problem while developing its newest version, which was released this week. Joe Penna, Stability’s head of applied machine learning, says poorly generated hands were the top complaint from users. When I tried the new model with my hand-tattoo query, two images turned out well while the other two lacked some knuckles.
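Rerunning my query against the new release takes only a swap of pipeline and checkpoint in the earlier sketch; the model ID below is my assumption based on Stability’s public release, not anything Penna specified.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

images = pipe(
    "hand with a ring tattoo of setting sun",
    num_images_per_prompt=4,
).images
```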
The new model has about eight times its predecessor’s capacity for learning visual patterns to reproduce, which essentially means it can remember more about how hands should look, Penna says. The company also gave it additional training on images of people and artwork, to reflect what users are most interested in. Now, says Penna, “it's remembering things like hands a lot more.”
Adding millions of extra images of hands to the training data actually made generated hands worse, rendering them oversize, Penna says, but the company is testing different tactics to drive further improvement.
Before speaking to Penna, I hypothesized that AI developers might want to stop short of perfection, because imperfect hands are a common way to spot deepfakes. Penna says that wasn’t the case, but that Stability took other steps to make sure it’s obvious when images have been generated with its technology. “We're not going to go back to building worse hands, so let's start being very careful with the images that we see on the internet,” he says.
With bone-structure fails starting to get sorted out, maybe next the companies can take on the fact that all 12 images I generated from my test prompts depicted fair-skinned hands? I’ll leave explaining that to Steven in a future Plaintext.
You can submit questions to mail@wired.com. Write ASK LEVY in the subject line.
Thought it couldn’t get more awful than Mountain Dew Flamin’ Hot soda? Try mustard-flavored Skittles candies, a gimmick for National Mustard Day in the US next week.
Futurama is back! But the first episode only made me laugh once (when a robot comedian called a room full of friends too PC). The show’s all about critiquing our modern tech-centric world. Unfortunately, it seems to be picking on easy targets.
The EU is preparing a massive database of all content moderation decisions by social media companies and the reasoning behind them.
The hottest new data feed in tech? Combat data from Ukraine to train military AI software.
Vigilante justice: A person with impaired vision who was scammed out of a laptop teamed up with a friend to take on the fraudster. The evidence is now with the police.