
Enterprise AI in Production: Luc Vincent, former VP of AI at Meta

October 5, 2023
5 mins

As the founder of an AI tooling startup, I have the opportunity to spend a good portion of my time speaking with the engineers, scientists and business innovators working on cutting-edge AI applications in their respective fields.

From research institutions and small technology startups to Fortune 500 companies and global consulting firms, time and time again the two main questions I'm asked are 1) "How should we leverage AI in our business?" and 2) "How do we get our AI applications into production?"

And despite all the headlines and 'glamour' around AI, the reality is that most teams today struggle with very similar, and quite mundane, challenges when trying to build AI applications.

So I thought, why not share the stories and examples of teams who have succeeded at getting AI applications out — how they overcame the challenges they faced and what they see ahead for the industry.

And who better to kick things off than Luc Vincent, core advisor at Encord and CPTO at Hayden AI? Luc is also the former Head of AI at Meta, previously built the Level 5 autonomous driving division at Lyft, and founded 'Street View' at Google.

Luc shares some of the legendary work he's been involved in throughout his career — from building Lyft's first autonomous vehicle organization, to Google's geo imagery division, and the metaverse at Meta.

______________

Another really interesting project you helped lead at Google is the acquisition of reCAPTCHA in 2009. When I see something like that, I think of Google getting lots of free data labeling for training their models. Was that a factor at all in motivating the acquisition?

The acquisition was originally driven by the Google Books team. The hypothesis was that you could use a human-in-the-loop approach to improve OCR through reCAPTCHA: you'd present a user with two distorted words, one being the control word you already need to get right, the other pulled from a scanned newspaper, typically a crummy-looking word that was very hard for the OCR systems of the time to recognize, which is why you'd ask the user. You'd show that unknown word to more than one user, and once you had five users agreeing on what the word was, you could use that as the label.
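
For intuition, here's a minimal sketch of that consensus-labeling idea in Python. The five-vote threshold comes from Luc's description above; the function name and data structures are hypothetical illustrations, not Google's actual system:

```python
from collections import Counter

# Hypothetical illustration of reCAPTCHA-style consensus labeling:
# keep collecting user transcriptions of an unknown word until some
# transcription has been given by at least `threshold` users.
def consensus_label(transcriptions, threshold=5):
    """Return the agreed-upon label, or None if there is no consensus yet."""
    counts = Counter(t.strip().lower() for t in transcriptions)
    label, votes = counts.most_common(1)[0]
    return label if votes >= threshold else None

# Example: six users transcribe the same scanned word.
answers = ["morning", "morning", "mourning", "morning", "morning", "morning"]
print(consensus_label(answers))  # -> "morning" (5 matching votes)
```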

But then I thought, well, if we can do it for words, we could apply that same human-in-the-loop approach to visual data and other modalities. I was working on the Street View project at the time. The first thing we did was apply reCAPTCHA to house numbers pulled from Street View images. reCAPTCHA users helped us transcribe those and build our models, which in turn helped us transcribe billions of addresses around the world. Those addresses became useful for building maps and geocoding -- because when you search for an address in an application like Google Maps, it can only return a location if it has already seen that address before.

And then from there we expanded reCAPTCHA to the range of images that you now see.

You were at Google when the first deep learning revolution was unfolding, when AlexNet came out. Was it a sudden shift from the older heuristic methods, or was it gradual? Was there pushback -- how did deep learning propagate internally?

There were some people who were believers -- typically the engineers who had been in the field for some time and knew it was something big -- and there were the skeptics. I was part of the broader Geo team, and one thing we were working on and advocating for was building maps automatically from imagery. Imagine: the imagery you collect from Street View or from aerial or satellite data -- that's ground truth, it's stuff you've seen, you've observed. So wouldn't it be awesome to take that and derive knowledge about the world from this imagery, going much beyond house numbers to street names, lane configurations, recognizing businesses on the street and everything else? That was the vision. But it was early days, and we had to train on a huge amount of data.

There were some skeptics -- people who said, "Sure, you can build a system to recognize house numbers, but how about we just ask a user instead?" There were competing forces: people who believed UGC (user-generated content) was the answer for everything, and some, like us, who believed a more nuanced, algorithmically heavy approach was going to win.

You also led the AV division at Lyft. That's one of the major pushbacks autonomous vehicle applications get -- that they're a black box, the safety concerns, and so on. How would you generally respond to people's concerns about having vehicles on the road that are driven by systems which are difficult to explain from a human perspective?

I came to Lyft to found the Lyft Level 5 division. The reason I thought that was useful was, first of all, that I believed a platform like Lyft would let you deploy AVs incrementally, which means without having to solve the massive challenge of the ODD (operational design domain) from the get-go. You could imagine AVs operating in the Lyft app alongside human drivers; initially the AVs would be very limited in their routes, maybe doing one or two routes, and eventually, as their coverage increased, they'd do more.

There was also the data piece. Lyft, like all these transportation services, had many cars on the road, and you could imagine having those vehicles collect data as part of their operations. So we felt tens of thousands or more of Lyft vehicles on the network could be collecting data that could be used for training these systems. Again, that implies an AI system for the self-driving car -- a system that would be AI all the way through.

When I joined Lyft in 2017, there was a lot of excitement around self-driving applications. Many companies were making bold claims, thinking they were almost ready. But what they had solved was perception. For a long time people equated the complexity of self-driving with perception, and once that started to work well in multiple weather conditions -- again thanks to deep learning -- people thought the rest would simply be a matter of tweaking the system. But it turns out that perception is only one piece, and the most complex one is the decision-making logic that you need.

The systems that had been built very early on were very much based on rules, so they were hard to generalize. That's why many of these companies weren't able to launch their services until four years later, and still only at small scale. So AI was the key.

And going back to your question: we envisioned a full AI system to be the 'driver'. Indeed, when you have one of these systems, understanding why it made the decisions it did is very tricky. But the other approach, a rules-based system, is equally hard to understand and debug. So ultimately it'll be about validation, verification, and being able to prove over very large datasets that, statistically, you are safer than with a human driver. But that'll take time. We're seeing good progress from Cruise and Waymo, but we're just at the early days.

And what does the data look like for training those kinds of systems where you have to take an action, like in a reinforcement learning problem? How do you set up a data pipeline that works and that talks to all the other components -- the perception system, the hardware?

People have different approaches. What we tried to do at Lyft Level 5 was come up with a data representation that was bitmap-based, and there's a lot of reinforcement learning involved. But it comes down to what representation you learn on and how you measure performance. The problem is that for any situation there are many good paths, so even if you observe human behavior -- the human may have done a safe thing, but maybe it wasn't the absolute best thing to do -- it leads to very big challenges in how you train on this data. These were some of the things we had to address before eventually being acquired by Toyota.
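
To give a flavor of what a bitmap-based representation can look like, here is a minimal sketch of a bird's-eye-view rasterizer. This is not Lyft Level 5's actual pipeline; the grid size, resolution, channel layout, and function names are illustrative assumptions only:

```python
import numpy as np

# Minimal sketch: rasterize the scene around the ego vehicle into a
# multi-channel bird's-eye-view bitmap that a standard CNN could consume.
# Assumed settings: a 100 m x 100 m area at 0.5 m/pixel, with channels
# for drivable area, other agents, and the ego vehicle.
GRID_M = 100.0            # side length of the rasterized area, in meters
RES = 0.5                 # meters per pixel
SIZE = int(GRID_M / RES)  # 200 x 200 pixels

def world_to_pixel(xy, ego_xy):
    """Map world coordinates (meters) to ego-centered pixel indices."""
    offset = np.asarray(xy) - np.asarray(ego_xy)
    return (offset / RES + SIZE / 2).astype(int)

def rasterize(ego_xy, agent_centers, drivable_pts):
    """Return a (SIZE, SIZE, 3) float32 bitmap of the scene."""
    bev = np.zeros((SIZE, SIZE, 3), dtype=np.float32)

    # Channel 0: drivable-area sample points (in practice, filled polygons).
    for pt in drivable_pts:
        x, y = world_to_pixel(pt, ego_xy)
        if 0 <= x < SIZE and 0 <= y < SIZE:
            bev[y, x, 0] = 1.0

    # Channel 1: other agents, drawn as small squares around their centers.
    for cx, cy in agent_centers:
        x, y = world_to_pixel((cx, cy), ego_xy)
        bev[max(y - 2, 0):y + 2, max(x - 2, 0):x + 2, 1] = 1.0

    # Channel 2: the ego vehicle, always at the center of the grid.
    c = SIZE // 2
    bev[c - 2:c + 2, c - 2:c + 2, 2] = 1.0
    return bev

# Example: one nearby agent and a sparse sample of drivable-area points.
bitmap = rasterize(ego_xy=(10.0, 5.0),
                   agent_centers=[(18.0, 7.0)],
                   drivable_pts=[(10.0 + d, 5.0) for d in range(-20, 20)])
print(bitmap.shape)  # (200, 200, 3), ready to feed into a learned planner
```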

Do you have an opinion on when autonomous vehicles will be more widely used by the public -- say, when 10% of the vehicles on the road are self-driving cars?

I don't know what I'd say the percentage or timeline is specifically, but I think we're still quite far off. It's possible we'll get to a situation where ride-sharing services are mostly robotaxis, plus some human drivers to complement them on routes that are still too hard or unmapped for robots. I don't know that self-driving is especially suited to personal vehicles. I'd say at least five years for commercial applications, and then another five before 10% of some subset of cars on the road are self-driving.

After the division was acquired by Toyota, what were you working on at Meta?

After the acquisition was over, I got approached by Meta and became interested in the space of AR in particular. What struck me was an analogy to self-driving cars, actually. AR glasses are a bit like self-driving cars: they're egocentric sensing platforms, and they're also multimodal. You use the egocentric sensor data to help a user do interesting things. In the case of self-driving cars, you help the user go from A to B. Here, you help users do a bunch of things throughout their day -- connect them, protect them, empower them -- all while they stay present. I thought the challenges were similar, maybe even more complex, but one difference was the safety aspect: you can't really hurt anyone with the glasses.

So I worked on various multimodal AI systems to help understand context, and then help the user in the moment while they stay present. Once mature, the glasses should let you do the things you can do on your phone, to the point where you can remain present in the moment.

The form factor is highly constrained -- you have to cram a huge amount of tech into the glasses, from compute to batteries to RAM to a bunch of sensors and what have you, and do it in a way that still keeps them looking good and light enough. That's one challenge. The other is the data. To train AI models to do things, you need data, and that data does not exist. We need to be able to understand the user's context -- are they inside or outside, are they cooking, running -- and from that context help them with whatever it is they might need in the moment.

But that data doesn't exist, because the glasses don't exist yet, so we need to bootstrap this in many ways. One thing we have done is that our research team started a project called Ego4D -- a consortium of a dozen or more universities around the world collaborating on egocentric, multimodal perception. They're sponsoring data collection, creating various contexts and ways to evaluate algorithms, and the data is annotated by Meta. They've produced a very large amount of data, but it's just camera data, visual data. These glasses have many other sensors, and ideally you use all the sensors you have, from cameras to the IMU, to trigger the experience that you want.

To do that we've created other glasses, Project Aria. They are not a commercial or end-user product; they're a data collection device, crammed with sensors, from cameras to eye-tracking cameras to microphones. We have a bunch of volunteers -- around 1,500 within Meta -- using them for targeted collection tasks. We might tell them we need more videos of people cooking: can you wear the glasses while cooking and record 30 minutes or an hour of data, which is then annotated and used to train models.

The last challenge is how you interact with the glasses. They don't have a keyboard, a mouse, or a touchscreen. So we think about 1) voice -- we use voice assistants, but it may be awkward in some social situations to talk to your glasses; 2) EMG (electromyography), essentially reading the electrical signals produced by your muscles -- we acquired a company called CTRL-labs that is working on this; and 3) the biggest opportunity, which is multimodal: building models that can understand your context. Those models need to be multi-output and multi-task.

It strikes me that one of the themes of your career is finding clever ways of acquiring data -- using reCAPTCHA to get people to label your data, then using the Lyft fleet to collect a lot of visual data, and now using Meta's glasses as another way to get data. So the key is to use smart ways of getting data to train really interesting models.

What do you think about generative AI and ChatGPT and all the things that have come out over the last few months?

It's been exciting to watch and to see what happens. I think OpenAI has certainly opened many people's eyes to the potential of that tech. Is ChatGPT a real product yet? I'm not totally clear. It's really fun and exciting to try.

We're seeing overreaction from many of the big tech players now scrambling to come up with a strategy. It's lit a fire under many companies, big tech as well as startups, that are trying to latch onto the GenAI craze, raise funds, and launch new products. We're seeing a huge amount of hype right now, and it's likely going to die down a bit. Like with every cycle, there's a big jump, then a trough, and then it'll level out.

Written by

Eric Landau
