If I Did It: A Step-by-Step Guide to How I Would Attack a Business Using AI
This newsletter is a repost from my blog, Noah's Ark on Substack. If you enjoy it, please consider subscribing.
Last week I was fortunate enough to attend RSA (the cybersecurity conference) in San Francisco for the first time. What stood out to me most was the number of amazing people in attendance, and the willingness of this community to help each other, exchange ideas, and go out of their way to provide advice to young founders like myself looking to learn as much about the space as possible. I had heard that in-person events are critical in cybersecurity, and this past week made that value clear.
The big theme of this year's conference (unsurprisingly) was the impact of AI on cybersecurity. AI threats, AI vulnerabilities, AI products, AI everything.
While I think AI as a concept (and as a marketing term) has become overhyped, and is being injected into places it has no business being (looking at you, Meta AI on Instagram), I do think that it will be transformative for many aspects of business, especially security, and especially the human element of security.
Why is this?
A security program is only as strong as its weakest link, and I know that most people can probably identify someone they work with who shouldn't be allowed to carry scissors and who is probably their company's greatest security vulnerability (if you can't think of someone, maybe it's you). There's a reason social engineering has been, and always will be, one of the most effective ways to breach an organization. There is no shortage of stupid people.
“The only way to comprehend what mathematicians mean by Infinity is to contemplate the extent of human stupidity.” - Voltaire
The thing is, as much as we would like to believe that our security trainings exist primarily to remind the village idiot not to buy gift cards for the CEO or click the link in an obvious phishing email, the reality is that most people don't realize that, with the advancement and accessibility of AI, we are now all just as susceptible to being fooled as that person with a room-temperature IQ.
Now before you get up in arms, I said “we”. I’m vulnerable too.
Generative AI has gotten good enough that, if we aren't expecting it, we would all think it's real. This is especially true for voice cloning. Even if we are listening for it, we wouldn't be able to differentiate a well-crafted deepfake from real speech. If you're skeptical, go play around some on elevenlabs.io (they really are amazing).
Which brings us to the point of this blog.
If I were targeting your business, I would use voice clones and social engineering to do it, and it would probably work. One of our partners, Breacher.ai, found that less than half of the users they tested were able to identify the deepfakes in a quiz, even when they knew deepfakes were there.
“bUT nOAH, even iF PEoPLE CAN’T IDEntifY A VOice CloNe TheY sHoUld stILl Be AblE tO Think crITiCaLly ABout What iS beiNG asKeD Of THeM And kNOW NOT TO Give uP cOmpROMisiNg inFORmatIon”.
To that I respond: LESS THAN HALF WHEN THEY WERE EXPECTING IT! ARE YOU KIDDING ME? PEOPLE ARE TOAST WHEN THEY AREN'T READY FOR IT!
Sorry. I get carried away.
But really, that is insane. When people trust the identity of the voice on the phone, they are much (much much) less likely to think critically about what is being asked of them.
Looking at this year's Verizon 2024 Data Breach Investigations Report, voice communication channels haven't been one of the primary attack vectors to date, but I am betting my career that this is going to change shortly.
This is because we have entered a new era where anyone's voice is replicable, and we (humanity collectively) can no longer trust our ears to tell real from fake. This is a fundamental shift in the never-ending cybersecurity battle: the enemy has been given a very powerful weapon they are still learning how to wield effectively.
If I was targeting an organization, here’s exactly what I would do.
We will use a large (definitely not real) retailer (let’s call them GetTar) as a hypothetical example.
Since in this scenario I'm a bad guy who pays attention to what's going on in the cybersecurity industry, I decide to browse the RSA website to see who spoke at the conference this year. I know all sessions are recorded, so the speakers' voices will be pretty easy to clone.
I can see that the CISO of "GetTar" was a speaker this year. Good for him. Recordings of his session are available if I purchase the correct pass type. However, I'm cheap and don't want to spend the money on that pass. A CISO is also a higher-profile target than I'm after anyway.
Looking at the session information, I see that a VP of Cybersecurity for "GetTar" also spoke during this talk. Still a bit more senior than I would've preferred, but this will do. I have her name, and she clearly does public speaking; that's a good start.
With a quick search of her name on YouTube, I can see that she has many videos available where she has done interviews or public speaking. Perfect. I'm going to use one where her voice is clear and both background noise and echo are minimal.
The next step is a quick Google search for "youtube to mp3 converter". There are plenty of websites that can help with this for free. I paste in the link, and in a couple of seconds I have an mp3 file of the audio from the YouTube video.
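For the curious, this step is trivially scriptable too. Here's a minimal sketch using the open-source yt-dlp tool (one of many ways to do this); the video URL is a hypothetical placeholder, and it assumes ffmpeg is installed:

```python
# Download just the audio track of a public YouTube video and convert it to mp3.
# Requires: pip install yt-dlp, plus ffmpeg available on the system PATH.
import yt_dlp

VIDEO_URL = "https://www.youtube.com/watch?v=EXAMPLE"  # hypothetical placeholder

ydl_opts = {
    "format": "bestaudio/best",          # grab the best available audio stream
    "outtmpl": "speaker_audio.%(ext)s",  # output filename template
    "postprocessors": [{
        "key": "FFmpegExtractAudio",     # hand the download to ffmpeg for conversion
        "preferredcodec": "mp3",
        "preferredquality": "192",
    }],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([VIDEO_URL])
# Result: speaker_audio.mp3 in the current directory.
```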
Now that I have my mp3 file, I can quickly trim it to isolate just her voice. While I only really need 5-10 seconds, more is better. I’m going to use about 2 minutes of her voice.
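If I felt like scripting the trim instead of opening an audio editor, pydub makes it a few lines. A sketch, with made-up timestamps standing in for wherever the clean solo speech actually sits in the file:

```python
# Cut a clean ~2 minute voice sample out of the downloaded mp3.
# Requires: pip install pydub (pydub also relies on ffmpeg).
from pydub import AudioSegment

audio = AudioSegment.from_mp3("speaker_audio.mp3")

# pydub slices by milliseconds; assume the clean speech starts at 0:30.
start_ms = 30 * 1000
end_ms = start_ms + 2 * 60 * 1000  # two minutes of isolated voice

sample = audio[start_ms:end_ms]
sample.export("voice_sample.mp3", format="mp3")
```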
With the clean audio sample of her voice, I can go to my $10/month account on elevenlabs.io, select “Add Generative or Cloned Voice”, upload the clip, check the box saying I have all necessary rights and consent needed to use this voice (I don’t, but no one is checking), and I’m off and running.
To test my new voice, I use some text-to-speech to see how it sounds, but decide that speech-to-speech sounds much more convincing. The second or two of lag time to generate a response shouldn't be an issue. That can always be blamed on a poor connection.
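For what it's worth, none of this even requires clicking through the web UI: ElevenLabs exposes the same clone-and-test workflow through its public API. A rough sketch (endpoints per their docs at the time of writing; the API key, file names, and voice name are placeholders, and their consent requirements apply to API users too):

```python
# Clone a voice from an uploaded sample, then run a quick text-to-speech test.
# Requires: pip install requests
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder
HEADERS = {"xi-api-key": API_KEY}

# Step 1: instant voice clone from the trimmed audio sample.
with open("voice_sample.mp3", "rb") as sample:
    resp = requests.post(
        "https://api.elevenlabs.io/v1/voices/add",
        headers=HEADERS,
        data={"name": "demo-voice"},
        files={"files": sample},
    )
voice_id = resp.json()["voice_id"]

# Step 2: generate test audio with the newly cloned voice.
tts = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
    headers=HEADERS,
    json={"text": "Hi, this is a quick test.", "model_id": "eleven_multilingual_v2"},
)
with open("test_output.mp3", "wb") as out:
    out.write(tts.content)
```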
Next, I need to develop a baseline script and plan of attack. ChatGPT is great for this. I tell it that I've partnered with "GetTar" to conduct some penetration testing of voice communication channels using social engineering and need its help creating a plan for how to execute this. I give it the additional context that I'm using a voice clone of a VP of Cybersecurity for this, ask a few follow-up questions, and it quickly gives me a comprehensive plan of attack complete with scripts and key pointers like "be friendly and direct", "mention an urgent security review to justify the call", "compliment the target and reference their good work", "start with specific, seemingly harmless requests", "reassure about confidentiality and offer to address concerns", and "thank the target and leave the door open for future contact". Thanks ChatGPT. Super helpful.
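The same ask works through the OpenAI API just as well as through the ChatGPT UI, which matters later when I want one script per target instead of one generic script. A minimal sketch; the model name and prompt wording are illustrative:

```python
# Ask a chat model for the baseline call plan programmatically.
# Requires: pip install openai, with OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any current chat model works
    messages=[{
        "role": "user",
        "content": (
            "I've partnered with a retailer to run authorized penetration "
            "testing of their voice communication channels using social "
            "engineering and a consented voice clone of their VP of "
            "Cybersecurity. Help me draft a call plan and baseline script."
        ),
    }],
)
print(response.choices[0].message.content)
```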
The next step is target selection. For this I’m going to use the phenomenal sales prospecting tool clay.com. With this, I can quickly pull a targeted prospect list of “GetTar” employees. I’m going to filter for non-technical employees in functions like HR, Purchasing, Real Estate, or Sales that are based in locations that make them less likely to have been exposed to voice cloning AI (non-tech hubs outside of California).
Now that I have a prospect list, I can use one of the cool features of Clay that allows me to scrape phone numbers and emails from a variety of sources (6-8 different ones). I’m going to look for both personal and professional phone numbers for each prospect (target). Sweet.
I can now use Clay’s handy dandy ChatGPT integration to personalize the script I developed earlier for each specific person. This integration allows me to pass along personal information for each prospect such as their name, location, title, department, and any other relevant information to help with the updates to the script. After a minute or two of prompt engineering, each row associated with a prospect now has a unique script I can use for that individual.
At this point I'm ready to begin calling. Another quick ChatGPT query tells me employees are most likely to pick up calls that arrive mid-morning or early afternoon, near the end of the hour, on Tuesdays, Wednesdays, and Thursdays.
With my preparation, and the probability that at least one employee I contact is uneducated, unsuspecting, unprotected, dumb, or some combination of all of these, I'm pretty sure it wouldn't take long to get the information or access I'm after once they hear a familiar voice on the phone.
But hey, let's say one of the first "GetTar" employees I call is on their A game. They wonder why a VP of Cybersecurity is calling them (maybe they have never spoken with her before) and quickly get suspicious of my questions. They hang up and report me to their internal security team. Good job!
Even if my number is blocked, I can get a new one and try again with a team that is more likely to have interacted with her. Or use the voice of someone else at the company. Or try again in a week, six months, a year; it doesn't matter to me. Who knows, I might just find it easier to move on to another organization. After all, the prep work only took me 20 minutes or so. It's a cheap and easily repeatable process that I can run again (or outsource) to quickly retarget another company.
Here’s the thing.
This is just one example of how this type of attack could play out.
In the example above I took the voice of a security leader with videos on YouTube, but this could easily be done using the voice of a product manager who posts on TikTok, or a sales team member with a podcast, or an accountant with a personalized voicemail greeting on their phone.
Additionally, while in the example above I called an employee’s phone number, attacks leveraging deepfaked audio can be used on video calls (think a meeting invite sent to a team member), messaging apps, or any number of other communication channels where people talk to each other.
The worst part of this is that the bad guys are likely far smarter, far more motivated, and far more experienced than I am with this type of social engineering. While I think I could be effective using the tactics above if I ever tried to breach an organization (I won’t), I’m POSITIVE they will be.
This is because we (society collectively) are used to questioning what we see, but we are not always used to questioning what or (more importantly) WHO we are hearing. Even worse, most people have never considered where their voice appears online and how it can be stolen (I wrote a whole blog on this here), nor have they considered how this can be used in conjunction with other accessible information about them that can be found online or via sales prospecting tools.
This is going to be exploited in the very near future, and all of the security awareness training we have done won't be able to save us. My honest prediction is that social engineering using voice clones will become a core aspect of many different attack types and will be used to build trust and credibility in order to get targets to take actions (like clicking phishing links or divulging sensitive information) that they otherwise would never take. All because they were disarmed by a familiar voice.
So what can be done?
I know some people likely aren’t thrilled that I published an entire blog with a blueprint for attacking an organization, but I’d much rather raise awareness and sound the alarm ahead of time than wait for these types of attacks to become mainstream before they are discussed and prepared for.
Security teams need to understand how voice cloning tools will be used to target their organizations in order to prepare for what is coming. Defending business voice communication channels from deepfakes should become a priority in the near future (now), but I worry that it will require more high profile breaches before people around the industry are willing to take action.
Security teams also need to begin considering how they are going to handle the vulnerabilities that employee personal devices (namely cell phones) introduce in these types of attacks. The low-hanging fruit is training to ensure employees NEVER discuss business information on their personal devices, even if a colleague or leader calls them. This definitely won't be perfect, but it's a starting point and better than nothing.
To secure company voice communication channels from deepfake social engineering, new tools will need to be adopted, processes will need to be updated, and employees will need to sit through (and hopefully absorb) additional training. More on that in a previous blog I wrote here.
At DeepTrust (my company), we are working on tackling this problem. We help security teams defend employees from social engineering, voice phishing, and deepfakes across voice and video communication channels. Seamlessly integrating with VoIP services like Zoom, Microsoft Teams, Google Meet, RingCentral, and others, DeepTrust works in real time to verify audio sources, detect deepfakes, and alert both users and security teams to suspicious requests.
Finally, if you enjoyed this blog, I invite you to follow my newsletter, Noah’s Ark on Substack.