Most efficient way to batch delete S3 Files
I'd like to be able to batch delete thousands or tens of thousands of files at a time on S3. Each file would be anywhere from 1MB to 50MB. Naturally, I don't want the user (or my server) waiting while the files are being deleted. Hence, the questions:
- How does S3 handle file deletion, especially when deleting large numbers of files?
- Is there an efficient way to do this and make AWS do most of the work? By efficient, I mean making the fewest requests to S3, taking the least amount of time, and using the least amount of resources on my servers.
amazon-s3 batch-processing
asked Apr 2 '15 at 4:06
SudoKill
5 Answers
AWS supports bulk deletion of up to 1000 objects per request using the S3 REST API and its various wrappers. This method assumes you know the S3 object keys you want to remove (that is, it's not designed to handle something like a retention policy, files that are over a certain size, etc.).
The S3 REST API can specify up to 1000 files to be deleted in a single request, which is much quicker than making individual requests. Remember, each request is an HTTP (thus TCP) request, so each request carries overhead. You just need to know the objects' keys and create an HTTP request (or use a wrapper in your language of choice). AWS provides great information on this feature and its usage. Just choose the method you're most comfortable with!
I'm assuming your use case involves end users specifying a number of specific files to delete at once, rather than initiating a task such as "purge all objects that refer to picture files" or "purge all files older than a certain date" (which I believe is easy to configure separately in S3).
If so, you'll know the keys that you need to delete. It also means the user will want more real-time feedback about whether their file was deleted successfully or not. References to exact keys are supposed to be very quick, since S3 was designed to scale efficiently despite handling an extremely large amount of data.
If not, you can look into asynchronous API calls. You can read a bit about how they'd work in general from this blog post, or search for how to do it in the language of your choice. This would allow the deletion request to take up its own thread, and the rest of the code can execute without making a user wait. Or, you could offload the request to a queue . . . But both of these options needlessly complicate either your code (asynchronous code can be annoying) or your environment (you'd need a service/daemon/container/server to handle the queue). So I'd avoid this scenario if possible.
Edit: I don't have the reputation to post more than 2 links, but you can see Amazon's comments on request rate and performance here: http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html And the S3 FAQ notes that bulk deletion is the way to go if possible.
answered Apr 2 '15 at 19:27
Ed D'Azzo
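For illustration, a minimal sketch of one such bulk-delete request through the AWS CLI's s3api wrapper (the bucket name, key names, and the batch.json filename are placeholders, not from the question): first, a batch.json listing at most 1000 keys,
{ "Objects": [ { "Key": "path/to/file-0001.jpg" }, { "Key": "path/to/file-0002.jpg" } ], "Quiet": true }
then a single DeleteObjects call that removes the whole batch in one request:
aws s3api delete-objects --bucket MY_BUCKET_NAME --delete file://batch.json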
The excruciatingly slow option is s3 rm --recursive if you actually like waiting.
Running parallel s3 rm --recursive with differing --include patterns is slightly faster, but a lot of time is still spent waiting, as each process individually fetches the entire key list in order to perform the --include pattern matching locally.
Enter bulk deletion.
I found I was able to get the most speed by deleting 1000 keys at a time using aws s3api delete-objects.
Here's an example:
cat file-of-keys | xargs -P8 -n1000 bash -c 'aws s3api delete-objects --bucket MY_BUCKET_NAME --delete "Objects=[$(printf "Key=%s," "$@")],Quiet=true"' _
- The -P8 option on xargs controls the parallelism. It's eight in this case, meaning 8 instances of 1000 deletions at a time.
- The -n1000 option tells xargs to bundle 1000 keys for each aws s3api delete-objects call.
- Removing ,Quiet=true or changing it to false will spew out server responses.
- Note: There's an easily missed _ at the end of that command line. @VladNikiforov posted an excellent commentary of what it's for in the comments, so I'm going to just link to that.
But how do you get file-of-keys?
If you already have your list of keys, good for you. Job complete.
If not, here's one way I guess:
aws s3 ls "s3://MY_BUCKET_NAME/SOME_SUB_DIR" | sed -nre "s|[0-9-]+ [0-9:]+ +[0-9]+ |SOME_SUB_DIR|p" >file-of-keys
edited Oct 2 '18 at 1:27
answered Jun 22 '18 at 6:38
antak
Great approach, but I found that listing the keys was the bottleneck. This is much faster:
aws s3api list-objects --output text --bucket BUCKET --query 'Contents[].[Key]' | pv -l > BUCKET.keys
And then removing objects (one parallel process was sufficient, since going above that reaches the rate limits for object deletion):
tail -n+0 BUCKET.keys | pv -l | grep -v -e "'" | tr '\n' '\0' | xargs -0 -P1 -n1000 bash -c 'aws s3api delete-objects --bucket BUCKET --delete "Objects=[$(printf "Key=%q," "$@")],Quiet=true"' _
– SEK
Aug 13 '18 at 18:09
You probably should also have stressed the importance of the _ at the end :) I missed it, and then it took me quite a while to understand why the first element gets skipped. The point is that bash -c passes all arguments as positional parameters starting with $0, while "$@" only expands parameters starting with $1. So the underscore dummy is needed to fill the position of $0.
– Vlad Nikiforov
Oct 1 '18 at 12:42
@VladNikiforov Cheers, edited.
– antak
Oct 2 '18 at 1:30
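To see Vlad's point in isolation, a tiny sketch you can run in any shell (the words one two three are arbitrary placeholders):
bash -c 'printf "%s\n" "$@"' _ one two three   # prints one, two, three -- the _ fills $0
bash -c 'printf "%s\n" "$@"' one two three     # prints only two and three -- "one" was consumed as $0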
One problem I've found with this approach (either from antak or Vlad) is that it's not easily resumable if there's an error. If you are deleting a lot of keys (10M in my case) you may hit a network error or a throttling error that breaks this. So to improve this, I've used split -l 1000 to split my keys file into 1000-key batches. Now for each file I can issue the delete command and then delete the file. If anything goes wrong, I can continue.
– joelittlejohn
Apr 3 at 12:32
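A rough sketch of that resumable variant, reusing the delete command from the answer above (the bucket name and the batch- prefix are placeholders; assumes keys contain no whitespace):
split -l 1000 file-of-keys batch-
for f in batch-*; do
  # delete one 1000-key chunk; only discard the chunk file if the call succeeded,
  # so an interrupted run can simply be restarted
  xargs -n1000 bash -c 'aws s3api delete-objects --bucket MY_BUCKET_NAME --delete "Objects=[$(printf "Key=%s," "$@")],Quiet=true"' _ < "$f" && rm -- "$f"
done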
I was frustrated by the performance of the web console for this task. I found that the AWS CLI command does this well. For example:
aws s3 rm --recursive s3://my-bucket-name/huge-directory-full-of-files
For a large file hierarchy, this may take a considerable amount of time. You can set this running in a tmux or screen session and check back later.
answered Aug 9 '17 at 19:01
dannyman
It looks like the aws s3 rm --recursive command deletes files individually. Although faster than the web console, when deleting lots of files it could be much faster if it deleted in bulk.
– Brandon
Feb 22 '18 at 4:35
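For reference, one way to follow the tmux suggestion above and leave the recursive delete running unattended (the session name is a placeholder; the bucket path comes from the answer's example):
# start the long-running delete in a detached tmux session
tmux new-session -d -s s3-purge 'aws s3 rm --recursive s3://my-bucket-name/huge-directory-full-of-files'
# reattach later to check on progress
tmux attach -t s3-purge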
A neat trick is using lifecycle rules to handle the deletion for you. You can queue up a rule to delete the prefix or objects that you want, and Amazon will just take care of the deletion.
https://docs.aws.amazon.com/AmazonS3/latest/user-guide/create-lifecycle.html
answered Apr 9 at 20:59
cam8001
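As a sketch of what such a rule can look like when applied from the CLI (the bucket name, prefix, and rule ID are placeholders; note that S3 applies expiration asynchronously, typically within about a day, so this trades immediacy for having AWS do all the work):
lifecycle.json:
{
  "Rules": [
    {
      "ID": "purge-old-uploads",
      "Filter": { "Prefix": "uploads-to-delete/" },
      "Status": "Enabled",
      "Expiration": { "Days": 1 }
    }
  ]
}
aws s3api put-bucket-lifecycle-configuration --bucket MY_BUCKET_NAME --lifecycle-configuration file://lifecycle.json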
Without knowing how you're managing the S3 buckets, this may or may not be particularly useful.
The AWS CLI tools have a "sync" command which can be particularly effective for ensuring S3 has the correct objects. If you, or your users, are managing S3 from a local filesystem, you may be able to save a ton of work determining which objects need to be deleted by using the CLI tools.
http://docs.aws.amazon.com/cli/latest/reference/s3/sync.html
answered Apr 2 '15 at 19:42
Bill B
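A minimal sketch of that approach (the local directory, bucket name, and prefix are placeholders): the --delete flag removes objects under the prefix that no longer exist locally, and --dryrun previews the changes first.
# preview which objects sync would delete to make S3 match the local directory
aws s3 sync ./local-dir s3://MY_BUCKET_NAME/some-prefix --delete --dryrun
# then run it for real
aws s3 sync ./local-dir s3://MY_BUCKET_NAME/some-prefix --delete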