Problem
I am having severe performance problems with a Web API helper method: throughput is dreadful at 1-2 connections per second, and 5 hours into a run of roughly 40k API calls it is still going.
I set the proxy property of the request to null as suggested in other answers on SE, but it doesn't seem to help much. I have a feeling the issue might be that I am re-creating the HttpWebRequest on every iteration of the loop, but I am not sure, so I am posting here for some guidance and review.
The two methods are shown below: the first contains the loop, and the second is my DataFetcher helper that performs so slowly.
private async Task<List<String>> FetchDocumentsAndBuildList(string brand)
{
    using (var client = new DocumentClient(new Uri(cosmosDBEndpointUrl), cosmosDBPrimaryKey))
    {
        List<string> formattedList = new List<string>();
        FeedOptions queryOptions = new FeedOptions
        {
            MaxItemCount = -1,
            PartitionKey = new PartitionKey(brand)
        };
        var query = client.CreateDocumentQuery<Document>(UriFactory.CreateDocumentCollectionUri(cosmosDBName, cosmosDBCollectionNameRawData), $"SELECT * from c where c.brand = '{brand}'", queryOptions).AsDocumentQuery();
        while (query.HasMoreResults)
        {
            foreach (Document singleDocument in await query.ExecuteNextAsync<Document>())
            {
                JObject originalData = singleDocument.GetPropertyValue<JObject>("BasicData");
                if (originalData != null)
                {
                    var artNo = originalData.GetValue("artno");
                    if (artNo != null)
                    {
                        string strArtNo = artNo.ToString();
                        string productNumber = strArtNo.Substring(0, 7);
                        string colorNumber = strArtNo.Substring(7, 3);
                        string XXXYYYUrl = $"https://www.xyz.com/{strArtNo}/en";
                        string XXXApiUrl = $"https://www.xyz.com/..."; // inventory API URL truncated in the original post
                        string HttpFetchMethod = "GET";
                        JObject detailedDataResponse = await DataFetcher(XXXYYYUrl, HttpFetchMethod);
                        JObject inventoryData = await DataFetcher(XXXApiUrl, HttpFetchMethod);
                        if (detailedDataResponse != null)
                        {
                            JObject productList = (JObject)detailedDataResponse["product"];
                            if (productList != null)
                            {
                                var selectedIndex = productList["articlesList"].Select((x, index) => new { code = x.Value<string>("code"), Node = x, Index = index })
                                    .Single(x => x.code == strArtNo)
                                    .Index;
                                detailedDataResponse = (JObject)productList["articlesList"][selectedIndex];
                            }
                        }
                        singleDocument.SetPropertyValue("DetailedData", detailedDataResponse);
                        singleDocument.SetPropertyValue("InventoryData", inventoryData);
                        singleDocument.SetPropertyValue("consumer", "akqa");
                    }
                }
                formattedList.Add(Newtonsoft.Json.JsonConvert.SerializeObject(singleDocument));
            }
        }
        return formattedList;
    }
}
static public async Task<JObject> DataFetcher(string apiUrl, string fetchType)
{
    try
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(apiUrl);
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        request.UserAgent = "Test";
        request.ContentType = "application/json; charset=utf-8";
        request.Method = fetchType;
        request.Proxy = null;
        using (HttpWebResponse response = (HttpWebResponse)await request.GetResponseAsync())
        using (Stream stream = response.GetResponseStream())
        using (StreamReader reader = new StreamReader(stream))
        {
            string apiReturnStr = await reader.ReadToEndAsync();
            JObject apiReturnObjectJobject = JObject.Parse(apiReturnStr);
            return apiReturnObjectJobject;
        }
    }
    catch (WebException e)
    {
        JObject emptyObject = null;
        return emptyObject;
    }
}
— EDIT —
The method is called from an Azure Function every morning at 5 AM. The CosmosDB query completes in less than 10 seconds and returns on average 40,000 documents. I then need to loop through each document and make two web API calls to get additional information, add that to the document, and finally add the modified document to a string list to be used by the CosmosDB bulkImport method.
If I limit the query to 20 documents it finishes extremely fast, just a few seconds; with 100 documents it is still extremely fast. I haven't tested larger steps other than the full set after that.
The document I am fetching is a JSON object 3-4 levels deep, and I am parsing an array on the second level of the document.
— EDIT —
Solution
General Feedback
There is a lot of work going on in the FetchDocumentsAndBuildList method. Consider breaking it up into smaller methods that make the main method easier to read and isolate units of work. One example would be to move the entire contents of the foreach into a method that could be named FetchDocumentAsync.
In general, async method names should end with Async. This means that DataFetcher would be named DataFetcherAsync. See this answer for more information.
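A rough sketch of what that extraction might look like is below; the FetchDocumentAsync name and signature are only illustrative, assuming the helper takes a Document, enriches it, and returns the serialized result:
private static async Task<string> FetchDocumentAsync(Document singleDocument)
{
    JObject originalData = singleDocument.GetPropertyValue<JObject>("BasicData");
    if (originalData != null)
    {
        // ... the body of the current foreach goes here: build the URLs, call the
        // DataFetcher (or DataFetcherAsync, after renaming) helper, and set the
        // DetailedData / InventoryData / consumer properties on the document ...
    }
    // Serialize the (possibly enriched) document exactly as the original loop does
    return Newtonsoft.Json.JsonConvert.SerializeObject(singleDocument);
}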
Performance Feedback
If performance is a real concern then you should consider profiling the application to see where the bottlenecks are. Even with a low volume of documents you should be able to get a general grasp of where the code spends the most time. It is possible that the performance issues only arise at a high document count, in which case it would be best to profile in a similar scenario.
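For example, a quick and admittedly crude way to get a feel for where the time goes is to time the individual web calls with a Stopwatch (the placement and log message here are only illustrative):
var stopwatch = System.Diagnostics.Stopwatch.StartNew();
JObject detailedDataResponse = await DataFetcher(XXXYYYUrl, HttpFetchMethod);
stopwatch.Stop();
// Log however the Azure Function normally logs; Console is just a placeholder
Console.WriteLine($"DataFetcher took {stopwatch.ElapsedMilliseconds} ms");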
Some potential improvements
- The two HttpWebRequests per document are being run serially (that is, one must finish before the next one begins). Consider starting both DataFetcher calls and then awaiting the results later. This will cause both HTTP requests to be sent, and only when the value is needed will the code wait for a response, theoretically doubling the speed. See this answer for a more detailed explanation.
var detailedDataResponseTask = DataFetcher(XXXYYYUrl, HttpFetchMethod);
var inventoryDataTask = DataFetcher(XXXApiUrl, HttpFetchMethod);
var detailedDataResponse = await detailedDataResponseTask;
if(detailedDataResponse != null)
...
singleDocument.SetPropertyValue("InventoryData", await inventoryDataTask);
- A very small performance improvement can be made by taking the productList code and removing the projection used to get the index of an item.
// Original
var selectedIndex = productList["articlesList"]
.Select((x, index) => new { code = x.Value<string>("code"), Node = x, Index = index })
.Single(x => x.code == strArtNo)
.Index;
detailedDataResponse = (JObject)productList["articlesList"][selectedIndex];
// New
detailedDataResponse = (JObject)productList["articlesList"].Single(x => x.Value<string>("code") == strArtNo);
- The final suggestion I have for performance, which probably needs some more careful planning to avoid having 80,000 tasks in progress (a sketch of one way to cap the number of in-flight tasks follows the example below), is to do the same optimization as before with awaiting the HttpWebRequests. Each document can be gathered and processed independently of every other document. A naive example of how to do this would be:
var tasks = new List<Task<string>>();
foreach (var singleDocument in await query.ExecuteNextAsync<Document>())
    tasks.Add(FetchDocumentAsync(singleDocument));
while (tasks.Count > 0)
{
    Task<string> finishedTask = await Task.WhenAny(tasks);
    tasks.Remove(finishedTask);
    formattedList.Add(await finishedTask);
}
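If the full 40,000-document set creates too many in-flight tasks, one way to cap concurrency (a sketch only, assuming the same hypothetical FetchDocumentAsync helper and System.Threading.SemaphoreSlim) is to gate each fetch behind a semaphore:
var throttle = new SemaphoreSlim(50); // at most 50 documents in flight; tune to taste
var tasks = new List<Task<string>>();
foreach (var singleDocument in await query.ExecuteNextAsync<Document>())
{
    tasks.Add(ProcessThrottledAsync(singleDocument));
}
formattedList.AddRange(await Task.WhenAll(tasks));

async Task<string> ProcessThrottledAsync(Document doc)
{
    await throttle.WaitAsync();   // wait for a free slot
    try
    {
        return await FetchDocumentAsync(doc);
    }
    finally
    {
        throttle.Release();       // free the slot even if the fetch throws
    }
}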