{"id":2148,"date":"2022-10-08T12:25:05","date_gmt":"2022-10-08T10:25:05","guid":{"rendered":"https:\/\/akim.sissaoui.com\/en\/?p=2148"},"modified":"2022-10-08T20:01:59","modified_gmt":"2022-10-08T18:01:59","slug":"python-download-zip-and-parse-url-from-text-file","status":"publish","type":"post","link":"https:\/\/akim.sissaoui.com\/en\/informatique\/python-download-zip-and-parse-url-from-text-file\/","title":{"rendered":"Python: Download ZIP and parse URL from text file"},"content":{"rendered":"\n<p>Hey there. It&#8217;s been a while. I&#8217;m back on Home Assistant, and I&#8217;ve been trying some stuff.<\/p>\n\n\n\n<p>Along the way, I tested the Swiss Meteo integration and it was not quite working, so I looked into how I could fix it or write one myself. So far, I&#8217;m nowhere near the end, yet I thought the script below could help someone.<\/p>\n\n\n\n<p>To get the 10-minute updates from Swiss Meteo, I need to download a ZIP file. That ZIP file contains a text file, which in turn holds the URL of the CSV with the 10-minute update data.<\/p>\n\n\n\n<p>Ok&#8230; Bear with me:<\/p>\n\n\n\n<ol><li>Download a ZIP file<\/li><li>Uncompress the text file we need<\/li><li>Parse the URL to the CSV file<\/li><\/ol>\n\n\n\n<p>Let&#8217;s start by importing the required modules:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nfrom io import BytesIO\nfrom zipfile import ZipFile\nimport requests\nimport re\nimport sys\n<\/pre><\/div>\n\n\n<ul><li>BytesIO lets us load a binary file in memory rather than having to write it to disk<\/li><li>zipfile has an obvious name&#8230; It lets us handle&#8230; Yeeaaahhh, ZIP files&#8230;<\/li><li>requests lets us make an HTTPS request, and hence download the ZIP file<\/li><li>re lets us use a regex to extract the URL from the text file<\/li><li>and sys to handle exit codes. 
Even though in this example, I only ever exit with code 0.<\/li><\/ul>\n\n\n\n<p>Next, we define variables. Quite self-explanatory:<\/p>\n\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">url = \"https:\/\/whateverurlyouneed\"\n# Define string to be found in the file name to be extracted\nfilestr = \"anystring\"\n# Define string to be found in URL to parse\nurlstr = \"anyotherstring\"\n# Define regex to extract URL\nregularex = r\"(?i)\\b((?:https?:\/\/|www\\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}\/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\\\".,<>?\u00ab\u00bb\u201c\u201d\u2018\u2019]))\"<\/pre><\/div>\n\n\n\n<p>Now, let&#8217;s download the ZIP file and store the response in a variable called &#8220;content&#8221;:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\ncontent = requests.get(url)\n<\/pre><\/div>\n\n\n<p>We then open the archive in memory:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nzipfile = ZipFile(BytesIO(content.content))\n<\/pre><\/div>\n\n\n<p>From that archive, we open the first file whose name contains the filestr value, and we keep the resulting file object in the variable called data:<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\ndata = &#x5B;zipfile.open(file_name) for file_name in zipfile.namelist() if filestr in file_name]&#x5B;0]\n<\/pre><\/div>\n\n\n<p>Finally, we read the file line by line, and from each line that contains urlstr we extract the URLs matching the regex defined above in the regularex variable. 
Because of the break, this prints every URL found on the first line that contains the urlstr value, then stops.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: plain; title: ; notranslate\" title=\"\">\nfor line in (line for line in data.readlines() if urlstr in line.decode(&quot;latin-1&quot;)):\n    urls = re.findall(regularex, line.decode(&quot;latin-1&quot;))\n    print(&#x5B;url&#x5B;0] for url in urls])\n    break\n<\/pre><\/div>\n\n\n<p>And we exit the script with exit code 0.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">\nsys.exit(0)\n<\/pre><\/div>\n\n\n<p>So the full script looks like:<\/p>\n\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: python; title: ; notranslate\" title=\"\">#!\/usr\/bin\/env python3\nfrom io import BytesIO\nfrom zipfile import ZipFile\nimport requests\nimport re\nimport sys\n# define url value\nurl = \"https:\/\/whateverurlyouneed\"\n# Define string to be found in the file name to be extracted\nfilestr = \"anystring\"\n# Define string to be found in URL\nurlstr = \"anyotherstring\"\n# Define regex to extract URL\nregularex = r\"(?i)\\b((?:https?:\/\/|www\\d{0,3}[.]|[a-z0-9.-]+[.][a-z]{2,4}\/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\\\".,<>?\u00ab\u00bb\u201c\u201d\u2018\u2019]))\"\n# download zip file\ncontent = requests.get(url)\n# Open stream\nzipfile = ZipFile(BytesIO(content.content))\n# Open first file from the ZIP archive containing\n# the filestr string in the name\ndata = [zipfile.open(file_name) for file_name in zipfile.namelist() if filestr in file_name][0]\n# read lines from the file. 
If one contains urlstr, print its URLs and exit\n# Only the first line containing urlstr in the opened file is processed\nfor line in (line for line in data.readlines() if urlstr in line.decode(\"latin-1\")):\n    urls = re.findall(regularex,line.decode(\"latin-1\"))\n    print([url[0] for url in urls])\n    break\nsys.exit(0)<\/pre><\/div>\n","protected":false},"excerpt":{"rendered":"<p>Hey there. It&#8217;s been a while. I&#8217;m back on Home Assistant, and I&#8217;ve been trying some stuff. Along the way, I tested the Swiss Meteo integration and it was not quite working, so I looked into how I could fix it or write one myself. So far, I&#8217;m nowhere near the end, yet I thought [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2157,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[220],"tags":[],"jetpack_sharing_enabled":true,"jetpack_featured_media_url":"https:\/\/akim.sissaoui.com\/wp-content\/uploads\/2022\/10\/python.jpg","_links":{"self":[{"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/posts\/2148"}],"collection":[{"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/comments?post=2148"}],"version-history":[{"count":20,"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/posts\/2148\/revisions"}],"predecessor-version":[{"id":2186,"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/posts\/2148\/revisions\/2186"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/media\/2157"}],"wp:attachment":[{"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/media?parent=2148"}],"wp:term
":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/categories?post=2148"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/akim.sissaoui.com\/en\/wp-json\/wp\/v2\/tags?post=2148"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}