1. Running

Download youtube-dl and add a launch.json (the VS Code debug configuration):

{
    "configurations": [
        {
            "name": "audio",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/youtube_dl",
            "console": "integratedTerminal",
            "args": ["-F", "http://www.ximalaya.com/61425525/sound/47740352/"]
        }
    ]
}

Then main.py can be called directly.

In the main.py file, take this block:

if __package__ is None and not hasattr(sys, 'frozen'):
   import os.path
   path = os.path.realpath(os.path.abspath(__file__))
   sys.path.insert(0, os.path.dirname(os.path.dirname(path)))




Replace with

import sys
import os.path

path = os.path.realpath(os.path.abspath(__file__))
sys.path.insert(0, os.path.dirname(os.path.dirname(path)))
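The effect of that snippet can be sketched outside youtube-dl with a throwaway package (mypkg is a made-up name here):

```python
import os
import sys
import tempfile

# Build a throwaway package layout:  <tmp>/mypkg/__init__.py
tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, 'mypkg')
os.makedirs(pkg)
with open(os.path.join(pkg, '__init__.py'), 'w') as f:
    f.write("NAME = 'mypkg'\n")

# What the snippet does: put the *grandparent* directory of the file on
# sys.path, so that `import mypkg` resolves to the package the file itself
# lives in, even when the file is run directly as a script.
main_file = os.path.join(pkg, '__main__.py')
path = os.path.realpath(os.path.abspath(main_file))
sys.path.insert(0, os.path.dirname(os.path.dirname(path)))

import mypkg
print(mypkg.NAME)  # mypkg
```

This is why removing the `if __package__ is None` guard makes the file debuggable directly: the path insertion always runs.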

2. Execution flow

2.1. Program entry: take the command-line arguments and act on them

In the __main__.py file,

if __name__ == '__main__':
   youtube_dl.main()

Then it goes into main() in the __init__.py file:

def main(argv=None):
    try:
        _real_main(argv)
    except DownloadError:
        sys.exit(1)
    # ...

Next, still in __init__.py, _real_main() does the real work:

def _real_main(argv=None):
    # ...
    try:
        if opts.load_info_filename is not None:
            retcode = ydl.download_with_info_file(expand_path(opts.load_info_filename))
        else:
            retcode = ydl.download(all_urls)
    except MaxDownloadsReached:
        # ...
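The shape of this entry chain can be sketched as follows (DownloadError and _real_main below are simplified stand-ins for youtube-dl's real ones):

```python
import sys

class DownloadError(Exception):
    """Stand-in for youtube_dl.utils.DownloadError."""

def _real_main(argv=None):
    # Stand-in for the real work: parse options, build a YoutubeDL, download.
    raise DownloadError('download failed')

def main(argv=None):
    # main() only maps a failed download to exit status 1.
    try:
        _real_main(argv)
    except DownloadError:
        sys.exit(1)
```

So the CLI's exit status comes from this thin wrapper, not from the download code itself.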

2.2. The YoutubeDL.py file: take the URLs and download the audio/video

class YoutubeDL(object):

    def download(self, url_list):
        # ...
        for url in url_list:
            try:
                # It also downloads the videos
                res = self.extract_info(
                    url, force_generic_extractor=self.params.get('force_generic_extractor', False))
            except UnavailableVideoError:
                # ...
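A minimal sketch of that per-URL loop (extract_info and UnavailableVideoError here are simplified stand-ins): one failed URL does not abort the rest of the batch.

```python
class UnavailableVideoError(Exception):
    """Stand-in for youtube_dl.utils.UnavailableVideoError."""

def extract_info(url):
    # Stand-in for YoutubeDL.extract_info: fail on one particular URL.
    if 'bad' in url:
        raise UnavailableVideoError(url)
    return {'url': url}

def download(url_list):
    results = []
    for url in url_list:
        try:
            results.append(extract_info(url))
        except UnavailableVideoError:
            # Report and move on to the next URL, as youtube-dl does.
            results.append(None)
    return results

print(download(['https://a', 'https://bad', 'https://b']))
# [{'url': 'https://a'}, None, {'url': 'https://b'}]
```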

extract_info, despite its name, not only extracts the relevant information; it also downloads the web pages and the audio/video. Everything happens in there:

    def extract_info(self, url, download=True, ie_key=None, extra_info={},
                     process=True, force_generic_extractor=False):

        if not ie_key and force_generic_extractor:
            ie_key = 'Generic'

        if ie_key:
            ies = [self.get_info_extractor(ie_key)]
        else:
            ies = self._ies

        for ie in ies:
            if not ie.suitable(url):
                continue
            ie = self.get_info_extractor(ie.ie_key())
            # ...
            try:
                ie_result = ie.extract(url)
            # ...

ie in the code above stands for InfoExtractor.

youtube-dl can handle many sites, and each site has a corresponding InfoExtractor file.

As the name suggests, youtube-dl started out downloading YouTube videos.

3. Finding the IE

Given a URL, how does youtube-dl find the corresponding IE?

It matches with regular expressions.

youtube-dl's extensible site support is built on each IE's URL regex.

3.1. Initialization of self._ies in the code above

3.1.1. Populating self._ies

In the YoutubeDL.py file, self._ies is populated from the __init__ entry point:

class YoutubeDL(object):
    def __init__(self, params=None, auto_init=True):
        # ...
        if auto_init:
            self.print_debug_header()
            self.add_default_info_extractors()
        # ...

The extractor classes come from gen_extractor_classes() and are appended to self._ies:

    def add_default_info_extractors(self):
        """
        Add the InfoExtractors returned by gen_extractors to the end of the list
        """
        for ie in gen_extractor_classes():
            self.add_info_extractor(ie)

    def add_info_extractor(self, ie):
        """Add an InfoExtractor object to the end of the list."""
        self._ies.append(ie)
        # ...
3.1.2. What self._ies contains

In the __init__.py of the extractor folder, _ALL_CLASSES gathers every class referenced in extractors.py whose name ends in IE:


#...
except ImportError:
    _LAZY_LOADER = False
    from .extractors import *
    _ALL_CLASSES = [
        klass
        for name, klass in globals().items()
        if name.endswith('IE') and name != 'GenericIE'
    ]
    _ALL_CLASSES.append(GenericIE)


def gen_extractor_classes():
    return _ALL_CLASSES

The order of _ALL_CLASSES matters: the first IE whose regex matches the URL wins, and GenericIE is deliberately appended last as the fallback.
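A minimal sketch of why the order matters (the IE classes here are hypothetical):

```python
import re

class FooIE(object):
    _VALID_URL = r'https?://foo\.example/.+'

    @classmethod
    def suitable(cls, url):
        return re.match(cls._VALID_URL, url) is not None

class GenericIE(object):
    @classmethod
    def suitable(cls, url):
        return True  # the catch-all: it must sit at the end of the list

_ies = [FooIE, GenericIE]

def pick_ie(url):
    # Same shape as the loop in extract_info: the first suitable IE wins.
    for ie in _ies:
        if ie.suitable(url):
            return ie

print(pick_ie('https://foo.example/sound/1').__name__)  # FooIE
print(pick_ie('https://bar.example/sound/1').__name__)  # GenericIE
```

If GenericIE came first, it would shadow every site-specific extractor.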

3.1.3. Adding a new website

To support a new website, create the corresponding IE file, then add the import in the extractors.py file, like this:

from .youtube import (
    YoutubeIE,
    YoutubeChannelIE,
    # ...
)

3.2. Finding the corresponding IE

Back in the YoutubeDL.py file shown above:

def extract_info(self, url, download=True, ie_key=None, extra_info={},
                 process=True, force_generic_extractor=False):
    # ...
    for ie in ies:
        if not ie.suitable(url):
            continue
        ie = self.get_info_extractor(ie.ie_key())
        # ...

Every IE has a class method suitable(cls, url).

Each site's IE inherits from the InfoExtractor class in the common.py file:

class InfoExtractor(object):

    @classmethod
    def suitable(cls, url):
        if '_VALID_URL_RE' not in cls.__dict__:
            cls._VALID_URL_RE = re.compile(cls._VALID_URL)
        return cls._VALID_URL_RE.match(url) is not None

If a site's IE does not override suitable, the InfoExtractor implementation above is used.

Each site's IE looks like this:

class XimalayaIE(XimalayaBaseIE):
    IE_NAME = 'ximalaya'
    IE_DESC = '喜马拉雅FM'
    _VALID_URL = r'https?://(?:www\.|m\.)?ximalaya\.com/(?P<uid>[0-9]+)/sound/(?P<id>[0-9]+)'

The base class checks cls.__dict__ so that each subclass compiles the _VALID_URL it configured exactly once, caches the compiled regex on that subclass, and then matches URLs against it.
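The per-class caching can be sketched as follows (DemoIE and its pattern are hypothetical, shaped like XimalayaIE's _VALID_URL):

```python
import re

class InfoExtractor(object):
    @classmethod
    def suitable(cls, url):
        # Check cls.__dict__ (not plain attribute lookup) so a subclass
        # never reuses a regex cached on a different class; compile once
        # and cache the compiled pattern on the subclass itself.
        if '_VALID_URL_RE' not in cls.__dict__:
            cls._VALID_URL_RE = re.compile(cls._VALID_URL)
        return cls._VALID_URL_RE.match(url) is not None

class DemoIE(InfoExtractor):
    # Hypothetical pattern, same shape as XimalayaIE's _VALID_URL.
    _VALID_URL = r'https?://(?:www\.)?demo\.example/(?P<uid>[0-9]+)/sound/(?P<id>[0-9]+)'

print(DemoIE.suitable('https://www.demo.example/61425525/sound/47740352'))  # True
print(DemoIE.suitable('https://other.example/watch?v=1'))                   # False
```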

4. Extracting the information from the web page

In the YoutubeDL.py code above, the chosen IE goes to work:

def extract_info(self, url, download=True, ie_key=None, extra_info={},
                 process=True, force_generic_extractor=False):
    # ...
    try:
        ie_result = ie.extract(url)
    # ...

It starts from the extract method of the InfoExtractor class in the common.py file:

    def extract(self, url):
        """Extracts URL information and returns it in list of dicts."""
        try:
            for _ in range(2):
                try:
                    self.initialize()
                    ie_result = self._real_extract(url)
        # ...

Then control enters the ximalaya.py file, where the class that actually does the work downloads the web page and extracts the information with regexes:

class XimalayaIE(XimalayaBaseIE):
    def _real_extract(self, url):
        # ...
        webpage = self._download_webpage(url, audio_id,
                                         note='Download sound page for %s' % audio_id,
                                         errnote='Unable to get sound page')
        # ...
        if is_m:
            audio_description = self._html_search_regex(
                r'(?s)<section\s+class=["\']content[^>]+>(.+?)</section>',
                webpage, 'audio_description', fatal=False)
        else:
            audio_description = self._html_search_regex(
                r'(?s)<div\s+class=["\']rich_intro[^>]*>(.+?</article>)',
                webpage, 'audio_description', fatal=False)
        # ...
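A toy version of what _html_search_regex does here (the helper below is a simplified stand-in, not youtube-dl's actual implementation):

```python
import re

def html_search_regex(pattern, webpage, name, fatal=False):
    # Apply a regex to a downloaded page and return the first capture
    # group; with fatal=False, a non-match returns None instead of raising.
    m = re.search(pattern, webpage)
    if m:
        return m.group(1)
    if fatal:
        raise ValueError('Unable to extract %s' % name)
    return None

webpage = '<section class="content">A sound description</section>'
desc = html_search_regex(
    r'(?s)<section\s+class=["\']content[^>]+>(.+?)</section>',
    webpage, 'audio_description', fatal=False)
print(desc)  # A sound description
```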

5. An application

On Ximalaya, a single host can have a great many tracks,

and the site has no search within one host's tracks.

This can be achieved by simply extending youtube-dl:

for example, how many Fallout episodes this host has published, and on which page they appear.

The code is very simple, see

github repo

Related

Debugging Python with launch.json, using youtube-dl as the example

Python and Make, using youtube-dl as the example

For more entry-level walkthroughs, see those posts.