I am scraping TripAdvisor. My current problem is getting the hotel stars for a given hotel (not the average user rating [bubbles], but the hotel class rating); later I will run into the problem of reviews hidden behind "Read more". Example page: https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html Luckily I know where to find both pieces of data. They are in the page, inside this tag:

<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[....

Search in https://pastebin.com/Ww3ugxFR for "The view was fantastic!!" (an example of hidden text) or for "star": to find the hotel stars.


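Here is my example of how it is not working. I need to learn how to tell the CSS selector (or another tool) to address this particular script and how to extract data from it. In this example I simply load the response and do a plain pattern search. I suppose one could also load it as JSON and extract from there, but I have not worked with JSON enough to be sure: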

hotel_CONTEXT = response.css("script text=window.__WEB_CONTEXT ::attr(pageManifest)).extract()

pattern_hotelstar = re.compile(r'star":\["\d')
matches_hotelstar = pattern_hotelstar.findall(hotel_CONTEXT)
Hotel_stars = str(matches_hotelstar).split('"')[2].split("'")[0]
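
One way the pattern search above could be made to work is to run the regex over the raw page text and use a capture group instead of splitting the stringified match list. This is only a sketch: the helper name extract_hotel_star and the sample string are made up for illustration, and the "star":["4 shape is assumed from the pattern above, not checked against the live page. In a Scrapy callback the page text would be response.text.

```python
import re

# Same idea as the pattern above, but with a capture group around the digit,
# so no string splitting is needed afterwards.
STAR_PATTERN = re.compile(r'star":\["(\d)')


def extract_hotel_star(page_text):
    """Return the first hotel-class digit found in page_text, or None."""
    match = STAR_PATTERN.search(page_text)
    return match.group(1) if match else None


# Hypothetical fragment, shaped only to fit the pattern above.
sample = '<script>window.__WEB_CONTEXT__={pageManifest:{"star":["4"]}};</script>'
print(extract_hotel_star(sample))  # 4
```

Returning None when nothing matches avoids the IndexError that the split-based version would raise on a page without a star rating.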

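Apparently what I want to achieve can be done with BeautifulSoup (Scraping a website with data hidden under "Read more" https://stackoverflow.com/questions/56682371/scraping-a-website-with-data-hidden-under-read-more... but I got a JSON error when I tried to replicate it), but in general I would prefer a solution that uses Scrapy.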

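Andrej Kesely provided an excellent solution to my problem! His code works very well and I would like to understand it fully. Here is what I think I understand from the code, and where I just don't get his magic ;):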

data = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)

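Andrej searches the whole html_text for a pattern that starts with "window.__WEB...", extends it over any character (.) any number of times (*) in a non-greedy way (?), and ends with };. I don't understand why the } is inside the capture group and not simply placed at the end, since the script ends with };. (How did Andrej find this out? Is this a general pattern for such scripts, or did he print the whole page and look it up?) I also don't understand why it has to be non-greedy. group(1) selects everything inside the first pair of parentheses, leaving window.__WEB_CONTEXT__= out. I guess this has to do with loading the result with json. The same goes for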

data = data.replace('pageManifest', '"pageManifest"')   
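
A small experiment on a made-up string may help with the questions above. It suggests three things: the } sits inside the group so that the captured text still ends with the closing brace of the object, while the ; stays outside; the non-greedy .*? stops at the first }; instead of running on to the last one in the page; and the replace is needed because JSON only accepts quoted keys. The sample html_text below is hypothetical and much simpler than the real page.

```python
import json
import re

# Made-up page fragment: the assignment we want, followed by another statement.
html_text = ('<script>window.__WEB_CONTEXT__={pageManifest:{"assets":[]}};'
             'other={"b":2};</script>')

# Non-greedy: stops at the first "};" after the assignment.
lazy = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)
# Greedy: runs on to the last "};" and swallows the second statement too.
greedy = re.search(r'window\.__WEB_CONTEXT__=(.*});', html_text).group(1)

print(lazy)    # {pageManifest:{"assets":[]}}
print(greedy)  # {pageManifest:{"assets":[]}};other={"b":2}

# The captured text is a JavaScript object literal, not JSON: the key
# pageManifest has no quotes, so json.loads rejects it as it stands.
try:
    json.loads(lazy)
    bare_key_ok = True
except json.JSONDecodeError:
    bare_key_ok = False
print(bare_key_ok)  # False

# Quoting the one bare key turns the text into valid JSON.
data = json.loads(lazy.replace('pageManifest', '"pageManifest"', 1))
print(data)  # {'pageManifest': {'assets': []}}
```

Whether }; is a safe end marker for every such script is not something this sketch can show; printing the page source and checking where the assignment ends is the more reliable way to find out.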

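Then Andrej creates a function called traverse, which is later fed with the parsed data. In the if statement, Andrej checks whether the input is a dictionary. In the next step, he loops over the keys (k) and values (v) of the dictionary. If k == 'reviews', he yields the value. If not, "yield from" the function? I am also lost at the elif and the check whether val is a list... In general, what is the output v of the function? And how would I change the function to include more dictionary keys to look for, since the else is already occupied by this yield?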

def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)
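
On the question of looking for more keys: one possible variation, sketched here under assumptions, is to pass a set of wanted keys and yield the key together with its value, so the else branch can stay as the recursive step. The second parameter wanted and the sample data (including the names section and items and the shape of the star value) are hypothetical and not taken from the real page.

```python
def traverse(val, wanted):
    """Walk nested dicts and lists, yielding (key, value) for every wanted key."""
    if isinstance(val, dict):
        for k, v in val.items():
            if k in wanted:
                # Found one of the keys we care about: hand it back as is.
                yield k, v
            else:
                # Any other key: keep searching inside its value.
                yield from traverse(v, wanted)
    elif isinstance(val, list):
        for item in val:
            yield from traverse(item, wanted)
    # Strings, numbers, None and so on cannot contain keys, so nothing is yielded.


# Hypothetical nested structure standing in for the parsed page data.
sample = {
    "pageManifest": {"assets": []},
    "section": [
        {"items": {"star": ["4"], "reviews": [{"title": "Just WOW!!", "rating": 5}]}},
    ],
}

for key, value in traverse(sample, {"star", "reviews"}):
    print(key, value)
```

Because the key comes back with the value, the calling loop can decide what to do with each kind of match, for example storing the star value once and looping over the reviews list.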

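Here Andrej loops over traverse(data) (a dictionary, right?), since we get several reviews on this page. In the nested loop, Andrej gives the name r to each dictionary of a single review and retrieves the stored values via dictionary_name["key"]. Am I right?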

for reviews in traverse(data):
  for r in reviews:
    print('Rating:', r['rating'])
    print('-' * 80)
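
A toy run may make the two loops easier to follow. In this sketch data is the dictionary, traverse(data) is a generator, each item it yields is the list stored under a 'reviews' key, and each r is one review dictionary inside that list. The sample data and the names section_a, section_b and nested are made up for illustration.

```python
def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)


# Hypothetical data: two 'reviews' lists buried at different depths.
data = {
    'section_a': {'reviews': [{'rating': 5}, {'rating': 4}]},
    'section_b': [{'nested': {'reviews': [{'rating': 3}]}}],
}

ratings = []
for reviews in traverse(data):   # each `reviews` is a list of review dicts
    for r in reviews:            # each `r` is a single review dict
        ratings.append(r['rating'])

print(ratings)  # [5, 4, 3]
```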



import re
import json
import requests

url = 'https://www.tripadvisor.com.ph/Hotel_Review-g8762949-d1085145-Reviews-El_Rio_y_Mar_Resort-San_Jose_Coron_Busuanga_Island_Palawan_Province_Mimaropa.html'
html_text = requests.get(url).text

data = re.search(r'window\.__WEB_CONTEXT__=(.*?});', html_text).group(1)
data = data.replace('pageManifest', '"pageManifest"')
data = json.loads(data)

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

def traverse(val):
    if isinstance(val, dict):
        for k, v in val.items():
            if k == 'reviews':
                yield v
            else:
                yield from traverse(v)
    elif isinstance(val, list):
        for v in val:
            yield from traverse(v)

for reviews in traverse(data):
    for r in reviews:
        print('Rating:', r['rating'])
        print('-' * 80)


Just WOW!!
Okay, I didn't know this resort would be mainly couples and honeymooners as I went with 2 friends. We weren't uncomfortable though and met lots of nice people from across the globe and 1 couple from the US. This resort can only be reached by boat, so it is very secluded. We stayed in bungalow #2. It was rustic, but beautiful and right on the beach. Everyone who worked in the resort was friendly and very accommodating. We ate most meals at the resort which was pretty good. We had happy hour at the pier bar every day which was from 4-7pm. They had half off certain drinks and food specials. It was very nice relaxing, enjoying a great drink and watching the sunset. You can snorkel right in front of the resort which was so cool! We snorkeled for 2 hours!! The best is right by the floating bungalows where they did massages. Speaking of massages....OMG! It was heaven!! Very affordable and different. When you lie face down, you look into a cut out in the floor, so you can view the water and fish swimming by. I loved it!! We did an island hopping tour and it was not an issue coming from this resort. When we got into Coron town and passed by all the hotels in that area, we were so glad and thankful we chose El Rio Y Mar. Coron Town is very dirty, dusty, full of young backpackers and the hotels look subpar. It's fine if you're on a budget. I get it, but us girls/mom/friends wanted to treat ourselves. That we did! One day we went on a guided hike to the top of a closeby mountain. The view was fantastic!! I highly recommend this resort and would definitely return.
Rating: 5
Amazing staff
The best customer experience we ever had! the school of fishes within the resort are amazing, very quite, very clean and well maintained rooms and outdoor surroundings. Our island trip organized by them is one of the best experience we had in our Coron trip. 
Kudos to El Rio highly recommended
Rating: 5

...and so on.

